<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Using Simple PID Controllers to Prevent and Mitigate Faults in Scientific Workflows</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rafael Ferreira da Silva</string-name>
          <email>rafsilva@isi.edu</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rosa Filgueira</string-name>
          <email>rosa@bgs.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ewa Deelman</string-name>
          <email>deelman@isi.edu</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Erola Pairo-Castineira</string-name>
          <email>Erola.Pairo-Castineira@igmm.ed.ac.uk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ian Michael Overton</string-name>
          <email>ian.overton@ed.ac.uk</email>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Malcolm Atkinson</string-name>
          <email>malcolm.atkinson@ed.ac.uk</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>British Geological Survey, Lyell Centre</institution>
          ,
          <addr-line>Edinburgh EH14 4AP</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>MRC Institute of Genetics and Molecular Medicine, University of Edinburgh</institution>
          ,
          <addr-line>Edinburgh</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>School of Informatics, University of Edinburgh</institution>
          ,
          <addr-line>Edinburgh EH8 9LE</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University of Southern California, Information Sciences Institute</institution>
          ,
          <addr-line>Marina Del Rey, CA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Usher Institute of Population Health Sciences and Informatics, University of Edinburgh</institution>
          ,
          <addr-line>Edinburgh</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <fpage>15</fpage>
      <lpage>24</lpage>
      <abstract>
        <p />
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>ABSTRACT</title>
      <p>Scientific workflows have become mainstream for conducting
large-scale scientific research. As a result, many workflow
applications and Workflow Management Systems (WMSs)
have been developed as part of the cyberinfrastructure to
allow scientists to execute their applications seamlessly on
a range of distributed platforms. In spite of many success
stories, a key challenge for running workflows in distributed
systems is failure prediction, detection, and recovery. In
this paper, we propose an approach that uses control theory
developed as part of autonomic computing to predict
failures before they happen, and to mitigate them when possible.
The proposed approach applies the
proportional-integral-derivative (PID) controller control loop
mechanism, which is widely used in industrial control systems, to
mitigate faults by adjusting the inputs of the controller. The
PID controller aims at detecting the possibility of a fault far
enough in advance so that an action can be performed to
prevent it from happening. To demonstrate the feasibility of
the approach, we tackle two common execution faults of the
Big Data era: data storage overload and memory overflow.
We define, implement, and evaluate simple PID controllers
to autonomously manage data and memory usage of a
bioinformatics workflow that consumes/produces over 4.4 TB of
data, and requires over 24 TB of memory to run all tasks
concurrently. Experimental results indicate that workflow
executions may significantly benefit from PID controllers,
in particular under online and unknown conditions.
Simulation results show that nearly-optimal executions (slowdown
of 1.01) can be attained when using our proposed method,
and that faults are detected and mitigated far in advance of their
occurrence.</p>
      <p>Keywords: Scientific workflows, fault detection and handling,
autonomic computing</p>
    </sec>
    <sec id="sec-2">
      <title>1. INTRODUCTION</title>
      <p>Scientists want to extract the maximum information out
of their data, which are often obtained from scientific
instruments and processed in large-scale distributed systems.</p>
      <p>
        Scientific workflows are a mainstream solution for processing
large-scale scientific computations in distributed systems,
and have supported traditional and breakthrough research
across several domains [
        <xref ref-type="bibr" rid="ref35">35</xref>
        ]. In spite of impressive
achievements to date, failure prediction, detection, and recovery are
still a major challenge of workload management in distributed
systems, both at the application and resource levels. Failures
affect the turnaround time of the applications, and that of
the umbrella analysis, and therefore the productivity of the
scientists who depend on the power of distributed
computing to do their work.
      </p>
      <p>In this work, we investigate how the
proportional-integral-derivative (PID) controller control loop
mechanism, which is widely used in industrial systems, can be
applied to predict and prevent failures in end-to-end workflow
executions across distributed, heterogeneous computational
environments. The basic idea behind a PID controller is to
read data from a sensor, then compute the desired
actuator output by calculating the proportional (P), integral (I), and
derivative (D) responses and summing those three
components to compute the output. The components can
often be interpreted as the present error (P), the
accumulation of past errors (I), and a prediction of future errors
(D) based on the current rate of change. The main advantage
of using a PID controller is that the control loop
mechanism progressively monitors the evolution of the workflow
execution, detecting possible faults before they occur, and,
when needed, performing actions that lead the execution to a
steady state.</p>
      <p>The main contributions of this paper include:
1. The evaluation of PID controllers to prevent and
mitigate two major problems of the Big Data era: data
storage overload and memory overflow;
2. The characterization of a bioinformatics workflow, which
consumes/produces over 4.4 TB of data, and requires
over 24 TB of memory;
3. An experimental evaluation via simulation to
demonstrate the feasibility of the proposed approach using
simple PID controllers; and
4. A performance optimization study to tune the
parameters of the control loop to provide nearly-optimal
workflow executions, where faults are detected and handled
far in advance of their occurrence.</p>
    </sec>
    <sec id="sec-3">
      <title>2. RELATED WORK</title>
      <p>
        Several offline strategies and techniques were developed
to detect and handle failures during scientific workflow
executions [
        <xref ref-type="bibr" rid="ref24 ref27 ref28 ref3 ref36 ref5">3, 5, 24, 27, 28, 36</xref>
        ]. Autonomic online methods were
also proposed to cope with workflow failures at runtime,
for example by providing checkpointing [
        <xref ref-type="bibr" rid="ref20 ref25 ref30">20, 25, 30</xref>
        ],
provenance [
        <xref ref-type="bibr" rid="ref13 ref25">13, 25</xref>
        ], task resubmission [
        <xref ref-type="bibr" rid="ref10 ref31">10, 31</xref>
        ], and task
replication [
        <xref ref-type="bibr" rid="ref5 ref8">5, 8</xref>
], among others. However, these systems do not
aim to prevent faults, but to mitigate them; and although task
replication may increase the probability of having a
successful execution on another computing resource, it should
be used sparingly to avoid overloading the execution
platform [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. These systems also make strong assumptions about
resource and application characteristics. Although several
works address task requirement estimation based on
provenance data [
        <xref ref-type="bibr" rid="ref11 ref18 ref22 ref29">11, 18, 22, 29</xref>
        ], accurate estimations are still
challenging, and may be specific to a certain type of application.
In [
        <xref ref-type="bibr" rid="ref4">4</xref>
], a prediction algorithm based on machine learning
(a Naive Bayes classifier) is proposed to identify faults before
they occur, and to apply preventive actions to mitigate the
faults. Experimental results show that faults can be
predicted with up to 94% accuracy; however, the approach is tied
to a small set of applications, and it is assumed that the
application requirements do not change over time. In previous
work, we proposed an autonomic method described as a
MAPE-K loop to cope with faults in online, non-clairvoyant workflow
executions on grids [
        <xref ref-type="bibr" rid="ref15 ref17">15, 17</xref>
        ], where unpredictability is
addressed by using a-priori knowledge extracted from
execution traces to identify severity levels of faults and apply
a specific set of actions. This was the first work on
self-healing of workflow executions under online and unknown
conditions, and experimental results on a real platform showed an
important improvement of the QoS delivered by the system.
However, the method does not prevent faults from
happening (actions are performed once faults are detected). In [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ],
a machine learning approach based on inductive logic
programming is proposed for fault prediction and diagnosis in
grids. This approach is limited to small-scale applications
and a few parameters: the number of rules may grow
exponentially as the number of tasks in a workflow or the
number of accounted parameters increases.
      </p>
      <p>To the best of our knowledge, this is the first work that
uses PID controllers to mitigate faults in scientific workflow
executions under online and unknown conditions.</p>
    </sec>
    <sec id="sec-4">
      <title>3. PID CONTROLLERS</title>
      <p>
        The keystone component of the proposed process is the
proportional-integral-derivative (PID) controller [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ]
control loop mechanism, which is widely used in industrial
control systems to mitigate faults by adjusting the process
control inputs. Examples of such systems are those in which
the temperature, pressure, or flow rate needs to be
controlled. In such scenarios, the PID controller aims at
detecting the possibility of a fault far enough in advance that
an action can be performed to prevent it from happening.
Figure 1 shows the general PID control system loop. The
setpoint is the desired or command value for the process
variable. The control system algorithm uses the difference
between the output (process variable) and the setpoint to
determine the desired actuator input to drive the system.
      </p>
      <p>The control system performance is measured by applying a
step function as the setpoint command variable, and observing the
response of the process variable. The response is quantified
by measuring defined waveform characteristics, as shown in
Figure 2. Rise time is the amount of time the system takes
to go from about 10% to 90% of the steady-state, or final,
value. Percent overshoot is the amount by which the process
variable surpasses the final value, expressed as a
percentage of the final value. Settling time is the time required for
the process variable to settle to within a certain
percentage (commonly 5%) of the final value. Steady-state error is
the final difference between the process variable and the
setpoint. Dead time is the delay between when a process variable
changes and when that change can be observed.</p>
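      <p>To make these waveform characteristics concrete, the following
Python sketch (our illustration; the function names and the sampled
time/value inputs are our own assumptions, not artifacts of the paper)
computes rise time, percent overshoot, and settling time from a sampled
step response:</p>
      <preformat>
# Waveform metrics of Figure 2, computed from a sampled step response.
# t: sample times (s); y: process-variable samples; final_value: the
# steady-state value the response converges to. Illustrative only.
def rise_time(t, y, final_value):
    """Time to go from about 10% to 90% of the final value."""
    t10 = next(ti for ti, yi in zip(t, y) if yi &gt;= 0.1 * final_value)
    t90 = next(ti for ti, yi in zip(t, y) if yi &gt;= 0.9 * final_value)
    return t90 - t10

def percent_overshoot(y, final_value):
    """Amount by which the response surpasses the final value, in %."""
    return 100.0 * (max(y) - final_value) / final_value

def settling_time(t, y, final_value, band=0.05):
    """Time after which the response stays within a +/-5% band."""
    outside = [ti for ti, yi in zip(t, y)
               if abs(yi - final_value) &gt; band * final_value]
    return outside[-1] if outside else t[0]
      </preformat>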
      <p>Process variables (outputs) are determined by fault-specific
metrics quantified online. The setpoint is constant and
defined as 1. The output of the PID controller is an input
value for a Curative Agent, which determines whether an
action should be performed (Figure 3). Negative input
values mean the control system is rising too fast and may tend
toward the overshoot state (i.e., a faulty state); therefore,
preventive or corrective actions should be performed. Actions may
include task preemption, task resubmission, task clustering,
task cleanup, storage management, etc. In contrast, positive
input values mean that the control system is smoothly
rising to the steady state. The control signal u(t) (output) is
defined as follows:</p>
      <p>u(t) = Kp e(t) + Ki \int_0^t e(t) dt + Kd de(t)/dt,    (1)</p>
      <p>
        where Kp is the proportional gain constant, Ki is the integral
gain constant, Kd is the derivative gain constant, and e is
the error, defined as the difference between the setpoint and
the process variable value.
Tuning the proportional (Kp), integral (Ki), and
derivative (Kd) gain constants is challenging and a research topic
in itself. Therefore, in this paper we initially assume Kp =
Ki = Kd = 1 for the sake of simplicity and to demonstrate
the feasibility of the process, and then we use the
Ziegler-Nichols closed-loop method [
        <xref ref-type="bibr" rid="ref37">37</xref>
        ] for tuning the PID
controllers (see Section 6).
      </p>
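      <p>As an illustration, the following minimal Python sketch (our own,
not code from any workflow management system) implements the discrete
form of Equation 1; the integral term accumulates past errors and the
derivative term uses the difference between consecutive errors:</p>
      <preformat>
class PIDController:
    """Minimal discrete PID controller implementing Equation (1)."""

    def __init__(self, kp=1.0, ki=1.0, kd=1.0, setpoint=1.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint     # desired process-variable value
        self.integral = 0.0          # accumulation of past errors (I)
        self.prev_error = None       # last error, for the derivative (D)

    def step(self, process_variable, dt=1.0):
        """Return the control signal u(t) for one monitoring interval."""
        error = self.setpoint - process_variable
        self.integral += error * dt
        derivative = (0.0 if self.prev_error is None
                      else (error - self.prev_error) / dt)
        self.prev_error = error
        # u(t) = Kp e(t) + Ki * integral of e + Kd * de/dt
        return (self.kp * error + self.ki * self.integral
                + self.kd * derivative)
      </preformat>
      <p>With Kp = Ki = Kd = 1 and setpoint = 1, as assumed initially in this
paper, a call to step() simply sums the current error, its running
total, and its most recent change.</p>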
    </sec>
    <sec id="sec-5">
      <title>4. DEFINING PID CONTROLLERS</title>
      <p>In our proposed approach, a PID controller is defined and
used for each possible future fault identified from workload
traces (historical data). In some cases, a particular type of
fault cannot be modeled as a full PID controller. For
example, some faults cannot be predicted far in advance
(e.g., unavailability of resources due to a power cut). In such
cases, a PI (proportional-integral) controller can be defined
and deployed. In production systems, a large number of
controllers may be defined and used to control, for example,
CPU utilization, network bandwidth, etc. In this paper, we
demonstrate the feasibility of the use of PID controllers by
tackling two common issues of workflow executions: data
and memory overflow.</p>
    </sec>
    <sec id="sec-5a">
      <title>4.1 Workflow Data Footprint and Management</title>
      <p>
        In the era of Big Data Science, applications are producing
and consuming ever-growing data sets. A run of scientific
workflows that manipulate these data sets may lead the
system to an out-of-disk-space fault if no mechanisms are in
place to control how the available storage is used. To prevent
this, data cleanup tasks are often automatically inserted into
the workflow by the workflow management system [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ], or
the number of concurrent task executions is limited to
prevent data usage overflow. Cleanup tasks remove data sets
that are no longer needed by downstream tasks, but
they nevertheless add an important overhead to the workflow
execution [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>PID Controller. The controller process variable (output)
is defined as the ratio of the estimated disk space required
by the current tasks in execution to the actual available disk
space. The system is in a non-steady state if the total
amount of disk space consumed is above (overshoots) a
predefined threshold (the setpoint), or if the amount of used disk space
is below the setpoint. The proportional (P) response is
computed as the error between the setpoint and the actual used
disk space; the integral (I) response is computed from the
sum of the disk usage errors (the cumulative value of the
proportional responses); and the derivative (D) response is
computed as the difference between the current and the previous
disk overflow (or underutilization) error values.</p>
      <p>Corrective Actions. The output of the PID controller
(control signal u(t), Equation 1) indicates whether the
system is in a non-steady state. Negative values indicate that
the current disk usage is above the threshold of the minimum
required available disk space (a safety measure to avoid an
unrecoverable state). In contrast, positive values indicate
that the currently running tasks do not maximize disk usage.
For values of u(t) &lt; 0, (1) data cleanup tasks can be
triggered to remove unused intermediate data (adding cleanup
tasks may imply rearranging the priority of all tasks in the
queue), or (2) tasks can be preempted if data cannot be
removed; the inability to clean up data may lead
the execution to an unrecoverable state, and thereby to a
failed execution. Otherwise (for u(t) &gt; 0), the number of
concurrent task executions may be increased.</p>
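      <p>One iteration of the disk-usage control loop could then look like
the sketch below, reusing the PIDController class from Section 3. The
wms_* helpers are hypothetical stand-ins for whatever cleanup,
preemption, and scheduling operations a workflow management system
actually exposes; they are not part of any existing system:</p>
      <preformat>
# Disk-usage control step (Section 4.1). The process variable is the
# ratio of the disk space required by tasks in execution to the
# available disk space; the 0.8 setpoint matches Section 5.3.
disk_controller = PIDController(kp=1.0, ki=1.0, kd=1.0, setpoint=0.8)

def disk_control_step(estimated_required_gb, available_gb):
    process_variable = estimated_required_gb / available_gb
    u = disk_controller.step(process_variable)
    if u &lt; 0:
        # Heading toward a disk overflow (faulty) state.
        if not wms_trigger_cleanup_tasks():   # hypothetical helper
            wms_preempt_tasks()               # hypothetical helper
    elif u &gt; 0:
        # Disk is underutilized: increase task concurrency.
        wms_schedule_more_tasks()             # hypothetical helper
    return u
      </preformat>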
    </sec>
    <sec id="sec-5b">
      <title>4.2 Workflow Memory Usage and Management</title>
      <p>
Large scientific computing applications rely on complex
workflows to analyze large volumes of data. These tasks often
run on HPC resources across thousands of CPU cores,
simultaneously performing data accesses, data
movements, and computation dominated by memory-intensive
operations (e.g., reading a large volume of data from disk,
decompressing massive amounts of data in memory, or
performing a complex calculation that generates large data sets).
The performance of those memory-intensive
operations is quite often limited by the memory capacity of the
resource where the application is executed. Therefore,
if those operations overflow the physical memory limit, the result can be
application performance degradation or application
failure. Typically, the end user is responsible for optimizing
the application, modifying the code if needed to
comply with the amount of memory that can be used on
that resource. This work addresses the memory challenge by
proposing an in-situ analysis of memory usage, to adapt
the number of concurrent task executions according to the
memory usage required by an application at runtime.</p>
      <p>PID Controller. The controller process variable (output)
is defined as the ratio of the estimated total peak memory
usage required by current tasks in execution, and the actual
available memory. The system is in a non-steady state if
the amount of memory available is below the setpoint, or if
the current available memory is above it. The proportional
(P) response is computed as the error between the
memory consumption setpoint value and the actual memory
usage; the integral (I) response is computed from cumulative
proportional responses (previous memory usage errors); and
the derivative (D) response is computed as the difference
between the current and the previous memory overflow (or
underutilization) error values.</p>
      <p>Corrective Actions. Negative values of the control signal
u(t) indicate that the ensemble of running tasks is leading
the system to an overflow state; thus, some tasks should be
preempted to prevent the system from running out of memory. For
positive u(t) values, the memory consumption of the currently
running tasks is below the predefined memory consumption
setpoint. Therefore, the workflow management system may
spawn additional tasks for concurrent execution.</p>
    </sec>
    <sec id="sec-6">
      <title>5. EXPERIMENTAL EVALUATION</title>
    </sec>
    <sec id="sec-7">
      <title>5.1 Scientific Workflow Application</title>
      <p>
        The 1000 genomes project provides a reference for human
variation, having reconstructed the genomes of 2,504
individuals across 26 different populations [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. The test case
used in this work identifies mutational overlaps using data
from the 1000 genomes project in order to provide a null
distribution for rigorous statistical evaluation of potential
disease-related mutations. This test case (Figure 4) has been
implemented as a Pegasus [
        <xref ref-type="bibr" rid="ref14 ref2">2, 14</xref>
        ] workflow, and is composed
of five different tasks:
Individuals. This task fetches and parses the Phase 3
data [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] from the 1000 genomes project per chromosome.
These files list all of the single nucleotide polymorphism (SNP)
variants in that chromosome and which individuals have
each one. An individuals task creates, for each individual, an output
file of rs numbers where the individual has mutations
on both alleles.</p>
      <p>
        Populations. The 1000 genomes project has 26 different
populations from many different locations worldwide [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
The populations task fetches and parses five super
populations (African, Mixed American, East Asian, European,
and South Asian), and a set of all individuals.
      </p>
      <p>Sifting. This task computes the SIFT scores of all of the
SNP variants, as computed by the Variant Effect Predictor
(VEP). SIFT is a sequence homology-based tool that Sorts
Intolerant From Tolerant amino acid substitutions, and
predicts whether an amino acid substitution in a protein will
have a phenotypic effect. VEP determines the effect of
individual variants on genes, transcripts, and protein sequences,
as well as regulatory regions. For each chromosome, the
sifting task processes the corresponding VEP output, and selects
only the SNP variants that have a SIFT score.
Pair Overlap Mutations. This task measures the overlap
in mutations (SNPs) among pairs of individuals.
Considering two individuals, if both individuals have a given SNP,
then they have a mutation overlap. It performs several
correlations involving different numbers of pairs of individuals and
different numbers of SNP variants (only the SNP variants
with a score of less than 0.05, and all the SNP variants); and
computes an array (per chromosome, population, and selected SIFT
level) that has as many entries as individuals;
each entry contains the list of SNP variants per individual
according to the SIFT score.</p>
      <p>Frequency Overlap Mutations. This task calculates the
frequency of overlapping mutations across n subsamples of j
individuals. For each run, the task randomly selects a group
of 26 individuals from this array and computes the number
of overlapping mutations among the group. Then, the
task computes the frequency of mutations that
have the same number of overlapping mutations.</p>
    </sec>
    <sec id="sec-8">
      <title>5.2 Workflow Characterization</title>
      <p>
        We profiled the 1000 genome sequencing analysis
workflow using the Kickstart [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] profiling tool. Kickstart
monitors and records task execution in scientific workflows (e.g.,
process I/O, runtime, memory usage, and CPU utilization).
Runs were conducted on Eddie Mark 3, the third
iteration of the University of Edinburgh's compute cluster.
The cluster is composed of 4,000+ cores with up to 2 TB
of memory. For the characterization experiments,
we used three types of nodes, depending on the size of
memory required for each task:
1. 1 large node with 2 TB RAM, 32 cores, Intel Xeon
Processor E5-2630 v3 (2.4 GHz), for running the
individuals tasks;
2. 1 intermediate node with 192 GB RAM, 16 cores, Intel
Xeon Processor E5-2630 v3 (2.4 GHz), for running
the sifting tasks;
3. 2 standard nodes with 64 GB RAM, 32 cores, Intel
Xeon Processor E5-2630 v3 (2.4 GHz), for running
the remaining tasks.
      </p>
      <p>Table 1 shows the execution profile of the workflow. Most
of the workflow execution time is spent in the
individuals tasks. These tasks are on the critical path of the
workflow due to their high demand for disk (174 GB on average per
task) and memory (411 GB on average per task). The total
workflow data footprint is about 4.4 TB. Although the large
node provides 2 TB of RAM and 32 cores, we would only be
able to run up to 4 concurrent tasks per node (five tasks at
411 GB each would require over 2 TB of memory). On Eddie Mark
3, the standard disk quota is 2 GB per user, and 200 GB per
group. Since this quota would not suffice to run all tasks
of the 1000 genome sequencing analysis workflow (even if
all tasks ran sequentially), we had a special arrangement
to increase our quota to 500 GB. Note that this increased
quota allows us to barely run 3 concurrent individuals tasks
on the large node, and some of the remaining tasks on smaller
nodes. Therefore, data and memory management are crucial
to perform a successful run of the workflow, while increasing
user satisfaction.</p>
    </sec>
    <sec id="sec-9">
      <title>5.3 Experiment Conditions</title>
      <p>
        The experiments use trace-based simulation. Since most
workflow simulators are event-based [
        <xref ref-type="bibr" rid="ref16 ref6">6, 16</xref>
        ], we developed an
activity-based simulator to simulate every time slice of the
PID controllers' behavior (the simulator is available online [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ]).
The simulator provides support for task scheduling and
resource provisioning at the workflow level. The simulated
computing environment represents the three nodes of the
Eddie Mark 3 cluster described in Section 5.2 (80 CPU
cores in total). Additionally, we assume a shared network file system
among the nodes with a total capacity of 500 GB.
      </p>
      <p>
        We use an FCFS policy with task preemption and
backfill for task scheduling: tasks submitted at the same time
are randomly chosen, and preempted tasks return to the top
of the queue. To avoid unrecoverable faults due to running out
of disk space, we implemented a data cleanup mechanism
to remove data that are no longer required by downstream
tasks [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ]. Data cleanup tasks are only triggered if the
maximum storage capacity is reached. In this case, all running
tasks are preempted, the data cleanup task is executed, and
the workflow resumes its execution. Note that this
mechanism may add a significant overhead to the workflow
execution.
      </p>
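      <p>In pseudocode, this cleanup mechanism behaves as in the sketch
below (a simplification under our own naming; the helper functions
stand in for the simulator's internals):</p>
      <preformat>
# Cleanup trigger used by the simulated environment (Section 5.3):
# cleanup runs only once the maximum storage capacity is reached.
def check_storage(used_gb, capacity_gb=500):
    if used_gb &gt;= capacity_gb:
        running = preempt_all_running_tasks()  # back to the top of the queue
        run_cleanup_task()     # remove data no longer needed downstream
        resume_tasks(running)  # workflow resumes; note the added overhead
      </preformat>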
      <p>The goal of this experiment is to ensure that correctly
defined executions complete, that performance is acceptable,
and that possible-future faults are quickly detected and
automatically handled before they lead the workflow execution
to an unrecoverable state (measured by the number of data
cleanup tasks used). Therefore, we do not attempt to
optimize task preemption (e.g., which criteria should be used to
select tasks for removal, or whether to perform checkpointing), since our
goal is to demonstrate the feasibility of the approach with
simple use case scenarios.</p>
      <p>Table 1: Execution profile of the workflow (per task type: runtime
mean (s) and std. dev., data footprint mean (GB) and std. dev., and
memory peak mean (GB) and std. dev.).</p>
      <p>
        Composing PID Controllers. The response variable of
the control loop that leads the system to a setpoint (or
within a steady-state error) is defined as a waveform, which
can be composed of overshoots or underutilization of the
system. In order to accommodate overshoots, we arbitrarily
define our setpoint as 80% of the maximum total capacity (for
both storage and memory usage), and a steady-state error
of 5%. For this experiment we assume Kp = Ki = Kd = 1
to demonstrate the feasibility of the approach regardless of the
use of tuning methods. A single PID controller ud is used
to manage disk usage (the shared network file system), while
an independent memory controller unm is deployed on each
computing node n. The controller input value indicates the
amount of disk space or memory that should be consumed by
tasks. If the input value is positive, more tasks are scheduled
(otherwise, tasks are preempted). When managing a set of
controllers, it is important to ensure that an action performed
by one controller does not counteract an action performed by
another. In this paper, the decision on the number of
tasks to be scheduled/preempted is computed as the min
between the response value of the unique disk usage PID
controller and the memory PID controller per resource, i.e.,
min(ud, unm); a sketch of this composition is shown at the end of
this subsection. The control loop process then uses the mean
values presented in Table 1 to estimate the number of tasks
to be scheduled/preempted. Note that due to the high
values of standard deviation, estimations may not be accurate.
Task characteristic estimation is beyond the scope of this
work, and sophisticated methods that provide accurate
estimates can be found in [
        <xref ref-type="bibr" rid="ref11 ref18 ref22 ref29">11, 18, 22, 29</xref>
        ]. However, this work
intends to demonstrate that even with inaccurate
estimation methods, PID controllers yield good results.
      </p>
      <p>
        Reference Workflow Execution. In order to measure the
efficiency of our online method under online and unknown
conditions, we compare the workflow execution performance
(in terms of the turnaround time to execute all tasks) to a
reference workflow, computed offline under known
conditions, i.e., all requirements (e.g., runtime, disk, memory)
are accurate and known in advance. We performed several
runs of the reference workflow, which yielded an average
makespan of 382,887.7 s (approximately 106 h, standard deviation 5%).
      </p>
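      <p>The sketch below illustrates the composition rule described
above (this is our reading of it; the helper name and the rounding
choice are ours): the number of tasks to schedule or preempt on node n
is derived from min(ud, unm) and the mean per-task requirements of
Table 1.</p>
      <preformat>
# Compose the global disk controller with the memory controller of one
# node: the smaller (more conservative) response wins, so one
# controller's action cannot counteract the other's.
def tasks_to_adjust(u_disk, u_mem_node, mean_task_disk_gb, mean_task_mem_gb):
    u = min(u_disk, u_mem_node)  # resource amount to consume (or free)
    # Translate the resource amount into a task count using the mean
    # per-task requirements of Table 1 (estimates may be inaccurate).
    per_task = mean_task_disk_gb if u == u_disk else mean_task_mem_gb
    n = int(u / per_task)
    return n  # n &gt; 0: schedule n tasks; n &lt; 0: preempt -n tasks
      </preformat>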
    </sec>
    <sec id="sec-10">
      <title>5.4 Experimental Results and Discussion</title>
      <p>We have conducted workflow runs with three different
types of controllers: (P) only the proportional component
is evaluated: Kp = 1 and Ki = Kd = 0; (PI) the
proportional and integral components are enabled: Kp = Ki =
1 and Kd = 0; and (PID) all components are activated:
Kp = Ki = Kd = 1. The reference workflow execution is
reported as Reference. We have performed several runs of
each configuration to produce results with statistical
significance (errors below 5%).</p>
      <p>Table 2: Average makespan (h) and slowdown per controller
configuration.</p>
      <p>Overall makespan evaluation. Table 2 shows the
average makespan (in hours) for the three configurations of the
controller and the reference workflow execution. The
degradation of the makespan is expected due to the online and
unknown conditions (no information about the tasks is
available in advance). In spite of the fact that the mean does not
provide accurate estimates, the use of a control loop
mechanism diminishes this effect. The use of controllers may also
degrade the makespan due to task preemption. However,
if tasks were scheduled only using the estimates from the
mean, the workflow would not complete its execution due to
lack of disk space or memory overflows.</p>
      <p>Executions using PID controllers outperform executions
using only the proportional (P) or the PI controller. The
PID controller slows down the application by 1.08, while the
application slowdown is 1.19 and 1.30 for the PI and P
controllers, respectively. This result suggests that the
derivative component (prediction of future errors) has a significant
impact on the workflow executions, and that the
accumulation of past errors (integral component) is also important
to prevent and mitigate faults. Therefore, below we analyze
how each of these components influences the number of tasks
scheduled, and the peaks and troughs of the controller
response function. We did not perform runs where mixed PID,
PI, and P controllers were part of the same simulation, since
it would be very difficult to determine the influence of each
controller.</p>
      <p>Data footprint. Figure 5 shows the time series of the
number of tasks scheduled or preempted during workflow
executions, for (a) the proportional (P), (b) the proportional-integral
(PI), and (c) the proportional-integral-derivative (PID) controllers.
For each controller configuration, we present a
single execution, the one whose makespan is the closest to the
average makespan value shown in Table 2. Task preemptions
are represented as negative values (red bars), while positive
values (blue bars) indicate the number of tasks scheduled
at an instant of time. Additionally, the right y-axis shows
the step response of the controller input value (black line)
for disk usage during the workflow execution. Recall that
positive input values (u(t) &gt; 0, Equation 1) trigger task
scheduling, while negative input values (u(t) &lt; 0) trigger
task preemption.</p>
      <p>The proportional controller (P, Figure 5a) is limited to
the current error, i.e., the amount of disk space that is
over/underutilized. Since the controller input value is strictly
proportional to the error, there is a burst in the number
of tasks to be scheduled at the beginning of the execution.
This bursty pattern and the nearly constant variation of
the input value lead the system to an inconsistent state,
where the remaining tasks to be scheduled cannot bring the
controller within the steady state. Consequently, tasks are
constantly scheduled and then preempted. In the example
scenario shown in Figure 5a, this process occurs at about 4 h,
and performs more than 6,000 preemptions. Table 3 shows
the average number of preemptions and cleanup task
occurrences per workflow execution. On average, proportional
controllers produced more than 7,000 preemptions, but no
cleanup tasks. The lack of cleanup tasks indicates that the
number of concurrent executions is very low (mostly influenced
by the number of task preemptions), which is observed
from the high average application slowdown of 1.30.</p>
      <p>Table 3: Average number of preempted tasks and cleanup tasks per
controller (P, PI, and PID).</p>
      <p>The proportional-integral controller (PI, Figure 5b)
aggregates the cumulative error when computing the response of
the controller. As a result, the bursty pattern is smoothed
along the execution, and task concurrency is increased. The
cumulative error tends to increase the response of the PI
controller at each iteration (both positively and negatively).
Thus, task preemption occurs earlier during execution. On
the other hand, this behavior mitigates the vicious cycle
present in the P controllers, and consequently the average
number of preempted tasks is substantially reduced, to 168
(Table 3). A drawback of using a PI controller is the
presence of cleanup tasks, which is due to the higher level of
concurrency among task executions.</p>
      <p>The proportional-integral-derivative controller (PID,
Figure 5c) gives weight to the previous response produced
by the controller (the last computed error). The
derivative component drives the controller to trigger actions once
the current error follows (or increases) the previous error
trend. In this case, the control loop only performs actions
when disk usage is moving toward an overflow or
underutilization state. Note that the number of actions
(scheduling/preemption) triggered in Figure 5c is much smaller than the
number triggered by the PI controller: the average number
of preempted tasks is 73, and only 4 cleanup tasks on average
are spawned (Table 3).</p>
      <p>Memory Usage. Figure 6 shows the time series of the
number of tasks scheduled or preempted during the
workflow executions for the memory controllers ((a) the P, (b) the PI,
and (c) the PID controller). The right y-axis
shows the step response of the controller input value (black
line) for memory usage during the workflow execution. We
present the response function of a controller attached to a
standard node (32 cores, 64 GB RAM, Section 5.2), which
runs the population, pair_overlap_mutations, and
frequency_overlap_mutations tasks. The total memory
allocation required to run all these tasks is over 4 TB, which
might lead the system to memory overflow states.</p>
      <p>When using the proportional controller (P, Figure 6a),
most of the actions are triggered by the data footprint
controller (Figure 5a). As mentioned above, memory does not
become an issue when only the proportional error is taken
into account, since task execution is nearly sequential (low
level of concurrency). As a result, only a few tasks (on
average fewer than 5) are preempted due to memory overflow.
Note that the process of constant task scheduling (about 50 h of
execution) is strongly influenced by the memory controller.
Also, the step response shown in Figure 6a highlights that
most of the task preemptions occur on the standard node.
This result suggests that actions performed by the global
data footprint controller are affected by actions triggered by
the local memory controller. The analysis of the influence of
multiple concurrent controllers is beyond the scope of this
paper; however, this result demonstrates that controllers should
be used sparingly, and actions triggered by controllers should
be performed according to priority or the controller's hierarchical level.</p>
      <p>The PI controller (Figure 6b) mitigates this effect, since
the cumulative error prevents the controller from
triggering repeated actions. Observing the step response of the
PI memory controller and the PI data footprint controller
(Figure 5b), we notice that most of the task preemptions
are triggered by the memory controller, particularly in the
first quarter of the execution. The average data footprint per
task of the population, pair_overlap_mutations, and
frequency_overlap_mutations tasks is 0.02 GB, 1.85 GB, and
1.83 GB (Table 1), respectively. Thus, the data footprint
controller tends to increase the number of concurrent tasks.
In the absence of memory controllers, the workflow
execution would tend toward memory overflow, and thus lead to a
failed state.</p>
      <p>The derivative component of the PID controller (Figure 6c)
acts as a catalyst to improve memory usage: it decreases the
overshoot and the settling time without affecting the
steady-state error. As a result, the number of actions triggered
by the PID memory controller is significantly reduced when
compared to the PI or P controllers.</p>
      <p>Although the experiments conducted in this feasibility
study considered equal weights for each of the components in
a PID controller (i.e., Kp = Ki = Kd = 1), we have
demonstrated that correctly defined executions complete with
acceptable performance, and that faults were detected far in
advance of their occurrence and automatically handled
before they led the workflow execution to an unrecoverable
state. In the next section, we explore the use of a simple and
commonly used tuning method to calibrate the three PID
gain parameters.</p>
      <p>Table 4: Ziegler-Nichols tuning rules.</p>
      <preformat>
Control Type    Kp         Ki            Kd
P               0.50 Ku    -             -
PI              0.45 Ku    1.2 Kp/Tu     -
PID             0.60 Ku    2 Kp/Tu       Kp Tu/8
      </preformat>
    </sec>
    <sec id="sec-11">
      <title>6. TUNING PID CONTROLLERS</title>
      <p>
        The goal of tuning a PID loop is to make it stable,
responsive, and to minimize overshooting. However, there is
no optimal way to achieve responsiveness without
compromising overshooting, or vice-versa. Therefore, a plethora of
methods have been developed for tuning PID control loops.
In this paper, we use the Ziegler-Nichols method to tune
the gain parameters of the data footprint and memory
controllers. This is one of the most common heuristics that
attempts to produce tuned values for the three PID gain
parameters (Kp, Ki, and Kd) given two measured feedback
loop parameters derived from the following measurements:
(1) the period Tu of the oscillation frequency at the stability
limit, and (2) the gain margin Ku for loop stability. In this
method, the Ki and Kd gains are first set to zero. Then, the
proportional gain Kp is increased until it reaches the
ultimate gain Ku, at which the output of the loop starts to
oscillate. Ku and the oscillation period Tu are then used to set
the gains according to the values described in Table 4 [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ].
A detailed explanation of the method can be found in [
        <xref ref-type="bibr" rid="ref37">37</xref>
        ].
In this section, we present how we determine the period
Tu and the gain margin Ku for loop stability.
      </p>
    </sec>
    <sec id="sec-12">
      <title>6.1 Determining Tu and Ku</title>
      <p>
        The Ziegler-Nichols oscillation method is based on
experiments executed on an established closed loop. The overview
of the tuning procedure is as follows [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]:
1. Turn the PID controller into a P controller by setting
Ki = Kd = 0. Initially, Kp is also set to zero;
2. Increase Kp until there are sustained oscillations in the
signal of the control system. This Kp value is denoted
the ultimate (or critical) gain, Ku;
3. Measure the ultimate (or critical) period Tu of the
sustained oscillations; and
4. Calculate the controller parameter values according to
Table 4 (transcribed in the sketch below), and use these
parameter values in the controller.</p>
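      <p>For reference, the Table 4 rules translate directly into the
following small Python helper (a transcription of the table; the
function name is ours):</p>
      <preformat>
def ziegler_nichols_gains(ku, tu, control_type="PID"):
    """Map the measured ultimate gain Ku and oscillation period Tu to
    (Kp, Ki, Kd) using the Ziegler-Nichols rules of Table 4."""
    if control_type == "P":
        return 0.50 * ku, 0.0, 0.0
    if control_type == "PI":
        kp = 0.45 * ku
        return kp, 1.2 * kp / tu, 0.0
    kp = 0.60 * ku                        # PID
    return kp, 2.0 * kp / tu, kp * tu / 8.0
      </preformat>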
      <p>Since workflow executions are intrinsically dynamic (due to
the arrival of new tasks at runtime), it is difficult to establish
a sustained oscillation in the signal. Therefore, in this paper
we measured sustained oscillations in the signal within the
execution of long-running tasks, in this case the individuals
tasks (Table 1). We conducted several runs (O(100)) with
the proportional (P) controller to compute the period Tu and
the gain margin Ku. Table 5 shows the values of Ku and
Tu for each controller used in the paper, as well as the tuned
gain values of Kp, Ki, and Kd for the PID controller.</p>
    </sec>
    <sec id="sec-12b">
      <title>6.2 Experimental Evaluation and Discussion</title>
      <p>
We have conducted runs with the tuned PID controllers
for both the data footprint and memory usage. Figure 7
shows the time series of the number of tasks scheduled or
preempted during the workflow executions, and the step
response of the controller input value (right y-axis), for (a) the
PID data footprint controller and (b) the PID memory controller. The
average workflow execution makespan is 386,561 s, which yields a
slowdown of 1.01. The average number of preempted tasks
is around 18, and only a single cleanup task was used in
each workflow execution. The controller step responses, for
both the data footprint (Figure 7a) and the memory usage
(Figure 7b), show lower peaks and troughs during the
workflow execution when compared to the PID controllers
using equal weights for the gain parameters (Figures 5c and 6c,
respectively). More specifically, the controller input value
is reduced by 30% for the memory controller attached to
the standard node. This behavior is attained through the
weighting provided by the tuned parameters. However,
tuning the gain parameters cannot ensure that an optimal
schedule will be produced for workflow runs (mostly due
to the dynamism inherent in workflow executions), as a few
preemptions are still triggered.</p>
      <p>Although the Ziegler-Nichols method provides quasi-optimal
workflow executions (for the workflow studied in this paper),
the key factor in its success is the specialization of
the controllers to a single application. In production
systems, such a methodology may not be realistic because of the
variety of applications run by different users: deploying
a PID controller per application and per component (e.g.,
disk, memory, network, etc.) may significantly increase the
complexity of the system and the system's requirements. On
the other hand, controllers may be deployed in the user's
space (or per workflow engine) to manage a small number of
workflow executions. In addition, the time required to
process the current state of the system and decide whether to
trigger an action is nearly instantaneous, which favors the use
of PID controllers in online and real-time workflow systems.
More sophisticated methods (e.g., using machine learning)
may provide better approaches to tune the gain parameters.
However, they may also add an important overhead.</p>
    </sec>
    <sec id="sec-13">
      <title>7. CONCLUSION</title>
      <p>In this paper, we have described, evaluated, and discussed
the feasibility of using simple PID controllers to prevent
and mitigate faults online and under unknown conditions in
workflow executions. We have addressed two common faults
of today's science applications, data storage overload and
memory overflow (major issues in data-intensive workflows),
as use cases to demonstrate the feasibility of the proposed
approach.</p>
      <p>Experimental results using simply defined control loops
(no tuning) show that faults are detected and prevented
before they occur, leading the workflow execution to
completion with acceptable performance (slowdown of 1.08). The
experiments also demonstrated the importance of each
component of a PID controller. We then used the Ziegler-Nichols
method to tune the gain parameters of the controllers (both
data footprint and memory usage). Experimental results
show that the control loop system produced a nearly optimal
schedule: a slowdown of 1.01. Therefore, we claim that the
preliminary results of this work open a new avenue of
research in workflow management systems.</p>
      <p>
        We acknowledge that PID controllers should be used
sparingly, and that metrics (and actions) should be defined in a way
that they do not lead the system to an inconsistent state, as
observed in this paper when only the proportional
component was used. Therefore, we plan to investigate the
simultaneous use of multiple control loops at the application
and infrastructure levels, to determine to what extent this
approach may negatively impact the system. We also plan
to extend our synthetic workflow generator [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] (which can
produce realistic synthetic workflows based on profiles
extracted from execution traces) to generate estimates of data
and memory usage based on the gathered measurements.
Acknowledgments. This work was funded by DOE
contract number #DE-SC0012636, "Panorama: Predictive
Modeling and Diagnostic Monitoring of Extreme Science
Workflows". This work was carried out when Rosa Filgueira
worked for the University of Edinburgh, and was funded by
the Postdoctoral and Early Career Researcher Exchanges
(PECE) fellowship funded by the Scottish Informatics and
Computer Science Alliance (SICSA) in 2016, and the
Wellcome Trust-University of Edinburgh Institutional Strategic
Support Fund.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] Populations - 1000 genome.
          http://1000genomes.org/category/population.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <article-title>[2] 1000genome workflow</article-title>
          . https://github.com/pegasus-isi/1000genome-workflow.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Arabnejad</surname>
          </string-name>
          et al.
          <article-title>Fairness resource sharing for dynamic workflow scheduling on heterogeneous systems</article-title>
          .
          <source>In 2012 IEEE 10th International Symposium on Parallel and Distributed Processing with Applications (ISPA)</source>
          , pages
          <fpage>633</fpage>
          -
          <lpage>639</lpage>
          . IEEE,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bala</surname>
          </string-name>
          et al.
          <article-title>Intelligent failure prediction models for scientific workflows</article-title>
          .
          <source>Expert Systems with Applications</source>
          ,
          <volume>42</volume>
          (
          <issue>3</issue>
          ):
          <fpage>980</fpage>
          -
          <lpage>989</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>O. A.</given-names>
            <surname>Ben-Yehuda</surname>
          </string-name>
          et al.
          <article-title>Expert: Pareto-efficient task replication on grids and a cloud</article-title>
          .
          <source>In 2012 IEEE 26th International Parallel &amp; Distributed Processing Symposium (IPDPS)</source>
          , pages
          <fpage>167</fpage>
          -
          <lpage>178</lpage>
          . IEEE,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R. N.</given-names>
            <surname>Calheiros</surname>
          </string-name>
          et al.
          <article-title>Cloudsim: a toolkit for modeling and simulation of cloud computing environments and evaluation of resource provisioning algorithms</article-title>
          .
          <source>Software: Practice and Experience</source>
          ,
          <volume>41</volume>
          (
          <issue>1</issue>
          ):
          <fpage>23</fpage>
          -
          <lpage>50</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.</given-names>
            <surname>Casanova</surname>
          </string-name>
          .
          <article-title>On the harmfulness of redundant batch requests</article-title>
          .
          <source>In 15th IEEE International Conference on High Performance Distributed Computing</source>
          , pages
          <fpage>255</fpage>
          -
          <lpage>266</lpage>
          . IEEE,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>I.</given-names>
            <surname>Casas</surname>
          </string-name>
          et al.
          <article-title>A balanced scheduler with data reuse and replication for scientific workflows in cloud computing systems</article-title>
          .
          <source>Future Generation Computer Systems</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          et al.
          <article-title>Workflow overhead analysis and optimizations</article-title>
          .
          <source>In Proceedings of the 6th workshop on Workflows in support of large-scale science</source>
          , pages
          <fpage>11</fpage>
          -
          <lpage>20</lpage>
          . ACM,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          et al.
          <article-title>Dynamic and fault-tolerant clustering for scientific workflows</article-title>
          .
          <source>IEEE Transactions on Cloud Computing</source>
          ,
          <volume>4</volume>
          (
          <issue>1</issue>
          ):
          <fpage>49</fpage>
          -
          <lpage>62</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Chirkin</surname>
          </string-name>
          et al.
          <article-title>Execution time estimation for workflow scheduling</article-title>
          .
          <source>In 9th Workshop on Workflows in Support of Large-Scale Science (WORKS)</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>G. P.</given-names>
            <surname>Consortium</surname>
          </string-name>
          et al.
          <article-title>A global reference for human genetic variation</article-title>
          .
          <source>Nature</source>
          ,
          <volume>526</volume>
          (
          <issue>7571</issue>
          ):
          <fpage>68</fpage>
          -
          <lpage>74</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>F.</given-names>
            <surname>Costa</surname>
          </string-name>
          et al.
          <article-title>Handling failures in parallel scientific workflows using clouds</article-title>
          .
          <source>In High Performance Computing, Networking, Storage and Analysis (SCC)</source>
          , pages
          <fpage>129</fpage>
          -
          <lpage>139</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>E.</given-names>
            <surname>Deelman</surname>
          </string-name>
          et al.
          <article-title>Pegasus, a workflow management system for science automation</article-title>
          .
          <source>Future Generation Computer Systems</source>
          ,
          <volume>46</volume>
          (
          <issue>0</issue>
          ):
          <fpage>17</fpage>
          -
          <lpage>35</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>R.</given-names>
            <surname>Ferreira</surname>
          </string-name>
          da Silva et al.
          <article-title>Self-healing of workflow activity incidents on distributed computing infrastructures</article-title>
          .
          <source>Future Generation Computer Systems</source>
          ,
          <volume>29</volume>
          (
          <issue>8</issue>
          ):
          <fpage>2284</fpage>
          -
          <lpage>2294</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>R.</given-names>
            <surname>Ferreira</surname>
          </string-name>
          da Silva et al.
          <article-title>Community resources for enabling and evaluating research on scientific workflows</article-title>
          .
          <source>In 10th IEEE International Conference on e-Science, eScience'14</source>
          , pages
          <fpage>177</fpage>
          -
          <lpage>184</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>R.</given-names>
            <surname>Ferreira</surname>
          </string-name>
          da Silva et al.
          <article-title>Controlling fairness and task granularity in distributed, online, non-clairvoyant workflow executions</article-title>
          .
          <source>Concurrency and Computation: Practice and Experience</source>
          ,
          <volume>26</volume>
          (
          <issue>14</issue>
          ):
          <fpage>2347</fpage>
          -
          <lpage>2366</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>R.</given-names>
            <surname>Ferreira da Silva</surname>
          </string-name>
          et al.
          <article-title>Online task resource consumption prediction for scientific workflows</article-title>
          .
          <source>Parallel Processing Letters</source>
          ,
          <volume>25</volume>
          (
          <issue>3</issue>
          ),
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ferro</surname>
          </string-name>
          et al.
          <article-title>A proposal to apply inductive logic programming to self-healing problem in grid computing: How will it work?</article-title>
          <source>Concurrency and Computation: Practice and Experience</source>
          ,
          <volume>23</volume>
          (
          <issue>17</issue>
          ):
          <fpage>2118</fpage>
          -
          <lpage>2135</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hary</surname>
          </string-name>
          et al.
          <article-title>Design and evaluation of a self-healing Kepler for scientific workflows</article-title>
          .
          <source>In 19th ACM International Symposium on High Performance Distributed Computing (HPDC)</source>
          , pages
          <fpage>340</fpage>
          -
          <lpage>343</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>F.</given-names>
            <surname>Haugen</surname>
          </string-name>
          .
          <article-title>Ziegler-Nichols' closed-loop method</article-title>
          .
          <source>Technical report, TechTeach</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>H.</given-names>
            <surname>Hiden</surname>
          </string-name>
          et al.
          <article-title>A framework for dynamically generating predictive models of workflow execution</article-title>
          .
          <source>In 8th Workshop on Workflows in Support of Large-Scale Science (WORKS)</source>
          , pages
          <fpage>77</fpage>
          -
          <lpage>87</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>G.</given-names>
            <surname>Juve</surname>
          </string-name>
          et al.
          <article-title>Practical resource monitoring for robust high throughput computing</article-title>
          .
          <source>In 2nd Workshop on Monitoring and Analysis for High Performance Computing Systems Plus Applications</source>
          , HPCMASPA'15, pages
          <fpage>650</fpage>
          -
          <lpage>657</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>G.</given-names>
            <surname>Kandaswamy</surname>
          </string-name>
          et al.
          <article-title>Fault tolerance and recovery of scientific workflows on computational grids</article-title>
          .
          <source>In 8th IEEE International Symposium on Cluster Computing and the Grid (CCGRID'08)</source>
          , pages
          <fpage>777</fpage>
          -
          <lpage>782</lpage>
          . IEEE,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>S.</given-names>
            <surname>Köhler</surname>
          </string-name>
          et al.
          <article-title>Improving workflow fault tolerance through provenance-based recovery</article-title>
          .
          <source>In International Conference on Scientific and Statistical Database Management</source>
          , pages
          <fpage>207</fpage>
          -
          <lpage>224</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>A. S.</given-names>
            <surname>McCormack</surname>
          </string-name>
          et al.
          <article-title>Rule-based autotuning based on frequency domain identification</article-title>
          .
          <source>IEEE Transactions on Control Systems Technology</source>
          ,
          <volume>6</volume>
          (
          <issue>1</issue>
          ):
          <fpage>43</fpage>
          -
          <lpage>61</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>J.</given-names>
            <surname>Montagnat</surname>
          </string-name>
          et al.
          <article-title>Workflow-based comparison of two distributed computing infrastructures</article-title>
          .
          <source>In 2010 5th Workshop on Workflows in Support of Large-Scale Science (WORKS)</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          . IEEE,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>N.</given-names>
            <surname>Muthuvelu</surname>
          </string-name>
          et al.
          <article-title>Task granularity policies for deploying bag-of-task applications on global grids</article-title>
          .
          <source>Future Generation Computer Systems</source>
          ,
          <volume>29</volume>
          (
          <issue>1</issue>
          ):
          <fpage>170</fpage>
          -
          <lpage>181</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>I.</given-names>
            <surname>Pietri</surname>
          </string-name>
          et al.
          <article-title>A performance model to estimate execution time of scientific workflows on the cloud</article-title>
          .
          <source>In 2014 9th Workshop on Workflows in Support of Large-Scale Science (WORKS)</source>
          , pages
          <fpage>11</fpage>
          -
          <lpage>19</lpage>
          . IEEE,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>D.</given-names>
            <surname>Poola</surname>
          </string-name>
          et al.
          <article-title>Fault-tolerant workflow scheduling using spot instances on clouds</article-title>
          .
          <source>Procedia Computer Science</source>
          ,
          <volume>29</volume>
          :
          <fpage>523</fpage>
          -
          <lpage>533</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>D.</given-names>
            <surname>Poola</surname>
          </string-name>
          et al.
          <article-title>Enhancing reliability of workflow execution using task replication and spot instances</article-title>
          .
          <source>ACM Transactions on Autonomous and Adaptive Systems (TAAS)</source>
          ,
          <volume>10</volume>
          (
          <issue>4</issue>
          ):
          <fpage>30</fpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <article-title>PID simulator</article-title>
          . https://github.com/rafaelfsilva/pid-simulator.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>S.</given-names>
            <surname>Srinivasan</surname>
          </string-name>
          et al.
          <article-title>A cleanup algorithm for implementing storage constraints in scientific workflow executions</article-title>
          .
          <source>In 9th Workshop on Workflows in Support of Large-Scale Science, WORKS'14</source>
          , pages
          <fpage>41</fpage>
          -
          <lpage>49</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>S. W.</given-names>
            <surname>Sung</surname>
          </string-name>
          et al.
          <article-title>Proportional-integral-derivative control</article-title>
          .
          <source>Process Identification and PID Control</source>
          , pages
          <fpage>111</fpage>
          -
          <lpage>149</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>I. J.</given-names>
            <surname>Taylor</surname>
          </string-name>
          et al.
          <article-title>Workflows for e-Science: scientific workflows for grids</article-title>
          .
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          et al.
          <article-title>Combined fault tolerance and scheduling techniques for workflow applications on computational grids</article-title>
          .
          <source>In 9th IEEE/ACM International Symposium on Cluster Computing and the Grid</source>
          , pages
          <fpage>244</fpage>
          -
          <lpage>251</lpage>
          . IEEE Computer Society,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>J. G.</given-names>
            <surname>Ziegler</surname>
          </string-name>
          et al.
          <article-title>Optimum settings for automatic controllers</article-title>
          .
          <source>Trans. ASME</source>
          ,
          <volume>64</volume>
          (
          <issue>11</issue>
          ),
          <year>1942</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>