     Towards Serverless Execution of Scientific Workflows –
                    HyperFlow Case Study

                                                      Maciej Malawski
                                         AGH University of Science and Technology
                                            Department of Computer Science
                                                     Krakow, Poland
                                                  malawski@agh.edu.pl

ABSTRACT
Scientific workflows consisting of a high number of dependent tasks represent an important class of complex scientific applications. Recently, a new type of serverless infrastructure has emerged, represented by such services as Google Cloud Functions or AWS Lambda. In this paper we take a look at such serverless infrastructures, which are designed mainly for processing background tasks of Web applications. We evaluate their applicability to more compute- and data-intensive scientific workflows and discuss possible ways to repurpose serverless architectures for execution of scientific workflows. A prototype workflow executor function has been developed using Google Cloud Functions and coupled with the HyperFlow workflow engine. The function can run workflow tasks on the Google infrastructure, and provides such capabilities as data staging to/from Google Cloud Storage and execution of custom application binaries. We have successfully deployed and executed the Montage astronomic workflow, often used as a benchmark, and we report on initial results of performance evaluation. Our findings indicate that the simple mode of operation makes this approach easy to use, although there are costs involved in preparing portable application binaries for execution in a remote environment. While our evaluation uses a preproduction (alpha) version of the Google Cloud Functions platform, we find the presented approach highly promising. We also discuss possible future steps related to execution of scientific workflows in serverless infrastructures, and the implications with regard to resource management for scientific applications in general.

Keywords
Scientific workflows, cloud, serverless infrastructures

Copyright held by the author(s).

1.   INTRODUCTION
   Scientific workflows consisting of a high number of dependent tasks represent an important class of complex scientific applications that have been successfully deployed and executed in traditional cloud infrastructures, including Infrastructure as a Service (IaaS) clouds. Recently, a new type of serverless infrastructure has emerged, represented by such services as Google Cloud Functions (GCF) [2] or AWS Lambda [1]. These services allow deployment of software in the form of functions that are executed in the provider's infrastructure in response to specific events, such as new files being uploaded to a cloud data store, messages arriving in queue systems, or direct HTTP calls. This approach frees the user from having to maintain a server, including configuration and management of virtual machines, while resource management is provided by the platform in an automated and scalable way.
   In this paper we take a look at such serverless infrastructures. Although they are designed mainly for processing background tasks of Web applications, we nevertheless investigate whether they can be applied to more compute- and data-intensive scientific workflows. The main objectives of this paper are as follows:

   • To present the main features of serverless infrastructures, comparing them to traditional infrastructure-as-a-service clouds,

   • To discuss the options of using serverless infrastructures for execution of scientific workflows,

   • To present our experience with a prototype implemented using the HyperFlow [3] workflow engine and Google Cloud Functions (alpha version),

   • To evaluate our approach using the Montage workflow [20], a real-world astronomic application,

   • To discuss the costs and benefits of this approach, together with its implications for resource management of scientific workflows in emerging infrastructures.

   The paper is organized as follows. We begin with an overview of serverless infrastructures in Section 2. In Section 3 we propose and discuss alternative options for serverless architectures of scientific workflow systems. Our prototype implementation, based on HyperFlow and GCF, is described in Section 4. This is followed by evaluation using the Montage application, presented in Section 5. We discuss implications for resource management in Section 6 and present related work in Section 7. Section 8 provides a summary and description of future work.



2.   OVERVIEW OF SERVERLESS CLOUDS
   Writing “serverless” applications is a recent trend, mainly addressing Web applications. It frees programmers from having to maintain a server – instead they can use a set of existing cloud services directly from their application. Examples of such services include cloud databases such as Firebase or DynamoDB, messaging systems such as Google Cloud Pub/Sub, notification services such as Amazon SNS and so on. When there is a need to execute custom application code in the background, special “cloud functions” (hereafter simply referred to as functions) can be called. Examples of such functions are AWS Lambda and Google Cloud Functions (GCF).
   Both Lambda and GCF are based on the functional programming paradigm: a function is a piece of software that can be deployed on the provider's cloud infrastructure and it performs a single operation in response to an external event.
   Functions can be triggered by:

   • an event generated by the cloud infrastructure, e.g. a change in a cloud database, a file being uploaded to a cloud object store, a new item appearing in a messaging system, or an action scheduled at a specified time,

   • a direct request from the application via HTTP or cloud API calls.

   The cloud infrastructure which hosts the functions is responsible for automatic provisioning of resources (including CPU, memory, network and temporary storage), automatic scaling when the number of function executions varies over time, as well as monitoring and logging. The user is responsible for providing executable code in a format required by the framework. Typically, the execution environment is limited to a set of supported languages: Node.js, Java and Python in the case of AWS Lambda, and Node.js in the case of GCF. The user has no control over the execution environment, such as the underlying operating system or the versions of the runtime libraries, but can use custom libraries with package managers and even upload binary code to be executed.
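   To give a concrete picture of this programming model, the following is a minimal sketch of an HTTP-triggered function in Node.js. It assumes the Express-style request/response interface exposed by current GCF releases for HTTP functions (the alpha interface may have differed), and the function name and payload fields are purely illustrative:

  // index.js – minimal HTTP-triggered function (illustrative sketch).
  // The platform provisions resources, scales instances and collects logs;
  // the developer only supplies this handler and its declared dependencies.
  exports.echoTask = function (req, res) {
    const task = req.body || {};          // JSON payload parsed by the platform
    console.log('Received task:', task.name);
    // ... perform a single, short-lived operation here ...
    res.status(200).json({ name: task.name, status: 'done' });
  };

   Deployment then consists of uploading this handler together with its declared dependencies; the platform instantiates copies of it on demand.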
   Functions are thus different from Virtual Machines in IaaS clouds, where the users have full control over the OS (including root access) and can customize the execution environment to their needs. On the other hand, functions free the developers from the need to configure, maintain, and manage server resources.
   Cloud providers impose certain limits on the amount of resources a function can consume. In the case of AWS Lambda these limits are as follows: temporary disk space: 512 MB, number of processes and threads: 1024, maximum execution duration per request: 300 seconds. There is also a limit of 100 concurrent executions per region, but this limit can be increased on request. GCF, in its alpha version, does not specify limit thresholds. There is, however, a timeout parameter that can be provided when deploying a function; its default value is 60 seconds.
   Functions are thus different from permanent and stateful services, since they are not long-running processes, but rather serve individual tasks. Resource limits indicate that such cloud functions are not currently suitable for large-scale HPC applications, but can be useful for high-throughput computing workflows consisting of many fine-grained tasks.
   Functions have a fine-grained pricing model associated with them. In the case of AWS Lambda, the price is $0.20 per 1 million requests and $0.00001667 for every GB-second used, defined as execution time multiplied by the amount of memory used. There are also additional charges for data transfer and storage (when DynamoDB or S3 is used). The alpha version of Google Cloud Functions does not have a public pricing policy.
   Serverless infrastructures can be cost-effective compared to standard VMs. For example, the aggregate cost of running AWS Lambda functions with 1 GB of memory for 1 hour is $0.060012 (3,600 GB-seconds at $0.00001667 each). This is more expensive than the t2.micro instance, which also has 1 GB of RAM but costs $0.013 per hour. A t2.micro instance, however, offers only burstable performance, which means only a fraction of CPU time per hour is available. The smallest standard instance at AWS is m3.medium, which costs $0.067 per hour, but gives 3.75 GB of RAM. Cloud functions are thus more suitable for variable load conditions, while standard instances can be more economical for applications with stable workloads.






3.   OPTIONS FOR EXECUTION OF SCIENTIFIC WORKFLOWS IN SERVERLESS INFRASTRUCTURES
   In light of the identified features and limitations of serverless infrastructures and cloud functions, we can discuss the option of using them for execution of scientific workflows. We will start with a traditional execution model in IaaS cloud with no cloud functions (1), then present the queue model (2), direct executor model (3), bridge model (4), and decentralized model (5). These options are schematically depicted in Fig. 1, and discussed in detail further on.

Figure 1: Options of serverless architectures for execution of scientific workflows (traditional, queue, direct executor, bridge and decentralized models).

3.1   Traditional model
   The traditional model assumes the workflow is running in a standard IaaS cloud. In this model the workflow execution follows the well-known master-worker architecture, where the master node runs a workflow engine, tasks that are ready for execution are submitted to a queue, and worker nodes process these tasks in parallel when possible. The master node can be deployed in the cloud or outside of the cloud, while worker nodes are usually deployed as VMs in a cloud infrastructure. The worker pool is typically created on demand and can be dynamically scaled up or down depending on resource requirements.
   Such a model is represented e.g. by Pegasus and HyperFlow. The Pegasus Workflow Management System [6] uses HTCondor [18] to maintain its queue and manage workers. HyperFlow [3] is a lightweight workflow engine based on Node.js – it uses RabbitMQ as its queue and AMQP Executors on worker nodes. The deployment options of HyperFlow on grids and clouds are discussed in detail in [4].
   In this model the user is responsible for management of resources comprising the worker pool. The pool can be provisioned statically, which is commonly done in practice, but there is also ongoing research on automatic or dynamic resource provisioning for workflow applications [13, 15], which is a non-trivial task.
   In the traditional cloud workflow processing model there is a need for some storage service to store input, output and temporary data. There are multiple options for data sharing [9], but one of the most widely used approaches is to rely on existing cloud storage, such as Amazon S3 or Google Cloud Storage. This option has the advantage of providing a permanent store so that data is not lost after the workflow execution is complete and the VMs are terminated.


3.2   Queue model
   This model is similar to the traditional model: the master node and the queue remain unchanged, but the worker is replaced by a cloud function. Instead of running a pool of VMs with workers, a set of cloud functions is spawned. Each function fetches a task from a queue and processes it, returning results via the queue.
   The main advantage of this model is its simplicity, since it only requires changes in the worker module. This may be simple if the queue uses a standard protocol, such as AMQP in the case of the HyperFlow Executor, but in the case of Pegasus and HTCondor a Condor daemon (condor_startd) must run on the worker node and communicate using a proprietary Condor protocol. In this scenario implementing a worker as a cloud function would require more effort.
   Another advantage of the presented model is the ability to combine the workers implemented as functions with other workers running e.g. in a local cluster or in a traditional cloud. This would also enable concurrent usage of cloud functions from multiple providers (e.g. AWS and Google) when such a multi-cloud scenario is required.
   An important issue associated with the queue model is how to trigger the execution of the functions. If a native implementation of the queue is used (e.g. RabbitMQ as in HyperFlow), it is necessary to trigger a function for each task added to the queue. This can be done by the workflow engine or by a dedicated queue monitoring service, as sketched below. Other options include periodic function execution or recursive execution: a function can itself trigger other functions once it finishes processing data.
   To ensure a clean serverless architecture, another option is to implement the queue using a native cloud service which is already integrated with cloud functions. In the case of AWS Lambda one could implement the queue using DynamoDB: here, a function could be triggered by adding a new item to a task table. In the case of GCF, a Google Cloud Pub/Sub service can be used for the same purpose. Such a solution, however, would require more changes in the workflow engine and would not be easy to deploy in multi-cloud scenarios.
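   As an illustration of the queue-monitoring option, the following sketch consumes task messages from a RabbitMQ queue with the amqplib client and triggers a cloud function over HTTPS for each of them. The queue name, function URL and message format are illustrative assumptions, not part of an existing system:

  // queue-monitor.js – illustrative sketch of a queue monitoring service
  // that triggers one cloud function invocation per task taken from AMQP.
  const amqp = require('amqplib');
  const https = require('https');

  // Hypothetical endpoint of an HTTP-triggered task executor function.
  const FUNCTION_URL = 'https://example-region.cloudfunctions.example.com/executor';

  function invokeFunction(taskJson, done) {
    const req = https.request(FUNCTION_URL,
      { method: 'POST', headers: { 'Content-Type': 'application/json' } },
      function (res) {
        res.resume();                   // drain the response body
        done(res.statusCode === 200 ? null : new Error('HTTP ' + res.statusCode));
      });
    req.on('error', done);
    req.end(taskJson);
  }

  amqp.connect('amqp://localhost').then(function (conn) {
    return conn.createChannel();
  }).then(function (ch) {
    return ch.assertQueue('tasks').then(function () {
      ch.consume('tasks', function (msg) {
        invokeFunction(msg.content.toString(), function (err) {
          if (err) { ch.nack(msg); }    // return the task to the queue on failure
          else { ch.ack(msg); }
        });
      });
    });
  });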
3.3   Direct executor model
   This is the simplest model and requires only a workflow engine and a cloud function that serves as a task executor. It eliminates the need for a queue, since the workflow engine can trigger the cloud function directly via API/HTTP calls. Regarding development effort, it requires changes in the master and a new implementation of the worker.
   An advantage of this model is its cleanness and simplicity, but these come at the cost of tight master-worker coupling. Accordingly, it becomes more difficult to implement the multi-cloud scenario, since the workflow engine would need to be able to dispatch tasks to multiple cloud function providers.

3.4   Bridge model
   This solution is more complex, but it preserves the decoupling of the master from the worker, using a queue. In this case the master and the queue remain unchanged, but a new type of bridge worker is added. It fetches tasks from the queue and dispatches them to the cloud functions. Such a worker needs to run as a separate service (daemon) and can trigger cloud functions using the provider-specific API.
   The decoupling of the master from the worker allows for more complex and flexible scenarios, including multi-cloud deployments. A set of bridge workers can be spawned, each dispatching tasks to a different cloud function provider. Moreover, a pool of workers running in external distributed platforms, such as third-party clouds or clusters, can be used together with cloud functions.







3.5   Decentralized model
   This model re-implements the whole workflow engine in a distributed way using cloud functions. Each task of a workflow is processed by a separate function. These functions can be triggered by (a) new data items uploaded to cloud storage, or (b) other cloud functions, i.e. predecessor tasks triggering their successor tasks following completion. Option (a) can be used to represent data dependencies in a workflow, while option (b) can be used to represent control dependencies.
   In the decentralized model the structure and state of workflow execution have to be preserved in the system. The system can be implemented in a fully distributed way, by deploying a unique function for each task in the workflow. In this way, the workflow structure is mapped to a set of functions and the execution state propagates by functions being triggered by their predecessors, as illustrated by the sketch below. Another option is to deploy a generic task executor function and maintain the workflow state in a database, possibly one provided as a cloud service.
   The advantages of the decentralized approach include fully distributed and serverless execution, without the need to maintain a workflow engine. The required development effort is extensive, since it requires re-implementation of the whole workflow engine. A detailed design of such an engine is out of scope of this paper, but remains an interesting subject of future research.
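   A minimal sketch of the control-dependency chaining of option (b) is shown below: each task is deployed as its own HTTP-triggered function which, after finishing its work, invokes the functions of its successor tasks. The successor table, naming scheme and base URL are hypothetical and serve only to show how execution state could propagate without a central engine:

  // task-function.js – illustrative sketch of option (b): each task is its own
  // function and triggers its successors' functions when it completes.
  const https = require('https');

  // Hypothetical successor table and base URL of the deployed task functions.
  const SUCCESSORS = ['mConcatFit'];
  const BASE_URL = 'https://example-region.cloudfunctions.example.com/';

  exports.mDiffFit = function (req, res) {
    // 1. Run this task's computation here (omitted in the sketch).
    // 2. Propagate the execution state by invoking successor functions.
    let pending = SUCCESSORS.length;
    if (pending === 0) { return res.status(200).send('leaf task done'); }
    SUCCESSORS.forEach(function (name) {
      const r = https.request(BASE_URL + name, { method: 'POST' }, function (resp) {
        resp.resume();
        if (--pending === 0) { res.status(200).send('done, successors triggered'); }
      });
      r.on('error', function (err) {
        console.error('trigger failed:', err.message);
        if (--pending === 0) { res.status(200).send('done, with trigger errors'); }
      });
      r.end(JSON.stringify({ predecessor: 'mDiffFit' }));
    });
  };

   Note that join tasks (such as the reduction steps of a workflow) would additionally need a shared record of which predecessors have already completed, which is where the database-backed variant mentioned above becomes necessary.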

3.6   Summary of options
   As we can see, cloud functions provide multiple integration options with scientific workflow engines. The users need to decide which option is best for them based on their requirements, most notably the allowed level of coupling between the workflow engine and the infrastructure, and the need to run hybrid or cross-cloud deployments where resources from more than one provider are used in parallel. We consider the fully decentralized option as an interesting future research direction, while in the following sections we will focus on our experience with a prototype implemented using the direct executor model.

4.   PROTOTYPE BASED ON HYPERFLOW
   To evaluate the feasibility of our approach we decided to develop a prototype using the HyperFlow engine and Google Cloud Functions, applying the direct executor model. This decision has several reasons. First, HyperFlow is implemented in Node.js, while GCF supports Node.js as a native function execution environment. This good match simplifies development and debugging, which is always non-trivial in a distributed environment. Second, our selection of the direct executor model was motivated by the extensible design of HyperFlow, which can associate with each task in a workflow a specific executor function responsible for handling command-line tasks. Since GCF provides a direct triggering mechanism of cloud functions using HTTP calls, we can apply existing HTTP client libraries for Node.js, plugging support for GCF into HyperFlow as a natural extension.

4.1   Architecture and components

Figure 2: Architecture of the prototype based on HyperFlow and Google Cloud Functions.

   The schematic diagram of the prototype is shown in Fig. 2. The HyperFlow engine is extended with the GCFCommand function, which is responsible for communication with GCF. It is a replacement for the AMQPCommand function, which is used in the standard HyperFlow distributed deployment with the AMQP protocol and RabbitMQ. GCFCommand sends the task description in a JSON-encoded message to the cloud function. The GCF Executor is the main cloud function which needs to be deployed on the GCF platform. It processes the message, and uses the Storage Client for staging in and out the input and output data. It uses Google Cloud Storage and requests parallel transfers to speed up download and upload of data. The GCF Executor calls the executable, which needs to be deployed together with the function. GCF supports running custom Linux binaries, but the user has to make sure that the binary is portable, e.g. by statically linking all of its dependencies. Our architecture is thus purely serverless, with the HyperFlow engine running on a client machine and directly relying only on cloud services such as GCF and Cloud Storage.
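   The following sketch shows the general shape of such an executor function. It is an illustration only, not the actual GCF Executor code: the message fields, bucket layout and binary path are assumptions, and it uses the current @google-cloud/storage client and Node's child_process module to stage files and run a bundled executable:

  // executor.js – illustrative sketch of a task executor cloud function.
  const { Storage } = require('@google-cloud/storage');
  const execFile = require('child_process').execFile;
  const path = require('path');

  const storage = new Storage();

  exports.executor = function (req, res) {
    const task = req.body;          // e.g. { bucket, executable, args, inputs, outputs }
    const bucket = storage.bucket(task.bucket);
    const workdir = '/tmp';         // the only writable location in the sandbox

    // Stage in inputs in parallel, run the bundled binary, stage out the outputs.
    Promise.all(task.inputs.map(function (f) {
      return bucket.file(f).download({ destination: path.join(workdir, path.basename(f)) });
    })).then(function () {
      return new Promise(function (resolve, reject) {
        execFile(path.join(__dirname, task.executable), task.args, { cwd: workdir },
          function (err) { return err ? reject(err) : resolve(); });
      });
    }).then(function () {
      return Promise.all(task.outputs.map(function (f) {
        return bucket.upload(path.join(workdir, path.basename(f)), { destination: f });
      }));
    }).then(function () {
      res.status(200).json({ status: 'ok' });
    }).catch(function (err) {
      res.status(500).json({ status: 'error', message: String(err) });
    });
  };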
4.2   Fault tolerance
   Transient failures are a common risk in cloud environments. Since execution of a possibly large volume of concurrent HTTP requests in a distributed environment is always prone to errors caused by various layers of network and middleware stacks, the execution engine needs to be able to handle such failures gracefully and attempt to retry failed requests.
   In the case of HyperFlow, the Node.js ecosystem appears very helpful in this context. We used the requestretry library for implementing the HTTP client, which allows for automatic retry of failed requests with a configurable number of retries (default: 5) and delay between retries (default: 5 seconds). Our prototype uses these default settings, but in the future it will be possible to explore more advanced error handling policies taking into account error types and patterns.
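   As an illustration of this retry logic, the sketch below shows how a task request could be sent with requestretry using the same defaults; the endpoint and task format are hypothetical, and the real GCFCommand implementation may differ:

  // gcf-client.js – illustrative sketch of the retrying HTTP client.
  const request = require('requestretry');

  function submitTask(functionUrl, task, callback) {
    request({
      url: functionUrl,
      method: 'POST',
      json: task,                 // send the JSON-encoded task description
      maxAttempts: 5,             // library default: 5 attempts
      retryDelay: 5000,           // library default: 5 seconds between attempts
      retryStrategy: request.RetryStrategies.HTTPOrNetworkError
    }, function (err, response, body) {
      if (err) { return callback(err); }
      callback(null, body);       // the executor's JSON response
    });
  }

   With the HTTPOrNetworkError strategy a request is retried on network errors and on 5xx responses, which corresponds to the transient failures described above.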







5.   EVALUATION USING MONTAGE WORKFLOW
   Based on our prototype which combines HyperFlow and Google Cloud Functions, we performed several experiments to evaluate our approach. The goals of the evaluation are as follows:

   • To validate the feasibility of our approach, i.e. to determine whether it is practical to execute scientific workflows in serverless infrastructures.

   • To measure performance characteristics of the execution environment in order to provide hints for resource management.

   Details regarding our sample application, experiment setup and results are provided below.

5.1   Montage workflow and experiment setup

Montage application.
   For our study we selected the Montage [8] application, which is an astronomic workflow. It is often used for various benchmarks and performance evaluation, since it is open-source and has been widely studied by the research community. The application processes a set of input images from astronomic sky surveys and constructs a single large-scale mosaic image. The structure of the workflow is shown in Fig. 3: it consists of several stages which include parallel processing sections, reduction operations and sequential processing.

Figure 3: Structure of the Montage workflow used for experiments (stages: mProjectPP, mDiffFit, mConcatFit, mBgModel, mBackground, mImgTbl, mAdd, mShrink, mJPEG).

   The size of the workflow, i.e. the number of tasks, depends on the size of the area of the target image, which is measured in angular degrees. For example, a small-scale 0.25-degree Montage workflow consists of 43 tasks, with 10 parallel mProjectPP tasks and 17 mDiffFit tasks, while more complex workflows can involve thousands of tasks. In our experiments we used the Montage 0.25 workflow with 43 tasks, and the Montage 0.4 workflow with 107 tasks.

Experiment setup.
   We used a recent version of HyperFlow and an alpha version of Google Cloud Functions. The HyperFlow engine was installed on a client machine with Ubuntu 14.04 LTS Linux and Node.js 4.5.0. For staging the input and output data, as well as for temporary storage, we used a Google Cloud Storage bucket with standard options. Both Cloud Functions and Cloud Storage were located in the us-central1 region, while the client machine was located in Europe.

Data preparation and handling.
   To run the Montage workflow in our experiments, all input data needs to be uploaded to the cloud storage first. For each workflow run, a separate subfolder in the Cloud Storage bucket is created. The subfolder is then used for exchange of intermediate data and for storing the final results. Data can be conveniently uploaded using a command-line tool which supports parallel transfers. The web-based Google Cloud Console is useful for browsing results and displaying the resulting JPEG images.

5.2   Feasibility
   To assess the feasibility of our approach we tested our prototype using the Montage 0.25 workflow. We collected task execution start and finish timestamps, which give the total duration of cloud function execution. This execution time also includes data transfers. Based on the collected execution traces we plotted Gantt charts. Altogether, several runs were performed and an example execution trace (representative of all runs) is shown in Fig. 4.

Figure 4: Sample run of Montage 0.25 workflow (Gantt chart of task execution over time in seconds, covering the mProjectPP, mDiffFit, mConcatFit, mBgModel, mBackground, mImgtbl, mAdd, mShrink and mJPEG tasks).

   Montage 0.25 is a relatively small-scale workflow, but the resulting plot clearly reveals that the cloud function-based approach works well in this case. We can observe that the parallel tasks of the workflow (mProjectPP, mDiffFit and mBackground) are indeed short-running and can be processed in parallel. The user has no control over the level of parallelism, but the cloud platform is able to process tasks in a scalable way, as stated in the documentation. We also observe no significant delays between task executions, even though the requests between the HyperFlow engine and the cloud functions are transmitted using HTTP over a wide-area network, including a trans-Atlantic connection.
   Similar results were obtained for the Montage 0.4 workflow, which consists of 107 tasks; however, the corresponding detailed plots are not reproduced here for reasons of readability. It should be noted that while the parallel tasks of Montage are relatively fine-grained, the execution time of sequential processing tasks such as mImgTbl and mAdd grows with the size of the workflow and can exceed the default limit of 60 seconds imposed upon cloud function execution. This limit can be extended when deploying the cloud function, but there is currently no information regarding the maximum duration of such requests. We can only expect that such limits will increase as the platforms become more mature. This was indeed the case with Google App Engine, where the initial request limit was increased from 30 seconds to 10 minutes [14].







5.3   Deployment size and portability
   Our current approach requires us to deploy the cloud function together with all the application binaries. The Google Cloud Functions execution environment enables inclusion of dependencies as Node.js libraries packaged using the Node Package Manager (NPM), which are automatically installed when the function is deployed. Moreover, the user can provide a set of JavaScript source files, configuration and binary dependencies to be uploaded together with the function.
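   As an illustration, a deployment package for such a function could be organized as follows; the file names, dependency version and layout are hypothetical, shown only to indicate how JavaScript sources, NPM dependencies and application binaries can be bundled together:

  executor/
    index.js          // entry point containing the executor function
    package.json      // NPM dependencies, installed automatically at deployment
    bin/              // statically linked application binaries (e.g. Montage tools)
      mProjectPP
      mDiffFit

  // package.json (illustrative)
  {
    "name": "gcf-executor",
    "version": "0.1.0",
    "dependencies": {
      "@google-cloud/storage": "^1.0.0"
    }
  }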
Cloud Function execution environment enables inclusion of                                           ●                           ● ●●
                                                                                                                                ●   ●
                                                                                                                                            ●
                                                                                                                                            ●
                                                                                                                                               ●
                                                                                                                                                 ●      ●
                                                                                                                                                                ●
                                                                                                                                                                 ●
                                                                                                                      ●           ●                ●    ●
                                                                                                                                             ●
dependencies in Node.js libraries packaged using Node Pack-                                         ●     ●
                                                                                                           ●●
                                                                                                                                  ●●
                                                                                                                                                  ● ●
                                                                                                                                                                             ●
                                                                                                                                                                                     ●
                                                                                                                                      ● ●●              ●                    ●    ●
age Manager (NPM) which are automatically installed when                                              ●
                                                                                                      ●
                                                                                                            ●
                                                                                                                 ●●
                                                                                                                     ● ● ●
                                                                                                                        ●
                                                                                                                                ●
                                                                                                                                   ●
                                                                                                                                   ●
                                                                                                                                      ●
                                                                                                                                           ● ●● ●
                                                                                                                                           ●
                                                                                                                                                  ●
                                                                                                                                                  ●● ●
                                                                                                                                                      ●● ●
                                                                                                                                                          ●
                                                                                                                                                               ●
                                                                                                                                                                        ●    ●
                                                                                                                                                                            ●●
                                                                                                                                                                                   ●
                                                                                                    ●●       ● ●                 ● ●● ●● ●     ●                                ●
                                                                                                                                  ●● ● ●●● ● ●●
the function is deployed. Moreover, the user can provide a                                    4
                                                                                                     ●●         ●●     ●
                                                                                                                                  ●
                                                                                                                                   ●
                                                                                                                                        ●
                                                                                                                                              ●●●● ● ●
                                                                                                                                                  ● ●
                                                                                                                                                                        ●     ●
                                                                                                                                                                              ●●● ●
                                                                                                     ● ●●        ●     ●         ● ●●●●●●●               ●     ●
                                                                                                                                                                            ●
                                                                                                                                                                            ●        ●
                                                                                                                      ●
                                                                                                                                                   ●●● ●●          ●         ● ●●
set of JavaScript source files, configuration and binary de-                                         ●●
                                                                                                          ● ●
                                                                                                                   ● ● ●
                                                                                                                                 ●
                                                                                                                                    ●●
                                                                                                                                          ● ●
                                                                                                                                             ●
                                                                                                                                                  ●
                                                                                                                                                      ● ●
                                                                                                                                                   ● ●●
                                                                                                                                                                     ●
                                                                                                                                                                  ●● ●            ●
                                                                                                      ● ●                            ●●● ●●
                                                                                                                                 ●●● ●        ●       ●●●        ●● ●● ●
                                                                                                                                                                    ● ●            ●
                                                                                                                                        ●     ●                                ●●
pendencies to be uploaded together with the function.                                                   ●●
                                                                                                                 ●
                                                                                                        ● ●● ● ● ●
                                                                                                                  ●●
                                                                                                                     ●
                                                                                                                     ●          ●● ● ●●  ●
                                                                                                                                             ●●
[Figure 5 (scatter plots omitted): Distribution of execution times of parallel tasks of the Montage 0.25 workflow; panels: mBackground, mDiffFit, mProjectPP.]

   In the case of the Montage application, the users need to prepare application binaries in a portable format. Since Montage is distributed in source¹ format, it can be compiled and statically linked with all libraries, making it portable to any Linux distribution.
   The size of the Montage binaries is 50 MB in total, and 20 MB in a compressed format, which is used for deployment. We consider this deployment size practical in most cases. We should note that deployment of the function is performed only once, prior to workflow execution. Of course, when the execution environment needs to instantiate the function or create multiple instances for scale-out scenarios, the size of each instance may affect performance, so users should try to minimize the volume of the deployment package. It is also worth noting that such binary distributions are usually more compact than the full images of virtual machines used in traditional IaaS clouds. Unfortunately, if the source distribution or a portable binary is not available, then it may not be possible to deploy the application as a cloud function. One useful option would be to allow deployment of container-based images, such as Docker images, but this is currently not supported.

¹ http://montage.ipac.caltech.edu/
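As an illustration of the packaging step, the following sketch (Python, with illustrative paths and binary names; it assumes the Montage tools have already been compiled and statically linked) bundles the executables into a compressed archive and reports its size, which should stay well below the platform's deployment limits:

```python
# Hypothetical packaging helper: bundle statically linked Montage binaries
# into a zip archive that is deployed together with the executor function.
# Paths and binary names are illustrative only.
import os
import zipfile

BINARIES = ["mProjectPP", "mDiffFit", "mBackground", "mImgtbl", "mAdd"]

def package_binaries(bin_dir: str, archive_path: str) -> int:
    """Compress the selected binaries and return the archive size in bytes."""
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in BINARIES:
            zf.write(os.path.join(bin_dir, name), arcname=name)
    return os.path.getsize(archive_path)

if __name__ == "__main__":
    size = package_binaries("montage/bin", "montage-deploy.zip")
    print(f"Deployment package size: {size / 2**20:.1f} MB")
```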
5.4 Variability
   Variability is an important metric of cloud infrastructures, since distribution and resource sharing often hamper consistent performance. To measure the variability of GCF while executing scientific workflows, we collected the durations of parallel task executions in the Montage (0.25 degree) workflow – specifically, mBackground, mDiffFit and mProjectPP – running 10 workflow instances over a period of one day.
   Results are shown in Fig. 5. We can see that the distribution of task durations is moderately wide, with an inter-quartile range of about 1 second. The distribution is skewed towards longer execution times, up to 7 seconds, while the median is about 4 seconds. Importantly, we do not observe any significant outliers. We have to note that the execution times of the tasks themselves vary (they are not identical) and that each task duration includes data transfer to/from cloud storage. Taking this into account, we can conclude that the execution environment behaves consistently in terms of performance, since the observed variation is rather low. Further studies and long-term monitoring would be required to determine whether such consistency is preserved over time.
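For reference, the summary statistics quoted above (median and inter-quartile range) can be computed from the collected task durations with a few lines of Python; the duration values below are purely illustrative, as the real input would be parsed from the HyperFlow execution logs:

```python
# Illustrative computation of the variability statistics reported above:
# median and inter-quartile range (IQR) of task durations in seconds.
import statistics

def summarize(durations):
    """Return (median, iqr) for a list of task durations in seconds."""
    q1, q2, q3 = statistics.quantiles(durations, n=4)
    return q2, q3 - q1

durations = [3.6, 3.9, 4.1, 4.2, 4.4, 4.8, 5.1, 6.9]  # example values only
median, iqr = summarize(durations)
print(f"median = {median:.1f} s, IQR = {iqr:.1f} s")
```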
6. DISCUSSION
   The experiments conducted with our prototype implementation confirm the feasibility of our approach to execution of scientific workflows in serverless infrastructures. There are, however, some limitations that need to be emphasized here, as well as some interesting implications for resource management of scientific workflows in such infrastructures.

6.1 Granularity of tasks
   Granularity of tasks is a crucial issue which determines whether a given application is well suited for processing using serverless infrastructures. It is obvious that for long-running HPC applications a dedicated supercomputer is a better option. On the other hand, for high-throughput computing workloads, distributed infrastructures such as grids and clouds have proven useful. Serverless infrastructures can be considered similar to these high-throughput infrastructures, but they usually impose shorter limits on task execution time (300 seconds in the case of AWS Lambda, a 60-second default timeout for GCF). While these limits may vary, may be configurable or may change over time, we must assume that each infrastructure will always impose some kind of limit, which will constrain the types of supported workflows to those consisting of relatively fine-grained tasks. Many high-throughput workflows can fit into these constraints, but for the rest, other solutions should be developed, such as hybrid approaches.
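A simple way to check whether a given workflow fits such constraints is to compare per-task runtime estimates against the limit of the target platform. The sketch below is illustrative only; the limit values follow the figures cited above and the task estimates are hypothetical:

```python
# Illustrative feasibility check: flag workflow tasks whose estimated
# runtime exceeds the execution time limit of the target platform.
LIMITS_S = {"aws_lambda": 300, "gcf_default": 60}

def unsuitable_tasks(task_estimates, platform="gcf_default"):
    """task_estimates: dict mapping task name -> estimated runtime in seconds."""
    limit = LIMITS_S[platform]
    return [name for name, t in task_estimates.items() if t > limit]

# hypothetical estimates for a Montage-like workflow
estimates = {"mProjectPP": 15, "mDiffFit": 5, "mBackground": 4, "mAdd": 90}
print(unsuitable_tasks(estimates))  # -> ['mAdd']
```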







6.2 Hybrid solutions
   In addition to purely serverless solutions, one can propose hybrid approaches, such as the one outlined in Section 3. The presented bridge model is a typical hybrid solution which combines traditional VMs with cloud functions for lightweight tasks. This architecture can overcome the limitations of cloud functions, such as the need to create custom binaries or execution time limits.
   The hybrid approach can also be used to minimize costs and optimize throughput. Such optimization should be based on a cost analysis of leasing a VM versus calling a cloud function, assuming that a longer-term lease of resources typically corresponds to lower unit costs. This idea is generally applicable to hybrid cloud solutions [21]. For example, it may be more economical to lease VMs for long-running sequential parts of the workflow and trigger cloud functions for parallel stages, where spawning VMs that are billed on an hourly basis would be more costly. It may also prove interesting to combine cloud functions with spot instances or burstable [11] instances, which are cheaper but have varying performance and reliability characteristics.
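To make this trade-off concrete, the sketch below compares the cost of running a parallel stage of n short tasks on hourly-billed VMs against per-invocation cloud functions billed per GB-second; all prices, memory sizes and durations are illustrative placeholders rather than actual provider rates:

```python
# Illustrative cost model for a parallel stage of n short tasks:
# hourly-billed VMs vs. cloud functions billed per GB-second.
# All prices below are placeholders, not actual provider rates.
import math

def vm_cost(n_tasks, task_s, vms, price_per_hour):
    """Tasks are spread over `vms` machines; billing is per started hour."""
    wall_clock_s = math.ceil(n_tasks / vms) * task_s
    return vms * math.ceil(wall_clock_s / 3600) * price_per_hour

def function_cost(n_tasks, task_s, price_per_gb_s, mem_gb=1.0):
    """Each task is one function invocation billed for its duration and memory."""
    return n_tasks * task_s * mem_gb * price_per_gb_s

n, t = 1000, 5  # e.g. a 1000-task mProjectPP-like stage, ~5 s per task
print("VMs:      ", vm_cost(n, t, vms=4, price_per_hour=0.10))
print("Functions:", function_cost(n, t, price_per_gb_s=0.0000165))
```

Under such assumptions the per-invocation model favours short, highly parallel stages, while long sequential phases favour a leased VM, which is exactly the hybrid split suggested above.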
   The hybrid approach can also help resolve issues caused by the statelessness and transience of cloud functions, where no local data is preserved between function calls. By adding a traditional VM as one of the executor units, data transfers can be significantly reduced for tasks that need to access the same set of data multiple times.

6.3 Resource management and autoscaling
   The core idea behind serverless infrastructures is that they free the users from having to manage the server – and this also extends to clusters of servers. Decisions concerning resource management and autoscaling are thus made by the platform based on the current workload, history, etc. This is useful for typical Web or mobile applications that have interactive usage patterns and whose workload depends on user behavior. With regard to scientific workflows, which have a well-defined structure, there is ongoing research on scheduling algorithms for clusters, grids and clouds. The goal of these algorithms is to optimize such criteria as time or cost of workflow execution, assuming that the user has some control over the infrastructure. In the case of serverless infrastructures the user does not have any control over the execution environment. The providers would need to change this policy by adding more control or the ability to specify user preferences regarding performance.
   For example, users could specify priorities when deploying cloud functions, where a higher priority would mean faster response time, quicker autoscaling, etc., but at an additional price. Lower-priority functions could have longer execution times, possibly relying on resource scavenging, but at a lower cost. Another option would be to allow users to provide hints regarding expected execution times or anticipated parallelism levels, as sketched below. Such information could be useful for internal resource managers to better optimize the execution environment and prepare for demand spikes, e.g. when many parallel tasks are launched by a workflow.
   Adding support for cooperation between the application and the internal resource manager of the cloud platform would open an interesting area for research and optimization of applications and infrastructures, from which both users and providers could potentially benefit.
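As a thought experiment, such preferences could be attached to the function deployment descriptor. The snippet below shows what these hints might look like; it is purely hypothetical, as no current serverless platform exposes such fields:

```python
# Purely hypothetical deployment descriptor illustrating the kind of
# scheduling hints discussed above; no existing serverless platform
# currently accepts these fields.
deployment_hints = {
    "function": "hyperflow-executor",
    "priority": "low",            # lower price, relaxed latency and autoscaling
    "expected_runtime_s": 10,     # typical duration of a single task
    "expected_parallelism": 500,  # anticipated burst of parallel invocations
    "deadline_s": 3600,           # soft deadline for the whole workflow stage
}
```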
7. RELATED WORK
   Although scientific workflows in clouds have been widely studied, the research focus is typically on IaaS, and there is little related work regarding serverless or other alternative types of infrastructures.
   An example of using AWS Lambda for analyzing genomics data comes from the AWS blog [5]. The authors show how to use R, AWS Lambda and the AWS API Gateway to process a large number of tasks. Their use case is to compute some statistics for every gene in the genome, which gives about 20,000 tasks in an embarrassingly parallel problem. This work is similar to ours, but our approach is more general, since we show how to implement generic support for scientific workflows.
   A detailed performance and cost comparison of traditional clouds with microservices and the AWS Lambda serverless architecture is presented in [19]. An enterprise application was benchmarked and the results show that serverless infrastructures can introduce significant savings without impacting performance. Similarly, in [20] the authors discuss the advantages of using cloud services and AWS Lambda for systems that require higher resilience. They show how serverless infrastructures can reduce costs in comparison to traditional IaaS resources and the spot market. Although these use cases are different from our scientific scenario, we believe that serverless infrastructures offer an interesting option for scientific workflows.
   An interesting general discussion on the economics of hybrid clouds is presented in [21]. The author shows that even when a private cloud is strictly cheaper (per unit) than public clouds, a hybrid solution can result in a lower overall cost in the case of a variable workload. We expect that a similar effect can be observed for a hybrid solution combining traditional and serverless infrastructures for scientific applications, which often have a wide range of task granularities.
   Regarding the use of alternative cloud solutions for scientific applications, there is work on the evaluation of Google App Engine for scientific applications [16, 14]. Google App Engine is a Platform-as-a-Service cloud, designed mostly for Web applications, but with additional support for processing of background tasks. App Engine can be used for running parameter-study high-throughput computing workloads, and there are similar task processing time limits as in the case of serverless infrastructures. The difference is that the execution environment is more constrained, e.g. only one application framework is allowed (such as Java or Python) and there is no support for native code or access to local disk. For these reasons, we consider cloud functions such as AWS Lambda or Google Cloud Functions a more interesting option for scientific applications.
   The concept of cloud functions can be considered an evolution of earlier remote procedure call concepts, such as GridRPC [17], proposed and standardized for Grid computing. The difference between these solutions and current cloud functions is that the latter are supported by commercial cloud providers with emphasis on ease of use and development productivity. Moreover, the granularity of tasks processed by current cloud functions tends to be finer, so we need to follow the development of these technologies to further assess their applicability to scientific workflows.
   A recently developed approach to decentralized workflow execution in clouds is represented by Flowbster [10], which also aims at serverless infrastructures. We can expect that more such solutions will emerge in the near future.
   The architectural concepts of scientific workflows are discussed in the context of component and service architectures [7]. Cloud functions can be considered a specific class of services or components, which are stateless and can be deployed in cloud infrastructures. They do not impose
any rules of composition, giving more freedom to developers. The most important distinction is that they are backed by the cloud infrastructure, which is responsible for automatic resource provisioning and scaling.
   The architectures of cloud workflow systems are also discussed in [12]. We believe that such architectures need to be re-examined as new serverless infrastructures become more widespread.
   Based on the discussion of related work we conclude that our paper is likely the first attempt to use serverless clouds for scientific workflows, and we expect that more research in this area will be needed as platforms become more mature.

8. SUMMARY AND FUTURE WORK
   In this paper we have presented our approach to combining scientific workflows with the emerging serverless clouds. We believe that such infrastructures, based on the concept of cloud functions and exemplified by AWS Lambda or Google Cloud Functions, provide an interesting alternative not only for typical enterprise applications, but also for scientific workflows. We have discussed several options for designing serverless workflow execution architectures, including queue-based, direct executor, hybrid (bridged) and decentralized ones.
   To evaluate the feasibility of our approach we implemented a prototype based on the HyperFlow engine and Google Cloud Functions, and evaluated it with the real-world Montage application. Experiments with small-scale workflows consisting of 43 and 107 tasks confirm that the GCF platform can be successfully used and that it does not introduce significant delays. We have to note that the application needs to be prepared in a portable way to facilitate execution on such an infrastructure, and that this may be an issue for more complex scientific software packages.
   Our paper also presents some implications of serverless infrastructures for resource management of scientific workflows. First, we observed that not all workloads are suitable due to execution time limits, e.g. 5 minutes in the case of AWS Lambda – accordingly, the granularity of tasks has to be taken into account. Next, we discussed how hybrid solutions combining serverless and traditional infrastructures can help optimize the performance and cost of scientific workflows. We also suggest that adding more control or the ability to provide priorities or hints to cloud platforms could benefit both providers and users in terms of optimizing performance and cost.
   Since this is a fairly new topic, we see many options for future work. Further work on the development and evaluation of various serverless architectures for scientific workflows is needed, with the decentralized option regarded as the greatest challenge. A more detailed performance evaluation of different classes of applications on various emerging infrastructures would also prove useful to better understand the possibilities and limitations of this approach. Finally, interesting research can be conducted in the field of resource management for scientific workflows, to design strategies and algorithms for optimizing the time or cost of workflow execution in the emerging serverless clouds.

Acknowledgments
   This work is partially supported by the National Centre for Research and Development (NCBiR), Poland, project PBS1/B9/18/2013. AGH grant no. 11.11.230.124 is also acknowledged. The author would like to thank the Google Cloud Functions team for the opportunity to use the alpha version of their service.

9. REFERENCES
 [1] AWS Lambda - Serverless Compute, 2016. https://aws.amazon.com/lambda/.
 [2] Cloud Functions - Serverless Microservices | Google Cloud Platform, 2016. https://cloud.google.com/functions/.
 [3] B. Balis. HyperFlow: A model of computation, programming approach and enactment engine for complex distributed workflows. Future Generation Computer Systems, 55:147–162, Sep 2016.
 [4] B. Balis, K. Figiela, M. Malawski, M. Pawlik, and M. Bubak. A Lightweight Approach for Deployment of Scientific Workflows in Cloud Infrastructures. In R. Wyrzykowski, E. Deelman, J. Dongarra, K. Karczewski, J. Kitowski, and K. Wiatr, editors, Parallel Processing and Applied Mathematics: 11th International Conference, PPAM 2015, Krakow, Poland, September 6-9, 2015. Revised Selected Papers, Part I, pages 281–290, Cham, 2016. Springer International Publishing.
 [5] Bryan Liston. Analyzing Genomics Data at Scale using R, AWS Lambda, and Amazon API Gateway | AWS Compute Blog, 2016. http://tinyurl.com/h7vyboo.
 [6] E. Deelman, K. Vahi, G. Juve, M. Rynge, S. Callaghan, P. J. Maechling, R. Mayani, W. Chen, R. Ferreira da Silva, M. Livny, and K. Wenger. Pegasus, a workflow management system for science automation. Future Generation Computer Systems, 46:17–35, May 2015.
 [7] D. Gannon. Component Architectures and Services: From Application Construction to Scientific Workflows, pages 174–189. Springer London, London, 2007.
 [8] J. C. Jacob, D. S. Katz, G. B. Berriman, J. C. Good, A. Laity, E. Deelman, C. Kesselman, G. Singh, M.-H. Su, T. Prince, and others. Montage: a grid portal and software toolkit for science-grade astronomical image mosaicking. International Journal of Computational Science and Engineering, 4(2):73–87, 2009.
 [9] G. Juve, E. Deelman, K. Vahi, G. Mehta, B. Berriman, B. P. Berman, and P. Maechling. Data Sharing Options for Scientific Workflows on Amazon EC2. In SC '10 Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC '10, pages 1–9. IEEE Computer Society, 2010.
[10] P. Kacsuk, J. Kovacs, and Z. Farkas. Flowbster: Dynamic creation of data pipelines in clouds. In Digital Infrastructures for Research event, Krakow, Poland, 28-30 September 2016, 2016.
[11] P. Leitner and J. Scheuner. Bursting with Possibilities – An Empirical Study of Credit-Based Bursting Cloud Instance Types, Dec 2015.
[12] X. Liu, D. Yuan, G. Zhang, W. Li, D. Cao, Q. He, J. Chen, and Y. Yang. The Design of Cloud Workflow Systems. Springer New York, New York, NY, 2012.
[13] M. Malawski, G. Juve, E. Deelman, and J. Nabrzyski. Algorithms for cost- and deadline-constrained provisioning for scientific workflow ensembles in IaaS clouds. Future Generation Computer Systems, 48:1–18, 2015.
[14] M. Malawski, M. Kuzniar, P. Wojcik, and M. Bubak. How to Use Google App Engine for Free Computing. IEEE Internet Computing, 17(1):50–59, Jan 2013.
[15] M. Mao and M. Humphrey. Auto-scaling to minimize cost and meet application deadlines in cloud workflows. In SC '11 Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, Seattle, Washington, 2011. ACM.
[16] R. Prodan, M. Sperk, and S. Ostermann. Evaluating High-Performance Computing on Google App Engine. IEEE Software, 29(2):52–58, Mar 2012.
[17] K. Seymour, H. Nakada, S. Matsuoka, J. Dongarra, C. Lee, and H. Casanova. Overview of GridRPC: A remote procedure call API for grid computing. In International Workshop on Grid Computing, pages 274–278. Springer, 2002.
[18] D. Thain, T. Tannenbaum, and M. Livny. Distributed computing in practice: the Condor experience. Concurrency and Computation: Practice and Experience, 17(2-4):323–356, 2005.
[19] M. Villamizar, O. Garces, L. Ochoa, H. Castro, L. Salamanca, M. Verano, R. Casallas, S. Gil, C. Valencia, A. Zambrano, and M. Lang. Infrastructure cost comparison of running web applications in the cloud using AWS Lambda and monolithic and microservice architectures. In 2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), pages 179–182, May 2016.
[20] B. Wagner and A. Sood. Economics of Resilient Cloud Services. In 1st IEEE International Workshop on Cyber Resilience Economics, Aug 2016.
[21] J. Weinman. Hybrid Cloud Economics. IEEE Cloud Computing, 3(1):18–22, Jan 2016.



