=Paper=
{{Paper
|id=Vol-1800/paper4
|storemode=property
|title=Towards Serverless Execution of Scientific Workflows - HyperFlow Case Study
|pdfUrl=https://ceur-ws.org/Vol-1800/paper4.pdf
|volume=Vol-1800
|authors=Maciej Malawski
|dblpUrl=https://dblp.org/rec/conf/sc/Malawski16
}}
==Towards Serverless Execution of Scientific Workflows - HyperFlow Case Study==
Maciej Malawski
AGH University of Science and Technology, Department of Computer Science, Krakow, Poland
malawski@agh.edu.pl

ABSTRACT

Scientific workflows consisting of a high number of dependent tasks represent an important class of complex scientific applications. Recently, a new type of serverless infrastructure has emerged, represented by such services as Google Cloud Functions or AWS Lambda. In this paper we take a look at such serverless infrastructures, which are designed mainly for processing background tasks of Web applications. We evaluate their applicability to more compute- and data-intensive scientific workflows and discuss possible ways to repurpose serverless architectures for execution of scientific workflows. A prototype workflow executor function has been developed using Google Cloud Functions and coupled with the HyperFlow workflow engine. The function can run workflow tasks on the Google infrastructure, and features such capabilities as data staging to/from Google Cloud Storage and execution of custom application binaries. We have successfully deployed and executed the Montage astronomic workflow, often used as a benchmark, and we report on initial results of performance evaluation. Our findings indicate that the simple mode of operation makes this approach easy to use, although there are costs involved in preparing portable application binaries for execution in a remote environment. While our evaluation uses a preproduction (alpha) version of the Google Cloud Functions platform, we find the presented approach highly promising. We also discuss possible future steps related to execution of scientific workflows in serverless infrastructures, and the implications with regard to resource management for scientific applications in general.

Keywords

Scientific workflows, cloud, serverless infrastructures

1. INTRODUCTION

Scientific workflows consisting of a high number of dependent tasks represent an important class of complex scientific applications that have been successfully deployed and executed in traditional cloud infrastructures, including Infrastructure as a Service (IaaS) clouds. Recently, a new type of serverless infrastructure has emerged, represented by such services as Google Cloud Functions (GCF) [2] or AWS Lambda [1]. These services allow deployment of software in the form of functions that are executed in the provider's infrastructure in response to specific events, such as new files being uploaded to a cloud data store, messages arriving in queue systems, or direct HTTP calls. This approach frees the user from having to maintain a server, including configuration and management of virtual machines, while resource management is provided by the platform in an automated and scalable way.

In this paper we take a look at such serverless infrastructures. Although designed mainly for processing background tasks of Web applications, we nevertheless investigate whether they can be applied to more compute- and data-intensive scientific workflows. The main objectives of this paper are as follows:

• To present the main features of serverless infrastructures, comparing them to traditional infrastructure-as-a-service clouds,
• To discuss the options of using serverless infrastructures for execution of scientific workflows,
• To present our experience with a prototype implemented using the HyperFlow [3] workflow engine and Google Cloud Functions (alpha version),
• To evaluate our approach using the Montage workflow [8], a real-world astronomic application,
• To discuss the costs and benefits of this approach, together with its implications for resource management of scientific workflows in emerging infrastructures.

The paper is organized as follows. We begin with an overview of serverless infrastructures in Section 2. In Section 3 we propose and discuss alternative options for serverless architectures of scientific workflow systems. Our prototype implementation, based on HyperFlow and GCF, is described in Section 4. This is followed by evaluation using the Montage application, presented in Section 5. We discuss implications for resource management in Section 6 and present related work in Section 7. Section 8 provides a summary and description of future work.
2. OVERVIEW OF SERVERLESS CLOUDS

Writing "serverless" applications is a recent trend, mainly addressing Web applications. It frees programmers from having to maintain a server – instead, they can use a set of existing cloud services directly from their application. Examples of such services include cloud databases such as Firebase or DynamoDB, messaging systems such as Google Cloud Pub/Sub, notification services such as Amazon SNS, and so on. When there is a need to execute custom application code in the background, special "cloud functions" (hereafter simply referred to as functions) can be called. Examples of such functions are AWS Lambda and Google Cloud Functions (GCF).

Both Lambda and GCF are based on the functional programming paradigm: a function is a piece of software that can be deployed on the provider's cloud infrastructure and performs a single operation in response to an external event. Functions can be triggered by:

• an event generated by the cloud infrastructure, e.g. a change in a cloud database, a file being uploaded to a cloud object store, a new item appearing in a messaging system, or an action scheduled at a specified time,
• a direct request from the application via HTTP or cloud API calls.
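To make the programming model concrete, the following minimal sketch shows what an HTTP-triggered function can look like in Node.js. Express-style request and response objects are assumed and the function name is illustrative; the exact signatures of the alpha-version GCF API may differ.

<pre>
// Minimal sketch of an HTTP-triggered cloud function (illustrative).
// Express-style (req, res) objects are assumed; the alpha GCF API may differ.
exports.executeTask = function (req, res) {
  var task = req.body || {};        // JSON task description sent by the caller
  // Perform one short-lived operation in response to the request...
  var result = { status: 'done', input: task };
  // ...and return the result; no server or VM is managed by the user.
  res.status(200).json(result);
};
</pre>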
The cloud infrastructure which hosts the functions is responsible for automatic provisioning of resources (including CPU, memory, network and temporary storage), automatic scaling when the number of function executions varies over time, as well as monitoring and logging. The user is responsible for providing executable code in a format required by the framework. Typically, the execution environment is limited to a set of supported languages: Node.js, Java and Python in the case of AWS Lambda, and Node.js in the case of GCF. The user has no control over the execution environment, such as the underlying operating system or the versions of the runtime libraries, but can use custom libraries with package managers and even upload binary code to be executed.

Functions are thus different from virtual machines in IaaS clouds, where users have full control over the OS (including root access) and can customize the execution environment to their needs. On the other hand, functions free developers from the need to configure, maintain and manage server resources.

Cloud providers impose certain limits on the amount of resources a function can consume. In the case of AWS Lambda these limits are as follows: temporary disk space: 512 MB, number of processes and threads: 1024, maximum execution duration per request: 300 seconds. There is also a limit of 100 concurrent executions per region, but this limit can be increased on request. GCF, in its alpha version, does not specify limit thresholds. There is, however, a timeout parameter that can be provided when deploying a function; its default value is 60 seconds.

Functions also differ from permanent and stateful services, since they are not long-running processes, but rather serve individual tasks. These resource limits indicate that cloud functions are not currently suitable for large-scale HPC applications, but they can be useful for high-throughput computing workflows consisting of many fine-grained tasks.

Functions have a fine-grained pricing model associated with them. In the case of AWS Lambda, the price is $0.20 per 1 million requests and $0.00001667 for every GB-second used, defined as the amount of memory used multiplied by the execution time. There are also additional charges for data transfer and storage (when DynamoDB or S3 is used). The alpha version of Google Cloud Functions does not have a public pricing policy.

Serverless infrastructures can be cost-effective compared to standard VMs. For example, the aggregate cost of running AWS Lambda functions with 1 GB of memory for one hour is $0.060012. This is more expensive than the t2.micro instance, which also has 1 GB of RAM but costs $0.013 per hour. A t2.micro instance, however, offers only burstable performance, which means that only a fraction of CPU time per hour is available. The smallest standard instance at AWS is m3.medium, which costs $0.067 per hour but provides 3.75 GB of RAM. Cloud functions are thus more suitable for variable load conditions, while standard instances can be more economical for applications with stable workloads.
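The per-hour figure quoted above follows directly from the per-GB-second price; the short calculation below reproduces it (the $0.20 per million requests charge is omitted).

<pre>
// Reproducing the quoted Lambda cost of a 1 GB function running for 1 hour.
// The $0.20 per million requests charge is omitted here.
var pricePerGBSecond = 0.00001667; // USD per GB-second
var memoryGB = 1;                  // configured function memory
var seconds = 3600;                // one hour of aggregate execution time
var cost = pricePerGBSecond * memoryGB * seconds;
console.log('cost: $' + cost.toFixed(6)); // -> cost: $0.060012
</pre>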
3. OPTIONS FOR EXECUTION OF SCIENTIFIC WORKFLOWS IN SERVERLESS INFRASTRUCTURES

In light of the identified features and limitations of serverless infrastructures and cloud functions, we can discuss the options of using them for execution of scientific workflows. We will start with a traditional execution model in an IaaS cloud with no cloud functions (1), then present the queue model (2), direct executor model (3), bridge model (4), and decentralized model (5). These options are schematically depicted in Fig. 1 and discussed in detail further on.

[Figure 1: Options of serverless architectures for execution of scientific workflows. Panels: traditional model, queue model, direct executor model, bridge model and decentralized model, each combining the engine, queue, workers, cloud functions and storage in a different configuration.]

3.1 Traditional model

The traditional model assumes that the workflow runs in a standard IaaS cloud. In this model the workflow execution follows the well-known master-worker architecture: the master node runs a workflow engine, tasks that are ready for execution are submitted to a queue, and worker nodes process these tasks in parallel when possible. The master node can be deployed in the cloud or outside of it, while worker nodes are usually deployed as VMs in a cloud infrastructure. The worker pool is typically created on demand and can be dynamically scaled up or down depending on resource requirements.

Such a model is represented, e.g., by Pegasus and HyperFlow. The Pegasus Workflow Management System [6] uses HTCondor [18] to maintain its queue and manage workers. HyperFlow [3] is a lightweight workflow engine based on Node.js – it uses RabbitMQ as its queue and AMQP Executors on worker nodes. The deployment options of HyperFlow on grids and clouds are discussed in detail in [4].

In this model the user is responsible for management of the resources comprising the worker pool. The pool can be provisioned statically, which is commonly done in practice, but there is also ongoing research on automatic or dynamic resource provisioning for workflow applications [13, 15], which is a non-trivial task.

In the traditional cloud workflow processing model there is a need for some storage service to store input, output and temporary data. There are multiple options for data sharing [9], but one of the most widely used approaches is to rely on existing cloud storage, such as Amazon S3 or Google Cloud Storage. This option has the advantage of providing a permanent store, so that data is not lost after the workflow execution is complete and the VMs are terminated.

3.2 Queue model

This model is similar to the traditional model: the master node and the queue remain unchanged, but the worker is replaced by a cloud function. Instead of running a pool of VMs with workers, a set of cloud functions is spawned. Each function fetches a task from the queue and processes it, returning results via the queue.

The main advantage of this model is its simplicity, since it only requires changes in the worker module. This may be simple if the queue uses a standard protocol, such as AMQP in the case of the HyperFlow Executor, but in the case of Pegasus and HTCondor a Condor daemon (condor_startd) must run on the worker node and communicate using a proprietary Condor protocol. In this scenario, implementing a worker as a cloud function would require more effort.

Another advantage of the presented model is the ability to combine workers implemented as functions with other workers running, e.g., in a local cluster or in a traditional cloud. This would also enable concurrent usage of cloud functions from multiple providers (e.g. AWS and Google) when such a multi-cloud scenario is required.

An important issue associated with the queue model is how to trigger the execution of the functions. If a native implementation of the queue is used (e.g. RabbitMQ as in HyperFlow), it is necessary to trigger a function for each task added to the queue. This can be done by the workflow engine or by a dedicated queue monitoring service. Other options include periodic function execution or recursive execution: a function can itself trigger other functions once it finishes processing data.

To ensure a clean serverless architecture, another option is to implement the queue using a native cloud service which is already integrated with cloud functions. In the case of AWS Lambda one could implement the queue using DynamoDB: here, a function could be triggered by adding a new item to a task table. In the case of GCF, the Google Cloud Pub/Sub service can be used for the same purpose. Such a solution, however, would require more changes in the workflow engine and would not be easy to deploy in multi-cloud scenarios. A sketch of the worker side of this model is shown below.
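As a rough illustration of the worker side of the queue model, the sketch below shows a function that consumes one task per invocation from a Pub/Sub-style queue. The base64 message encoding, the trigger signature and the two helper functions are assumptions, not part of the prototype described later.

<pre>
// Sketch of the queue model: each invocation consumes exactly one task.
// The Pub/Sub-style base64 payload and the helpers below are assumptions.
function runTask(task, cb) {
  // Placeholder for real work (e.g. spawning an application binary).
  cb(null, { taskId: task.id, status: 'done' });
}
function publishResult(result, cb) {
  // Placeholder: a real deployment would publish to a results topic/queue.
  console.log('result:', JSON.stringify(result));
  cb();
}
exports.queueWorker = function (event, callback) {
  // Pub/Sub-style triggers deliver the message body base64-encoded.
  var task = JSON.parse(Buffer.from(event.data, 'base64').toString());
  runTask(task, function (err, result) {
    if (err) { return callback(err); }
    publishResult(result, callback);   // master stays decoupled via the queue
  });
};
</pre>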
3.3 Direct executor model

This is the simplest model and requires only a workflow engine and a cloud function that serves as a task executor. It eliminates the need for a queue, since the workflow engine can trigger the cloud function directly via API/HTTP calls. Regarding development effort, it requires changes in the master and a new implementation of the worker.

An advantage of this model is its cleanness and simplicity, but these come at the cost of tight master-worker coupling. Accordingly, it becomes more difficult to implement the multi-cloud scenario, since the workflow engine would need to be able to dispatch tasks to multiple cloud function providers.

3.4 Bridge model

This solution is more complex, but it preserves the decoupling of the master from the worker by using a queue. In this case the master and the queue remain unchanged, but a new type of bridge worker is added. It fetches tasks from the queue and dispatches them to the cloud functions. Such a worker needs to run as a separate service (daemon) and can trigger cloud functions using the provider-specific API.

The decoupling of the master from the worker allows for more complex and flexible scenarios, including multi-cloud deployments. A set of bridge workers can be spawned, each dispatching tasks to a different cloud function provider. Moreover, a pool of workers running on external distributed platforms, such as third-party clouds or clusters, can be used together with cloud functions.

3.5 Decentralized model

This model re-implements the whole workflow engine in a distributed way using cloud functions. Each task of a workflow is processed by a separate function. These functions can be triggered by (a) new data items uploaded to cloud storage, or (b) other cloud functions, i.e. predecessor tasks triggering their successor tasks upon completion. Option (a) can be used to represent data dependencies in a workflow, while option (b) can be used to represent control dependencies; a sketch of option (b) is given at the end of this subsection.

In the decentralized model the structure and state of workflow execution have to be preserved in the system. The system can be implemented in a fully distributed way, by deploying a unique function for each task in the workflow. In this way, the workflow structure is mapped to a set of functions and the execution state propagates by functions being triggered by their predecessors. Another option is to deploy a generic task executor function and maintain the workflow state in a database, possibly one provided as a cloud service.

The advantages of the decentralized approach include fully distributed and serverless execution, without the need to maintain a workflow engine. The required development effort is extensive, however, since it requires re-implementation of the whole workflow engine. A detailed design of such an engine is out of the scope of this paper, but remains an interesting subject of future research.
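The following rough sketch illustrates option (b): each deployed task function, after finishing its own work, POSTs a message to the functions of its successor tasks. The successor map, the endpoint host and the handler signature are all hypothetical.

<pre>
// Decentralized model, option (b): a finished task triggers its successors.
// The successor map, host name and handler signature are illustrative.
var https = require('https');

// Hypothetical mapping from a task to its dependent tasks.
var successors = {
  mProjectPP_1: ['mDiffFit_1', 'mDiffFit_2']
};

function triggerTask(name) {
  // Each task is deployed as its own function behind an HTTPS endpoint.
  var req = https.request({
    hostname: 'functions.example.net',   // placeholder host
    path: '/' + name,
    method: 'POST',
    headers: { 'Content-Type': 'application/json' }
  });
  req.on('error', function (err) { console.error('trigger failed:', err); });
  req.end(JSON.stringify({ task: name }));
}

exports.taskFunction = function (req, res) {
  var name = (req.body || {}).task;
  // ...process this task's own inputs here...
  (successors[name] || []).forEach(triggerTask);  // fire successor tasks
  res.status(200).send('done');
};
</pre>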
3.6 Summary of options

As we can see, cloud functions provide multiple integration options for scientific workflow engines. Users need to decide which option is best for them based on their requirements, most notably the acceptable level of coupling between the workflow engine and the infrastructure, and the need to run hybrid or cross-cloud deployments where resources from more than one provider are used in parallel. We consider the fully decentralized option an interesting future research direction, while in the following sections we focus on our experience with a prototype implemented using the direct executor model.

4. PROTOTYPE BASED ON HYPERFLOW

To evaluate the feasibility of our approach, we developed a prototype using the HyperFlow engine and Google Cloud Functions, applying the direct executor model. This decision has several reasons. First, HyperFlow is implemented in Node.js, while GCF supports Node.js as a native function execution environment. This good match simplifies development and debugging, which is always non-trivial in a distributed environment. Second, our selection of the direct executor model was motivated by the extensible design of HyperFlow, which can associate with each task in a workflow a specific executor function responsible for handling command-line tasks. Since GCF provides a direct triggering mechanism for cloud functions using HTTP calls, we can apply existing HTTP client libraries for Node.js, plugging support for GCF into HyperFlow as a natural extension.

4.1 Architecture and components

The schematic diagram of the prototype is shown in Fig. 2.

[Figure 2: Architecture of the prototype based on HyperFlow and Google Cloud Functions. The HyperFlow engine with its GCFCommand function invokes the GCF Executor (Worker and Storage Client components) on the Cloud Functions platform, which exchanges data with Cloud Storage.]

The HyperFlow engine is extended with the GCFCommand function, which is responsible for communication with GCF. It is a replacement for the AMQPCommand function used in the standard HyperFlow distributed deployment with the AMQP protocol and RabbitMQ. GCFCommand sends the task description in a JSON-encoded message to the cloud function; a sketch of such a call is given at the end of this section.

The GCF Executor is the main cloud function, which needs to be deployed on the GCF platform. It processes the message and uses the Storage Client for staging in and out the input and output data. It uses Google Cloud Storage and requests parallel transfers to speed up download and upload of data. The GCF Executor then calls the executable, which needs to be deployed together with the function. GCF supports running one's own Linux-based custom binaries, but the user has to make sure that the binary is portable, e.g. by statically linking all of its dependencies. Our architecture is thus purely serverless, with the HyperFlow engine running on a client machine and relying directly only on cloud services such as GCF and Cloud Storage.
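The condensed sketch below illustrates what such a GCFCommand-style call can look like on the engine side. It follows HyperFlow's (ins, outs, config, cb) executor function convention, but the message fields, the request npm client and the endpoint configuration are illustrative rather than the exact prototype code.

<pre>
// Engine-side sketch: a GCFCommand-style function POSTs a JSON task
// description to the deployed GCF Executor. Field names are illustrative.
var request = require('request'); // generic HTTP client from npm

function gcfCommand(ins, outs, config, cb) {
  var message = {
    executable: config.executor.executable,             // e.g. 'mProjectPP'
    args: config.executor.args,                         // command-line arguments
    inputs: ins.map(function (i) { return i.name; }),   // staged in from Cloud Storage
    outputs: outs.map(function (o) { return o.name; }), // staged out after execution
    bucket: config.executor.bucket                      // per-run storage location
  };
  // POST the JSON-encoded task description to the cloud function endpoint.
  request.post({ url: config.executor.url, json: message },
    function (err, response, body) { cb(err, outs); });
}
</pre>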
4.2 Fault tolerance

Transient failures are a common risk in cloud environments. Since execution of a possibly large volume of concurrent HTTP requests in a distributed environment is always prone to errors caused by various layers of the network and middleware stacks, the execution engine needs to be able to handle such failures gracefully and attempt to retry failed requests.

In the case of HyperFlow, the Node.js ecosystem appears very helpful in this context. We used the requestretry library for implementing the HTTP client, which allows for automatic retry of failed requests with a configurable number of retries (default: 5) and delay between retries (default: 5 seconds). Our prototype uses these default settings, but in the future it will be possible to explore more advanced error handling policies taking into account error types and patterns.
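The retry behaviour described above takes only a few lines with the requestretry package. The sketch below shows the relevant options (the values given are the library defaults); the endpoint and the task message are illustrative.

<pre>
// Sketch of a fault-tolerant HTTP client using the requestretry package.
var request = require('requestretry');

request({
  url: 'https://functions.example.net/executor', // placeholder endpoint
  method: 'POST',
  json: { executable: 'mAdd', args: [] },        // illustrative task message
  maxAttempts: 5,    // library default: retry up to 5 times
  retryDelay: 5000,  // library default: wait 5 s between attempts
  // Retry on network errors and 5xx responses; other policies are possible.
  retryStrategy: request.RetryStrategies.HTTPOrNetworkError
}, function (err, response, body) {
  if (err) { console.error('task failed after retries:', err); }
  else { console.log('task completed:', body); }
});
</pre>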
5. EVALUATION USING MONTAGE WORKFLOW

Based on our prototype, which combines HyperFlow and Google Cloud Functions, we performed several experiments to evaluate our approach. The goals of the evaluation are as follows:

• To validate the feasibility of our approach, i.e. to determine whether it is practical to execute scientific workflows in serverless infrastructures.
• To measure the performance characteristics of the execution environment in order to provide hints for resource management.

Details regarding our sample application, experiment setup and results are provided below.

5.1 Montage workflow and experiment setup

Montage application. For our study we selected the Montage [8] application, which is an astronomic workflow. It is often used for various benchmarks and performance evaluation, since it is open-source and has been widely studied by the research community. The application processes a set of input images from astronomic sky surveys and constructs a single large-scale mosaic image. The structure of the workflow is shown in Fig. 3: it consists of several stages which include parallel processing sections, reduction operations and sequential processing.

[Figure 3: Structure of the Montage workflow used for experiments. Parallel mProjectPP, mDiffFit and mBackground stages are combined with the reduction tasks mConcatFit and mBgModel and the sequential tasks mImgtbl, mAdd, mShrink and mJPEG.]

The size of the workflow, i.e. the number of tasks, depends on the size of the area of the target image, which is measured in angular degrees. For example, a small-scale 0.25-degree Montage workflow consists of 43 tasks, with 10 parallel mProjectPP tasks and 17 mDiffFit tasks, while more complex workflows can involve thousands of tasks. In our experiments we used the Montage 0.25 workflow with 43 tasks, and the Montage 0.4 workflow with 107 tasks.

Experiment setup. We used a recent version of HyperFlow and the alpha version of Google Cloud Functions. The HyperFlow engine was installed on a client machine with Ubuntu 14.04 LTS Linux and Node.js 4.5.0. For staging the input and output data, as well as for temporary storage, we used a Google Cloud Storage bucket with standard options. Both Cloud Functions and Cloud Storage were located in the us-central1 region, while the client machine was located in Europe.

Data preparation and handling. To run the Montage workflow in our experiments, all input data needs to be uploaded to the cloud storage first. For each workflow run, a separate subfolder in the Cloud Storage bucket is created. The subfolder is then used for exchange of intermediate data and for storing the final results. Data can be conveniently uploaded using a command-line tool which supports parallel transfers. The web-based Google Cloud Console is useful for browsing results and displaying the resulting JPEG images.
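For orientation, the fragment below sketches how two of the Montage stages described above could be wired to a GCF-based executor in a HyperFlow-style workflow description. It is deliberately simplified and schematic: the actual HyperFlow workflow format contains additional fields, so this should not be read as a complete, valid workflow file.

<pre>
{
  "name": "Montage_0.25",
  "processes": [
    { "name": "mProjectPP", "function": "GCFCommand",
      "ins":  ["rawImage"],       "outs": ["projectedImage"],
      "config": { "executor": { "executable": "mProjectPP" } } },
    { "name": "mDiffFit",   "function": "GCFCommand",
      "ins":  ["projectedImage"], "outs": ["diffImage"],
      "config": { "executor": { "executable": "mDiffFit" } } }
  ],
  "signals": [
    { "name": "rawImage" }, { "name": "projectedImage" }, { "name": "diffImage" }
  ]
}
</pre>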
5.2 Feasibility

To assess the feasibility of our approach, we tested our prototype using the Montage 0.25 workflow. We collected task execution start and finish timestamps, which give the total duration of cloud function execution. This execution time also includes data transfers. Based on the collected execution traces we plotted Gantt charts. Altogether, several runs were performed and an example execution trace (representative of all runs) is shown in Fig. 4.

[Figure 4: Sample run of the Montage 0.25 workflow – a Gantt chart of task executions over a time axis of about 100 seconds.]

Montage 0.25 is a relatively small-scale workflow, but the resulting plot clearly reveals that the cloud function-based approach works well in this case. We can observe that the parallel tasks of the workflow (mProjectPP, mDiffFit and mBackground) are indeed short-running and can be processed in parallel. The user has no control over the level of parallelism, but the cloud platform is able to process tasks in a scalable way, as stated in the documentation. We observe only small delays between task executions, which can be attributed to the fact that the requests between the HyperFlow engine and the cloud functions are transmitted using HTTP over a wide-area network, including a trans-Atlantic connection.

Similar results were obtained for the Montage 0.4 workflow, which consists of 107 tasks; the corresponding detailed plots are not reproduced here for reasons of readability. It should be noted that while the parallel tasks of Montage are relatively fine-grained, the execution time of sequential processing tasks such as mImgtbl and mAdd grows with the size of the workflow and can exceed the default limit of 60 seconds imposed upon cloud function execution. This limit can be extended when deploying the cloud function, but there is currently no information regarding the maximum duration of such requests. We can only expect that such limits will increase as the platforms become more mature. This was indeed the case with Google App Engine, where the initial request limit was increased from 30 seconds to 10 minutes [14].
5.3 Deployment size and portability

Our current approach requires us to deploy the cloud function together with all the application binaries. The Google Cloud Functions execution environment enables inclusion of dependencies as Node.js libraries packaged using the Node Package Manager (NPM), which are automatically installed when the function is deployed. Moreover, the user can provide a set of JavaScript source files, configuration files and binary dependencies to be uploaded together with the function.

In the case of the Montage application, users need to prepare the application binaries in a portable format. Since Montage is distributed in source form (http://montage.ipac.caltech.edu/), it can be compiled and statically linked with all its libraries, making it portable to any Linux distribution.

The size of the Montage binaries is 50 MB in total, and 20 MB in the compressed format used for deployment. We consider this deployment size practical in most cases. We should note that deployment of the function is performed only once, prior to workflow execution. Of course, when the execution environment needs to instantiate the function, or create multiple instances for scale-out scenarios, the size of each instance may affect performance, so users should try to minimize the volume of the deployment package. It is also worth noting that such binary distributions are usually more compact than the full virtual machine images used in traditional IaaS clouds. Unfortunately, if a source distribution or portable binary is not available, it may not be possible to deploy an application as a cloud function. One useful option would be to allow deployment of container-based images, such as Docker images, but this is currently not supported.
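An illustrative layout of such a deployment package is sketched below. The file and directory names are hypothetical, but the structure reflects the ingredients discussed above: the function code, the NPM manifest and the statically linked binaries.

<pre>
gcf-executor/
  index.js        -- exports the GCF Executor function (handler code)
  package.json    -- NPM dependencies, installed automatically at deploy time
  bin/            -- statically linked Montage binaries (~20 MB compressed)
    mProjectPP
    mDiffFit
    mBackground
    ...
</pre>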
5.4 Variability

Variability is an important metric of cloud infrastructures, since distribution and resource sharing often hamper consistent performance. To measure the variability of GCF while executing scientific workflows, we collected the durations of parallel task executions in the Montage (0.25 degree) workflow – specifically, mBackground, mDiffFit and mProjectPP – running 10 workflow instances over a period of one day.

Results are shown in Fig. 5. We can see that the distribution of task durations is moderately wide, with an inter-quartile range of about 1 second. The distribution is skewed towards longer execution times, up to 7 seconds, while the median is about 4 seconds. Importantly, we do not observe any significant outliers. We have to note that the execution times of the tasks themselves vary (they are not identical) and that the task duration includes data transfer to/from cloud storage. Taking this into account, we can conclude that the execution environment behaves consistently in terms of performance, since the observed variation is rather low. Further studies and long-term monitoring would be required to determine whether such consistency is preserved over time.

[Figure 5: Distribution of execution times of the parallel tasks (mBackground, mDiffFit, mProjectPP) of the Montage 0.25 workflow; durations range from about 3 to 7 seconds.]

6. DISCUSSION

The experiments conducted with our prototype implementation confirm the feasibility of our approach to execution of scientific workflows in serverless infrastructures. There are, however, some limitations that need to be emphasized here, as well as some interesting implications for resource management of scientific workflows in such infrastructures.

6.1 Granularity of tasks

Granularity of tasks is a crucial issue which determines whether a given application is well suited for processing using serverless infrastructures. It is obvious that for long-running HPC applications a dedicated supercomputer is a better option. On the other hand, for high-throughput computing workloads, distributed infrastructures such as grids and clouds have proven useful. Serverless infrastructures can be considered similar to these high-throughput infrastructures, but they usually impose tighter limits on task execution time (300 seconds in the case of AWS Lambda, a 60-second default timeout for GCF). While these limits may vary, may be configurable or may change over time, we must assume that each infrastructure will always impose some kind of limit, which will constrain the types of supported workflows to those consisting of relatively fine-grained tasks. Many high-throughput workflows can fit into these constraints, but for the rest other solutions should be developed, such as hybrid approaches.

6.2 Hybrid solutions

In addition to purely serverless solutions, one can propose hybrid approaches, such as the one outlined in Section 3. The presented bridge model is a typical hybrid solution which combines traditional VMs with cloud functions for lightweight tasks. This architecture can overcome the limitations of cloud functions, such as the need to create custom binaries or execution time limits.

The hybrid approach can also be used to minimize costs and optimize throughput. Such optimization should be based on a cost analysis of leasing a VM versus calling a cloud function, assuming that a longer-term lease of resources typically corresponds to lower unit costs. This idea is generally applicable to hybrid cloud solutions [21]. For example, it may be more economical to lease VMs for long-running sequential parts of the workflow and to trigger cloud functions for parallel stages, where spawning VMs that are billed on an hourly basis would be more costly; a toy version of such a decision rule is sketched at the end of this subsection. It may also prove interesting to combine cloud functions with spot instances or burstable [11] instances, which are cheaper but have varying performance and reliability characteristics.

The hybrid approach can also help resolve issues caused by the statelessness and transience of cloud functions, where no local data is preserved between function calls. By adding a traditional VM as one of the executor units, data transfers can be significantly reduced in the case of tasks that need to access the same set of data multiple times.
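The following toy decision rule illustrates the kind of comparison such a cost-aware hybrid scheduler could make; all prices and the per-hour VM billing assumption are illustrative.

<pre>
// Toy decision rule for a hybrid scheduler: run a task as a cloud function
// or on a leased VM, whichever is cheaper. All numbers are illustrative.
function cheaperBackend(taskSeconds, memoryGB) {
  var fnCost = taskSeconds * memoryGB * 0.00001667; // per-GB-second price
  var vmHourly = 0.013;                             // e.g. a small VM
  // An hourly-billed VM charges a full hour even for a short task,
  // unless other tasks can share the remainder of that hour.
  var vmCost = Math.ceil(taskSeconds / 3600) * vmHourly;
  return fnCost < vmCost ? 'cloud function' : 'VM';
}
// Short parallel tasks favour functions; long stages favour VMs.
console.log(cheaperBackend(5, 1));    // -> 'cloud function'
console.log(cheaperBackend(3600, 1)); // -> 'VM'
</pre>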
6.3 Resource management and autoscaling

The core idea behind serverless infrastructures is that they free users from having to manage the server – and this also extends to clusters of servers. Decisions concerning resource management and autoscaling are thus made by the platform based on the current workload, history, etc. This is useful for typical Web or mobile applications that have interactive usage patterns and whose workload depends on user behavior. With regard to scientific workflows, which have a well-defined structure, there is ongoing research on scheduling algorithms for clusters, grids and clouds. The goal of these algorithms is to optimize such criteria as the time or cost of workflow execution, assuming that the user has some control over the infrastructure. In the case of serverless infrastructures the user does not have any control over the execution environment; the providers would need to change this policy by adding more control or the ability to specify user preferences regarding performance.

For example, users could specify priorities when deploying cloud functions: a higher priority would mean faster response time, quicker autoscaling, etc., but at an additional price, while lower-priority functions could have longer execution times, possibly relying on resource scavenging, but at a lower cost. Another option would be to allow users to provide hints regarding expected execution times or the anticipated level of parallelism. Such information could be useful for internal resource managers to better optimize the execution environment and prepare for demand spikes, e.g. when a workflow launches many parallel tasks.

Adding support for cooperation between the application and the internal resource manager of the cloud platform would open an interesting area for research and optimization of applications and infrastructures, from which both users and providers could potentially benefit.

7. RELATED WORK

Although scientific workflows in clouds have been widely studied, research typically focuses on IaaS, and there is little related work regarding serverless or other alternative types of infrastructures.

An example of using AWS Lambda for analyzing genomics data comes from the AWS blog [5]. The authors show how to use R, AWS Lambda and the Amazon API Gateway to process a large number of tasks. Their use case is to compute statistics for every gene in the genome, which gives about 20,000 tasks in an embarrassingly parallel problem. This work is similar to ours, but our approach is more general, since we show how to implement generic support for scientific workflows.

A detailed performance and cost comparison of traditional clouds using microservices with the AWS Lambda serverless architecture is presented in [19]. An enterprise application was benchmarked and the results show that serverless infrastructures can introduce significant savings without impacting performance. Similarly, in [20] the authors discuss the advantages of using cloud services and AWS Lambda for systems that require higher resilience. They show how serverless infrastructures can reduce costs in comparison to traditional IaaS resources and the spot market. Although these use cases differ from our scientific scenario, we believe that serverless infrastructures offer an interesting option for scientific workflows.

An interesting general discussion of the economics of hybrid clouds is presented in [21]. The author shows that even when a private cloud is strictly cheaper (per unit) than public clouds, a hybrid solution can result in a lower overall cost in the case of a variable workload. We expect that a similar effect can be observed in the case of a hybrid solution combining traditional and serverless infrastructures for scientific applications, which often exhibit a wide range of task granularities.

Regarding the use of alternative cloud solutions for scientific applications, there is work on the evaluation of Google App Engine for scientific applications [16, 14]. Google App Engine is a Platform-as-a-Service cloud, designed mostly for Web applications, but with additional support for processing background tasks. App Engine can be used for running parameter-study high-throughput computing workloads, and it imposes task processing time limits similar to those of serverless infrastructures. The difference is that the execution environment is more constrained, e.g. only one application framework is allowed (such as Java or Python) and there is no support for native code or access to local disk. For these reasons, we consider cloud functions such as AWS Lambda or Google Cloud Functions a more interesting option for scientific applications.

The concept of cloud functions can be considered an evolution of earlier remote procedure call concepts, such as GridRPC [17], proposed and standardized for Grid computing. The difference between these solutions and current cloud functions is that the latter are supported by commercial cloud providers with an emphasis on ease of use and development productivity. Moreover, the granularity of tasks processed by current cloud functions tends to be finer, so we need to follow the development of these technologies to further assess their applicability to scientific workflows.

A recently developed approach to decentralized workflow execution in clouds is represented by Flowbster [10], which also targets serverless infrastructures. We can expect that more such solutions will emerge in the near future.

The architectural concepts of scientific workflows are discussed in the context of component and service architectures in [7]. Cloud functions can be considered a specific class of services or components, which are stateless and can be deployed in cloud infrastructures. They do not impose any rules of composition, giving more freedom to developers. The most important distinction is that they are backed by the cloud infrastructure, which is responsible for automatic resource provisioning and scaling.

The architectures of cloud workflow systems are also discussed in [12]. We believe that such architectures need to be re-examined as new serverless infrastructures become more widespread.

Based on the discussion of related work we conclude that our paper is likely the first attempt to use serverless clouds for scientific workflows, and we expect that more research in this area will be needed as the platforms become more mature.
8. SUMMARY AND FUTURE WORK

In this paper we have presented our approach to combining scientific workflows with the emerging serverless clouds. We believe that such infrastructures, based on the concept of cloud functions such as AWS Lambda or Google Cloud Functions, provide an interesting alternative not only for typical enterprise applications, but also for scientific workflows. We have discussed several options for designing serverless workflow execution architectures, including queue-based, direct executor, hybrid (bridge) and decentralized ones.

To evaluate the feasibility of our approach we implemented a prototype based on the HyperFlow engine and Google Cloud Functions, and evaluated it with the real-world Montage application. Experiments with small-scale workflows consisting of 43 and 107 tasks confirm that the GCF platform can be successfully used and that it does not introduce significant delays. We have to note that the application needs to be prepared in a portable way to facilitate execution on such an infrastructure, and that this may be an issue for more complex scientific software packages.

Our paper also presents some implications of serverless infrastructures for resource management of scientific workflows. First, we observed that not all workloads are suitable due to execution time limits, e.g. 5 minutes in the case of AWS Lambda – accordingly, the granularity of tasks has to be taken into account. Next, we discussed how hybrid solutions combining serverless and traditional infrastructures can help optimize the performance and cost of scientific workflows. We also suggest that adding more control, or the ability to provide priorities or hints to cloud platforms, could benefit both providers and users in terms of optimizing performance and cost.

Since this is a fairly new topic, we see many options for future work. Further implementation work on the development and evaluation of various serverless architectures for scientific workflows is needed, with the decentralized option regarded as the greatest challenge. A more detailed performance evaluation of different classes of applications on various emerging infrastructures would also prove useful to better understand the possibilities and limitations of this approach. Finally, interesting research can be conducted in the field of resource management for scientific workflows, to design strategies and algorithms for optimizing the time or cost of workflow execution in the emerging serverless clouds.

Acknowledgments

This work is partially supported by the National Centre for Research and Development (NCBiR), Poland, project PBS1/B9/18/2013. AGH grant no. 11.11.230.124 is also acknowledged. The author would like to thank the Google Cloud Functions team for the opportunity to use the alpha version of their service.
9. REFERENCES

[1] AWS Lambda – Serverless Compute, 2016. https://aws.amazon.com/lambda/.
[2] Cloud Functions – Serverless Microservices | Google Cloud Platform, 2016. https://cloud.google.com/functions/.
[3] B. Balis. HyperFlow: A model of computation, programming approach and enactment engine for complex distributed workflows. Future Generation Computer Systems, 55:147–162, Sep 2016.
[4] B. Balis, K. Figiela, M. Malawski, M. Pawlik, and M. Bubak. A Lightweight Approach for Deployment of Scientific Workflows in Cloud Infrastructures. In R. Wyrzykowski, E. Deelman, J. Dongarra, K. Karczewski, J. Kitowski, and K. Wiatr, editors, Parallel Processing and Applied Mathematics: 11th International Conference, PPAM 2015, Krakow, Poland, September 6-9, 2015. Revised Selected Papers, Part I, pages 281–290. Springer International Publishing, Cham, 2016.
[5] B. Liston. Analyzing Genomics Data at Scale using R, AWS Lambda, and Amazon API Gateway | AWS Compute Blog, 2016. http://tinyurl.com/h7vyboo.
[6] E. Deelman, K. Vahi, G. Juve, M. Rynge, S. Callaghan, P. J. Maechling, R. Mayani, W. Chen, R. Ferreira da Silva, M. Livny, and K. Wenger. Pegasus, a workflow management system for science automation. Future Generation Computer Systems, 46:17–35, May 2015.
[7] D. Gannon. Component Architectures and Services: From Application Construction to Scientific Workflows, pages 174–189. Springer London, London, 2007.
[8] J. C. Jacob, D. S. Katz, G. B. Berriman, J. C. Good, A. Laity, E. Deelman, C. Kesselman, G. Singh, M.-H. Su, T. Prince, and others. Montage: a grid portal and software toolkit for science-grade astronomical image mosaicking. International Journal of Computational Science and Engineering, 4(2):73–87, 2009.
[9] G. Juve, E. Deelman, K. Vahi, G. Mehta, B. Berriman, B. P. Berman, and P. Maechling. Data Sharing Options for Scientific Workflows on Amazon EC2. In SC '10: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–9. IEEE Computer Society, 2010.
[10] P. Kacsuk, J. Kovacs, and Z. Farkas. Flowbster: Dynamic creation of data pipelines in clouds. In Digital Infrastructures for Research event, Krakow, Poland, 28-30 September 2016, 2016.
[11] P. Leitner and J. Scheuner. Bursting with Possibilities – An Empirical Study of Credit-Based Bursting Cloud Instance Types, Dec 2015.
[12] X. Liu, D. Yuan, G. Zhang, W. Li, D. Cao, Q. He, J. Chen, and Y. Yang. The Design of Cloud Workflow Systems. Springer New York, New York, NY, 2012.
[13] M. Malawski, G. Juve, E. Deelman, and J. Nabrzyski. Algorithms for cost- and deadline-constrained provisioning for scientific workflow ensembles in IaaS clouds. Future Generation Computer Systems, 48:1–18, 2015.
[14] M. Malawski, M. Kuzniar, P. Wojcik, and M. Bubak. How to Use Google App Engine for Free Computing. IEEE Internet Computing, 17(1):50–59, Jan 2013.
[15] M. Mao and M. Humphrey. Auto-scaling to minimize cost and meet application deadlines in cloud workflows. In SC '11: Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, Seattle, Washington, 2011. ACM.
[16] R. Prodan, M. Sperk, and S. Ostermann. Evaluating High-Performance Computing on Google App Engine. IEEE Software, 29(2):52–58, Mar 2012.
[17] K. Seymour, H. Nakada, S. Matsuoka, J. Dongarra, C. Lee, and H. Casanova. Overview of GridRPC: A remote procedure call API for Grid computing. In International Workshop on Grid Computing, pages 274–278. Springer, 2002.
[18] D. Thain, T. Tannenbaum, and M. Livny. Distributed computing in practice: the Condor experience. Concurrency and Computation: Practice and Experience, 17(2-4):323–356, 2005.
[19] M. Villamizar, O. Garces, L. Ochoa, H. Castro, L. Salamanca, M. Verano, R. Casallas, S. Gil, C. Valencia, A. Zambrano, and M. Lang. Infrastructure cost comparison of running web applications in the cloud using AWS Lambda and monolithic and microservice architectures. In 2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), pages 179–182, May 2016.
[20] B. Wagner and A. Sood. Economics of Resilient Cloud Services. In 1st IEEE International Workshop on Cyber Resilience Economics, Aug 2016.
[21] J. Weinman. Hybrid Cloud Economics. IEEE Cloud Computing, 3(1):18–22, Jan 2016.