10th International Workshop on Science Gateways (IWSG 2018), 13-15 June 2018

Parsl: Scalable Parallel Scripting in Python

Yadu Babuji∗, Kyle Chard∗, Ian Foster∗, Daniel S. Katz§, Michael Wilde∗, Anna Woodard∗, and Justin Wozniak∗
∗Computation Institute, University of Chicago & Argonne National Laboratory, Chicago, IL, USA
§National Center for Supercomputing Applications, University of Illinois Urbana-Champaign, Urbana, IL, USA

Abstract—Computational and data-driven research practices have changed significantly over the past decade to encompass new analysis models such as interactive and online computing. Science gateways are simultaneously evolving to support this transforming landscape, with the aim of enabling transparent, scalable execution of a variety of analyses. Science gateways often rely on workflow management systems to represent and execute analyses efficiently and reliably. However, integrating workflow systems in science gateways can be challenging, especially as analyses become more interactive and dynamic, requiring sophisticated orchestration and management of applications and data, and customization for specific execution environments. Parsl (Parallel Scripting Library), a Python library for programming and executing data-oriented workflows in parallel, addresses these problems. Developers simply annotate a Python script with Parsl directives wrapping either Python functions or calls to external applications. Parsl manages the execution of the script on clusters, clouds, grids, and other resources; orchestrates required data movement; and manages the execution of Python functions and external applications in parallel. The Parsl library can be easily integrated into Python-based gateways, allowing for simple management and scaling of workflows.

Index Terms—Parsl, Parallel scripting, Python, Scientific Workflows

I. INTRODUCTION

Data-driven research methodologies have had a disruptive impact on science, enabling new types of exploration and facilitating new discoveries [1], [2], [3]. Underlying these methodologies are new tools and technologies such as Jupyter notebooks for interactive analysis, scripting languages for flexible exploration, and a suite of libraries like Pandas and scikit-learn that facilitate cutting-edge analyses.

Science gateways [4] have long supported the varied needs of users, providing intuitive interfaces for end users to access both data and computing capabilities. Science gateway frameworks, such as Apache Airavata [5] and WS-PGRADE/gUSE [6], often rely on workflow frameworks to represent and execute analyses that benefit from extensibility, scalability, and robustness [7]. However, there are two significant challenges associated with current approaches: 1) many workflow engines are focused on many-task applications rather than interactive, online, or machine learning analyses; and 2) workflow engines are not easily integrated into external services (e.g., gateways) due to issues such as language mismatch and the need for intermediate workflow representations.

Here we present Parsl, a Python parallel scripting library that supports the development and execution of asynchronous and implicitly parallel data-oriented workflows. Building on the model used by the Swift workflow language [8], Parsl brings parallel workflow capabilities to scripts, applications, and gateways implemented in Python. Parsl scripts allow selected Python functions and external applications (called Apps) to be connected by shared input/output data objects into flexible parallel workflows. Parsl abstracts the specific execution environment, allowing the same script to be executed on arbitrary multicore processors, clusters, clouds, and supercomputers.

When a Parsl script is executed, the Parsl library causes annotated functions (Apps) to be intercepted by the Parsl execution fabric, which captures and serializes their parameters, analyzes their dependencies, and runs them on selected resources, referred to as sites. The execution fabric brings dependency awareness to Apps by introducing data futures as the inputs and outputs of Apps. Apps that use a data future as an input can be enqueued but will be blocked until that data future has been written. This feature allows Apps to execute in parallel whenever they do not share dependencies or their data dependencies have been resolved.

Fig. 1 depicts how Parsl interacts with its environment, including code, data, and resources. Parsl provides several advantages to science gateways: it allows a single script to be executed on any computing infrastructure, from clouds to supercomputers; it provides fault tolerance, automated elasticity, and support for various execution models; it handles data management by staging local data through its secure message queue and by managing wide area transfers with Globus [9]; and it can be trivially integrated via its Python interface.

Fig. 1: Parsl environment.

In this paper we describe Parsl, highlighting how it allows standard Python scripts and science gateways to be augmented to execute complex workflows and facilitate parallel execution. We describe Parsl's unique capabilities and present several example workflows that are common in science gateways, from computational chemistry, materials science, and biology, to highlight the power of the approach.

II. WORKFLOW MODELS

Parsl is designed to support not only traditional many-task workflow models but also new analysis models that are, and increasingly will be, supported by science gateways (e.g., online and interactive computing). We briefly describe three such workflow models that can be supported by Parsl.

Workflows have long been applied to a range of many-task applications, for example protein-ligand docking for drug screening [10]. Here, workflows are used to orchestrate a series of external applications applied to a large set of input data. For example, in drug screening, dozens of proteins are evaluated against hundreds of thousands of drug candidates to identify the location and orientation of a ligand that binds to a protein receptor. The top candidates are then processed with detailed molecular dynamics simulations to identify the most likely combinations for further experimentation. Gateways such as MoSGrid [11] and Galaxy [12] support such workflows.
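This fan-out/fan-in pattern—scoring a large candidate set in parallel and keeping only the top results—can be sketched with Python's concurrent.futures module, which Parsl itself leverages for its futures. The `score_candidate` function below is a hypothetical stand-in for an external docking application, not part of Parsl:

```python
from concurrent.futures import ThreadPoolExecutor

def score_candidate(protein, candidate):
    # Hypothetical stand-in for an external docking application that
    # returns a binding score for one (protein, candidate) pair.
    return (len(protein) * candidate) % 7  # placeholder arithmetic

def screen(protein, candidates, top_n=3):
    # Fan out: submit one task per drug candidate; each returns a future.
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = {c: pool.submit(score_candidate, protein, c)
                   for c in candidates}
        # Fan in: block on the futures and rank the candidates by score.
        scores = {c: f.result() for c, f in futures.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

top = screen("1ABC", range(100))
```

In a Parsl version of this sketch, each scoring call would be an App whose returned future is consumed by a downstream ranking step, so the fan-in is expressed implicitly through data dependencies rather than explicit result collection.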
Discovery science represents a new research methodology based on explorative, interactive analysis. The general model centers on the analysis of large volumes of data with the aim of finding unknown patterns. Notebook environments, such as Jupyter, provide an ideal interface in which researchers can discover and explore large data volumes using a variety of analytics approaches. Such methods are used in a wide range of studies, from computing the stopping power of electrons through materials to measuring discursive influence across scholarship [13]. Gateways such as Cloud Kotta [14] and HUBzero [15] expose Jupyter notebook interfaces for interactive computing.

Exploding data acquisition rates from scientific instruments, such as light sources, microscopes, and telescopes, necessitate rapid analysis to avoid data loss and enable online experiment steering. Real-time (or online) computing, such as that conducted at the Advanced Photon Source, allows data streamed from beamline computers to be processed in real time on a large cluster, with the aim of making real-time decisions during experiments [16].

III. PARSL MODEL

The Parsl architecture is shown in Fig. 2. Parsl scripts are decomposed into a simple dependency graph by the DataFlow Kernel (DFK). The DFK manages execution of individual Parsl Apps on a variety of sites. Unlike parallel scripting languages like Swift, in which every variable and piece of code is asynchronous, Parsl relies on users to annotate functions that will be run asynchronously based on data dependencies. The DFK provides a lightweight data management layer in which Python objects and files are staged to an execution site via a dedicated communication channel or Globus.

Fig. 2: Parsl architecture. The DataFlow Kernel maps scripts to Executors that support diverse computational platforms.

Dataflow Kernel: The DFK provides a single lightweight abstraction on top of different execution resources. This abstraction is at the heart of Parsl's ability to transparently support different execution fabrics. Parsl launches asynchronous Apps and passes futures to other Apps in lieu of computing results synchronously. The DFK is responsible for managing a script's execution, making ordinary functions aware of futures and ensuring that the execution of these functions is conditional on the resolution of all dependent futures. This enables completely asynchronous management of all launched tasks, with the data dependencies alone determining the order of execution.

Apps: A Parsl script is comprised of standard Python code plus a number of Apps—annotated units of Python code or external applications that specify their input and output characteristics and that may be run in parallel. An App may be defined by wrapping an existing function or the execution of an external command-line application using Bash scripting with the @App decorator. Listing 1 shows examples of these two types of Parsl Apps.

    @python_app
    def hello():
        return 'Hello World!'

    @bash_app
    def hello(inputs=[], outputs=[], stdout=None, stderr=None):
        return 'echo "Hello World"'

Listing 1: Two examples of Parsl Apps.

Futures: Parsl Apps are completely asynchronous. When an App is invoked, there is no guarantee of when the result will be returned. Instead of directly returning a result, Parsl returns an AppFuture: a construct that includes the real result as well as the status and exceptions for that asynchronous function invocation. Parsl also supplies methods to examine the future construct, including checking status, blocking on completion, and retrieving results. Parsl leverages Python's concurrent.futures module for this purpose.

Parsl also introduces a model for managing the asynchronous output files generated by an App invocation as DataFutures. DataFutures extend the AppFuture model by providing support for a range of operations related to files.

A. Execution

When instantiating the DFK, developers specify the specific execution providers and executors that will be used for executing the parallel components of the script. Execution providers are simple abstractions over computational resources, and executors provide an abstraction layer for executing tasks.

Parsl's execution interface is called libsubmit [17]—a simple Python library that provides a common interface to execution resources. Libsubmit's interface defines operations such as submission, status, and job management. It currently supports a variety of providers, including the Amazon Web Services, Microsoft Azure, and Jetstream clouds, as well as the Cobalt, Slurm, Torque, GridEngine, and HTCondor local resource managers (LRMs). New execution providers can be easily added by implementing libsubmit's execution provider interface.
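As a rough illustration of what such a provider abstraction looks like, the sketch below defines a minimal interface with a local-subprocess implementation. The class and method names here are illustrative and are not libsubmit's actual API:

```python
from abc import ABC, abstractmethod
import itertools
import subprocess

class ExecutionProvider(ABC):
    # Illustrative provider interface: submit work, poll job status,
    # and cancel jobs on some underlying resource.
    @abstractmethod
    def submit(self, command): ...

    @abstractmethod
    def status(self, job_id): ...

    @abstractmethod
    def cancel(self, job_id): ...

class LocalProvider(ExecutionProvider):
    # Runs commands as local subprocesses; a cluster provider would
    # instead template a batch script and hand it to sbatch/qsub.
    def __init__(self):
        self._jobs = {}
        self._ids = itertools.count()

    def submit(self, command):
        job_id = next(self._ids)
        self._jobs[job_id] = subprocess.Popen(command, shell=True)
        return job_id

    def status(self, job_id):
        return "RUNNING" if self._jobs[job_id].poll() is None else "COMPLETED"

    def cancel(self, job_id):
        self._jobs[job_id].terminate()

provider = LocalProvider()
jid = provider.submit("true")  # trivial POSIX command for demonstration
provider._jobs[jid].wait()
```

Because every provider exposes the same submit/status/cancel surface, higher layers can be written once and pointed at a laptop, a Slurm cluster, or a cloud without change—which is the portability property the paper attributes to libsubmit.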
Depending on the selected execution provider, there are a number of ways to submit workload to that resource. For example, for local execution, threads can be used, while for a cluster, pilot jobs or specialized launchers can be used. Parsl supports these different methods via its executor interface. Parsl currently supports three executors:

• ThreadPoolExecutor for multi-thread execution on local resources.
• IPyParallelExecutor for both local and remote execution using a pilot job model. The IPythonParallel controller is deployed locally and IPythonParallel engines are deployed on execution nodes. IPythonParallel then manages the execution of tasks on connected engines.
• Swift/TurbineExecutor for extreme-scale execution using the Swift/T (Turbine) [18] model to enable distributed task execution across an MPI environment. This executor is typically used on supercomputers.

It is important to note that Parsl scripts are not tied to a specific executor or execution provider. Furthermore, a single Parsl script may leverage multiple executors and execution providers concurrently—a model we refer to as multi-site. This allows Parsl developers to mix and match resources and execution models to meet their needs: for example, enabling a computational simulation to run on specialized HPC nodes, simple data manipulation tasks to be executed locally using threads, and visualizations to be rendered on GPU nodes.

B. Uniform execution model

Providing a uniform representation of heterogeneous resources is one of the most difficult challenges for parallel execution. Parsl provides an abstraction based on resource units called blocks. A block is a single unit of resources that is obtained from an execution provider. Within a block are a number of nodes. Parsl can then create TaskBlocks within and across (e.g., for MPI jobs) nodes. A TaskBlock is a virtual suballocation in which individual tasks can be launched. Fig. 3 shows three different block configurations. The first configuration represents the simplest model, in which a block is comprised of a single node with a single TaskBlock. The second configuration, with several TaskBlocks in a single node, is well suited for executing many single-threaded applications on a multicore node. The final configuration shows a block comprised of several nodes and offering several TaskBlocks. This configuration is generally used by MPI applications that span nodes. It requires specific MPI launchers supported by the target system, such as aprun, srun, mpirun, and mpiexec.

Fig. 3: Parsl Block model showing several common block configurations: (a) a block comprised of a node with one TaskBlock; (b) a block comprised of a node with several TaskBlocks; (c) a block comprised of four nodes with two TaskBlocks.

C. Parallelism and elasticity

Rather than precompiling a static representation of the entire workflow, Parsl implements a dynamic dependency graph in which the graph is constructed as tasks are enqueued. As the Parsl script executes the workflow, new tasks are added to a queue for execution; tasks are then executed asynchronously when their dependencies are met. Parsl uses the selected executor(s) to manage task execution on the execution provider(s).

As Parsl manages a dynamic dependency graph, it does not know the full "width" of a particular workflow a priori. Further, as a workflow executes, the needs of the tasks may change, as too might the capacity available on execution providers. Thus, Parsl must elastically scale the resources it is using. To do so, it includes an extensible flow control system to monitor outstanding tasks and available compute capacity. This monitor, which can be extended or implemented by users, determines when to trigger scaling (in or out) events.

Parsl provides a simple user-managed model for controlling elasticity. It allows users to prescribe the minimum and maximum number of blocks to be used on a given execution provider and a parameter (p) to control the level of parallelism, where parallelism is expressed as the ratio of TaskBlocks to active tasks. Each TaskBlock is capable of executing a single task at any given time. Therefore, a parallelism value of 1 represents aggressive scaling, in which as many resources as possible will be used; a parallelism close to 0 represents the opposite situation, in which few resources (i.e., one TaskBlock) will be used.

D. Data management

Parsl is designed to enable implementation of dataflow patterns in which the data passed between Apps manages the flow of execution. Dataflow programming models are popular as they can cleanly express, via implicit parallelism, the concurrency needed by many applications in a simple and intuitive way.

Parsl aims to abstract not only parallel execution but also execution location, which in turn requires data location abstraction. For Python Apps, Parsl uses a direct channel between the script and executors using Python object serialization. For files, Parsl implements a simple abstraction that can be used to reference data irrespective of its location. At present this model is limited to local and Globus [19] accessible files.

The Parsl file abstraction is used to pass location-independent references between Apps. It requires that the developer initially define a file's location (e.g., /local/path/file or globus://endpoint/file). The file may then be passed to each App and, when executed, Parsl will translate the location to a locally accessible file path.
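A minimal sketch of such a location-independent reference (an illustration of the idea, not Parsl's actual File class) might parse the scheme and compute where a staged copy would land:

```python
import os
from urllib.parse import urlparse

class File:
    # Illustrative location-independent file reference: plain paths are
    # used as-is, while globus://endpoint/path URLs record the endpoint
    # and path so a data manager could stage the file before execution.
    def __init__(self, url):
        parsed = urlparse(url)
        self.scheme = parsed.scheme or "file"
        self.endpoint = parsed.netloc or None
        self.path = parsed.path if parsed.scheme else url

    def local_path(self, staging_dir="/tmp/parsl_staging"):
        # A real implementation would trigger a Globus transfer here;
        # this sketch only computes where the staged copy would land.
        if self.scheme == "file":
            return self.path
        return os.path.join(staging_dir, os.path.basename(self.path))

f_local = File("/local/path/file")
f_remote = File("globus://endpoint/data/input.fastq")
```

Apps written against such references never branch on where a file lives; the translation to a locally accessible path happens once, at execution time.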
In the case of Globus, an explicit staging model is supported in which the developer must select the execution site to which the file should be transferred. Parsl uses the Globus SDK and its native app authentication model [20] to authenticate with the Globus service and securely move data between endpoints.

E. Caching

When developing a workflow, developers often execute the same workflow repeatedly with incremental changes; this scenario is especially prevalent in interactive computing workflows. Often large fragments of the workflow have not changed yet are computed again, wasting valuable developer time and computation resources. Caching of Apps (often called memoization) solves this problem by saving results from Apps that have completed so that they can be re-used. Parsl's caching model stores App results in an index alongside the App function, input parameters, and a hash of the function body. If caching is enabled—by an annotation on the App function or globally at the workflow level—the cache is interrogated before each App executes. Caching is supported for Python and Bash Apps. Users must explicitly enable caching to avoid issues with non-deterministic applications.

F. Checkpointing

Large-scale workflows are prone to errors due to node failures, application or environment errors, and myriad other issues. Parsl provides fault tolerance via an incremental checkpointing model, in which each checkpoint call saves all results that have been updated since the last checkpoint was created. When loading checkpoints, if entries with results from multiple functions (with identical hashes) are encountered, only the last entry read is considered. Checkpoints are loaded from checkpoint files when the DataFlow Kernel is initialized and written out to checkpoint files when explicitly requested.
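The caching scheme described above—indexing results by input parameters and a hash of the function body—can be sketched in a few lines of plain Python. This is an illustrative reimplementation, not Parsl's actual memoization code; it hashes the compiled function body as a stand-in for Parsl's hash:

```python
import functools
import hashlib

_cache = {}

def memoize_app(func):
    # Illustrative App caching: results are keyed on a hash of the
    # compiled function body plus the call's input parameters, so
    # editing the function invalidates its previously cached results.
    body_hash = hashlib.sha256(func.__code__.co_code).hexdigest()

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        key = (body_hash, args, tuple(sorted(kwargs.items())))
        if key not in _cache:
            _cache[key] = func(*args, **kwargs)
        return _cache[key]
    return wrapper

executions = []

@memoize_app
def simulate(x):
    executions.append(x)  # record how often the body really runs
    return x * x

results = [simulate(3), simulate(3), simulate(4)]
```

Here the repeated `simulate(3)` call is served from the cache, so the body runs only twice—the behavior that lets an incrementally edited workflow skip its unchanged fragments.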
IV. CASE STUDIES

We present three workflows implemented using Parsl to illustrate how it can satisfy the needs of different application domains. While these workflows have not yet been implemented in science gateways, they represent use cases that would benefit from gateway models.

SwiftSeq [21] is a bioinformatics workflow that supports aligning and genotyping gene panels, exomes, and whole genomes. The Parsl-based workflow is comprised of approximately 10 applications that communicate by writing and reading files. While applications must often execute in sequence, there are also opportunities for parallelism. First, the workflow is often executed on many samples, each of which can be analyzed in parallel; second, the large genetic sequences can be divided up and analyzed in parallel; and finally, some of the applications themselves can also be executed in parallel. SwiftSeq benefits not only from Parsl's ability to specify such parallelism, but also from its ability to express a complex workflow, manage the flow of data between Apps, recover from errors, and execute on many computational resources.

Parsl has been used in computational chemistry to develop molecular dynamics workflows. In one example, PACKMOL [22] is used to assemble initial starting configurations of ionic liquid molecules with a protein (e.g., Trp-cage), before a GPU-accelerated version of Amber [23] is used to energy minimize, heat, equilibrate, and run production molecular dynamics simulations. The workflow relies on three separate applications that are executed iteratively to perform different functions: PACKMOL is used to generate the system configuration, AmberTools is used to create input coordinate and parameter files for simulations, and Amber is used to run the various simulations. Parsl allows a wide range of different system configurations to be considered in parallel, and it also allows simple error handling logic to be expressed.

In materials science, researchers have used Parsl to predict the electronic stopping power of materials. Stopping power is the predominant energy-loss mechanism for charged particles and is important for applications related to radiation protection. Historically, the stopping power for a material is computed using analytical models such as the Lindhard model or using time-dependent density functional theory (TD-DFT). However, these methods are not suitable in all cases and are computationally expensive. The Parsl-based workflow uses TD-DFT calculations of a proton passing through a material [24], transforms that data to a representation compatible with machine learning, and then executes a number of machine learning algorithms to learn a predictive model. It finally applies these models from various directions to calculate a three-dimensional model of stopping power for a material. Parsl was used as it was able to trivially parallelize the existing Python codebase, support the composition of a sophisticated machine learning pipeline in a Jupyter notebook, and facilitate scalable execution of the pipeline from within the notebook on large-scale computing resources at the Argonne Leadership Computing Facility.

V. RELATED WORK

Many workflow systems have been developed to facilitate the expression and execution of arbitrary, data-oriented workflows, for example, the Swift parallel scripting language. Other systems include Pegasus [25] and Galaxy [12]. A weakness of these systems, however, is the need to develop the workflow in a separate representation (e.g., a graph). Parsl provides similar capabilities directly in a programming language that is broadly adopted by scientific users and, increasingly, by science gateways.

There are a number of Python-based workflow tools that better match common research environments, for example, Dask [26], Apache Airflow [27], Luigi [28], and FireWorks [29].

Dask is a parallel computing library designed for parallel analytics. It allows users to trivially migrate their single-node analyses to a parallel execution environment. Unlike Parsl, Dask scripts use Dask-specific functions in place of common libraries and programming constructs, for example using the Dask DataFrame in place of the Pandas DataFrame. Like Parsl, Dask decomposes a script into a dependent task graph that controls the execution of code blocks. Parsl focuses on a broader problem, including the ability to execute arbitrary applications on heterogeneous computing resources and providing support for managing data dependencies between these executions.

Apache Airflow is a workflow engine written in Python. Developers can express directed acyclic graphs of independent tasks. The Airflow scheduler is then responsible for executing the tasks on distributed workers according to their dependencies. Unlike Parsl's implicit workflow model, Airflow relies on users expressing their workflows as explicit tasks with explicit relationships between those tasks. Thus, the job of the user is essentially to describe a task dependency graph in Python.

Luigi scripts are created by writing Python classes that extend the Luigi task model: developers implement functions that manage input and output data, the code that will be run, as well as the explicit dependencies on other tasks. Unlike Parsl, Luigi focuses on Python tasks rather than orchestrating execution of external applications. Further, Luigi offers an execution model that deploys workers on a single cluster; it is not designed to support multiple sites, provide elastic resource management, or handle wide area data staging.

FireWorks is a Python-based workflow engine designed for executing high-throughput workflows on supercomputers. Workflows are described in Python, JSON, or YAML as a collection of tasks that are connected together into a "FireWork" for execution. The centralized server manages the workflow, using a MongoDB database to provide persistence and to support reliable execution on distributed resources. FireWorkers are deployed on compute resources to execute tasks; they connect to the centralized server to request tasks, execute them, and return results. Unlike Parsl, FireWorks focuses on the reliable execution of long-running jobs and therefore may not be suitable for short-running jobs or applications that demand a high submission rate.

VI. SUMMARY

Parsl provides an easy-to-use model that can be easily integrated in science gateways to support the management and execution of workflows composed of Python functions and external applications. Science gateways benefit from the extensibility, scalability, and robustness of the Parsl model to manage execution of potentially complex workflows on arbitrary computational resources. Parsl is specifically designed to address new workflow modalities, such as interactive computing in Jupyter notebooks, and provides a seamless and transparent way to scale these analyses from within the notebook. Parsl abstracts the complexity of interacting with different resource fabrics and execution models, supporting instead the development of resource-independent Python scripts. It also includes a number of advanced capabilities such as automated elasticity, support for multi-site execution, fault tolerance, and automated direct and wide area data management.

ACKNOWLEDGMENT

This work was supported in part by NSF award ACI-1550588 and DOE contract DE-AC02-06CH11357.

REFERENCES

[1] A. W. Toga, I. Foster, C. Kesselman, R. Madduri, K. Chard, E. W. Deutsch, N. D. Price, G. Glusman, B. D. Heavner, I. D. Dinov, J. Ames, J. Van Horn, R. Kramer, and L. Hood, "Big biomedical data as the key resource for discovery science," Journal of the American Medical Informatics Association, vol. 22, no. 6, pp. 1126–1131, 2015.
[2] N. P. Tatonetti, P. P. Ye, R. Daneshjou, and R. B. Altman, "Data-driven prediction of drug effects and interactions," Science Translational Medicine, vol. 4, no. 125, pp. 125ra31–125ra31, 2012.
[3] L. Ward and C. Wolverton, "Atomistic calculations and materials informatics: A review," Current Opinion in Solid State and Materials Science, vol. 21, no. 3, pp. 167–176, 2017.
[4] N. Wilkins-Diehr, "Special issue: science gateways - common community interfaces to grid resources," Concurrency and Computation: Practice and Experience, vol. 19, no. 6, pp. 743–749, 2007.
[5] S. Marru, L. Gunathilake, C. Herath, P. Tangchaisin, M. Pierce, C. Mattmann, R. Singh, T. Gunarathne, E. Chinthaka, R. Gardler, A. Slominski, A. Douma, S. Perera, and S. Weerawarana, "Apache Airavata: A framework for distributed applications and computational workflows," in Proceedings of the 2011 ACM Workshop on Gateway Computing Environments, 2011, pp. 21–28.
[6] P. Kacsuk, Z. Farkas, M. Kozlovszky, G. Hermann, A. Balasko, K. Karoczkai, and I. Marton, "WS-PGRADE/gUSE generic DCI gateway framework for a large variety of user communities," Journal of Grid Computing, vol. 10, no. 4, pp. 601–630, Dec 2012.
[7] T. Glatard, M. Étienne Rousseau, S. Camarasu-Pop, R. Adalat, N. Beck, S. Das, R. F. da Silva, N. Khalili-Mahani, V. Korkhov, P.-O. Quirion, P. Rioux, S. D. Olabarriaga, P. Bellec, and A. C. Evans, "Software architectures to integrate workflow engines in science gateways," Future Generation Computer Systems, vol. 75, pp. 239–255, 2017.
[8] M. Wilde, M. Hategan, J. M. Wozniak, B. Clifford, D. S. Katz, and I. Foster, "Swift: A language for distributed parallel scripting," Parallel Computing, vol. 37, no. 9, pp. 633–652, Sep. 2011.
[9] K. Chard, S. Tuecke, and I. Foster, "Efficient and secure transfer, synchronization, and sharing of big data," IEEE Cloud Computing, vol. 1, no. 3, pp. 46–55, Sept 2014.
[10] A. N. Adhikari, J. Peng, M. Wilde, J. Xu, K. F. Freed, and T. R. Sosnick, "Modeling large regions in proteins: Applications to loops, termini, and folding," Protein Science, vol. 21, no. 1, pp. 107–121, 2012.
[11] J. Krüger, R. Grunzke, S. Gesing, S. Breuers, A. Brinkmann, L. de la Garza, O. Kohlbacher, M. Kruse, W. E. Nagel, L. Packschies, R. Müller-Pfefferkorn, P. Schäfer, C. Schärfe, T. Steinke, T. Schlemmer, K. D. Warzecha, A. Zink, and S. Herres-Pawlis, "The MoSGrid science gateway – a complete solution for molecular simulations," Journal of Chemical Theory and Computation, vol. 10, no. 6, pp. 2232–2245, 2014.
[12] E. Afgan, D. Baker, M. van den Beek et al., "The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2016 update," Nucleic Acids Res., vol. 44, no. W1, p. W3, 2016.
[13] A. Gerow, Y. Hu, J. Boyd-Graber, D. M. Blei, and J. A. Evans, "Measuring discursive influence across scholarship," Proceedings of the National Academy of Sciences, 2018.
[14] Y. N. Babuji, K. Chard, and E. Duede, "Enabling interactive analytics of secure data using Cloud Kotta," in 8th Workshop on Scientific Cloud Computing, ser. ScienceCloud '17, 2017, pp. 9–15.
[15] M. McLennan and R. Kennell, "HUBzero: A platform for dissemination and collaboration in computational science and engineering," IEEE Des. Test, vol. 12, no. 2, pp. 48–53, Mar. 2010.
[16] T. Bicer, D. Gursoy, R. Kettimuthu, I. T. Foster, B. Ren, V. D. Andrede, and F. D. Carlo, "Real-time data analysis and autonomous steering of synchrotron light source experiments," in 13th IEEE International Conference on e-Science (e-Science), Oct 2017, pp. 59–68.
[17] "Libsubmit," https://github.com/Parsl/libsubmit.
[18] T. G. Armstrong, J. M. Wozniak, M. Wilde, and I. T. Foster, "Compiler techniques for massively scalable implicit task parallelism," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '14, 2014, pp. 299–310.
[19] K. Chard, S. Tuecke, and I. Foster, "Efficient and secure transfer, synchronization, and sharing of big data," IEEE Cloud Computing, vol. 1, no. 3, pp. 46–55, Sept 2014.
[20] S. Tuecke, R. Ananthakrishnan, K. Chard, M. Lidman, B. McCollam, S. Rosen, and I. Foster, "Globus Auth: A research identity and access management platform," in 12th IEEE International Conference on e-Science (e-Science), Oct 2016, pp. 203–212.
[21] J. Pitt, "SwiftSeq," http://www.igsb.org/software/swiftseq.
[22] L. Martínez, R. Andrade, E. G. Birgin, and J. M. Martínez, "PACKMOL: A package for building initial configurations for molecular dynamics simulations," J. Comp. Chemistry, vol. 30, no. 13, pp. 2157–2164, 2009.
[23] D. A. Case, T. E. Cheatham, T. Darden, H. Gohlke, R. Luo, K. M. Merz et al., "The Amber biomolecular simulation programs," J. Comp. Chemistry, vol. 26, no. 16, pp. 1668–1688, 2005.
[24] A. Schleife, Y. Kanai, and A. A. Correa, "Accurate atomistic first-principles calculations of electronic stopping," Phys. Rev. B, vol. 91, p. 014306, Jan 2015.
[25] E. Deelman, G. Singh, M.-H. Su, J. Blythe, Y. Gil et al., "Pegasus: A framework for mapping complex scientific workflows onto distributed systems," Scientific Programming, vol. 13, no. 3, pp. 219–237, 2005.
[26] M. Rocklin, "Dask: Parallel computation with blocked algorithms and task scheduling," in Proc. 14th Python in Sci. Conf., 2015, pp. 130–136.
[27] Apache Airflow Project, "Apache Airflow," https://airflow.incubator.apache.org/.
[28] Spotify, "Luigi," https://github.com/spotify/luigi.
[29] A. Jain, S. P. Ong, W. Chen, B. Medasani, X. Qu, M. Kocher, M. Brafman, G. Petretto, G. Rignanese, G. Hautier, D. Gunter, and K. A. Persson, "FireWorks: a dynamic workflow system designed for high-throughput applications," Concurrency and Computation: Practice and Experience, vol. 27, no. 17, pp. 5037–5059, 2015.