                       10th International Workshop on Science Gateways (IWSG 2018), 13-15 June 2018



            Parsl: Scalable Parallel Scripting in Python
Yadu Babuji∗ , Kyle Chard∗ , Ian Foster∗ , Daniel S. Katz§ , Michael Wilde∗ , Anna Woodard∗ , and Justin Wozniak∗
              ∗ Computation Institute, University of Chicago & Argonne National Laboratory, Chicago, IL, USA
       § National Center for Supercomputing Applications, University of Illinois Urbana-Champaign, Urbana, IL, USA


   Abstract—Computational and data-driven research practices have significantly changed over the past decade to encompass new analysis models such as interactive and online computing. Science gateways are simultaneously evolving to support this transforming landscape with the aim of enabling transparent, scalable execution of a variety of analyses. Science gateways often rely on workflow management systems to represent and execute analyses efficiently and reliably. However, integrating workflow systems in science gateways can be challenging, especially as analyses become more interactive and dynamic, requiring sophisticated orchestration and management of applications and data, and customization for specific execution environments. Parsl (Parallel Scripting Library), a Python library for programming and executing data-oriented workflows in parallel, addresses these problems. Developers simply annotate a Python script with Parsl directives wrapping either Python functions or calls to external applications. Parsl manages the execution of the script on clusters, clouds, grids, and other resources; orchestrates required data movement; and manages the execution of Python functions and external applications in parallel. The Parsl library can be easily integrated into Python-based gateways, allowing for simple management and scaling of workflows.

   Index Terms—Parsl, parallel scripting, Python, scientific workflows

                       I. INTRODUCTION

   Data-driven research methodologies have had a disruptive impact on science, enabling new types of exploration and facilitating new discoveries [1], [2], [3]. Underlying these methodologies are new tools and technologies such as Jupyter notebooks for interactive analysis, scripting languages for flexible exploration, and a suite of libraries like Pandas and scikit-learn that facilitate cutting-edge analyses.

   Science gateways [4] have long supported the varied needs of users, providing intuitive interfaces for end users to access both data and computing capabilities. Science gateway frameworks, such as Apache Airavata [5] and WS-PGRADE/gUSE [6], often rely on workflow frameworks to represent and execute analyses that benefit from extensibility, scalability, and robustness [7]. However, there are two significant challenges associated with current approaches: 1) many workflow engines are focused on many-task applications rather than interactive, online, or machine learning analyses; and 2) workflow engines are not easily integrated into external services (e.g., gateways) due to issues such as language mismatch and the need for intermediate workflow representations.

   Here we present Parsl, a Python parallel scripting library that supports the development and execution of asynchronous and implicitly parallel data-oriented workflows. Building on the model used by the Swift workflow language [8], Parsl brings parallel workflow capabilities to scripts, applications, and gateways implemented in Python. Parsl scripts allow selected Python functions and external applications (called Apps) to be connected by shared input/output data objects into flexible parallel workflows. Parsl abstracts the specific execution environment, allowing the same script to be executed on arbitrary multicore processors, clusters, clouds, and supercomputers.

   When a Parsl script is executed, the Parsl library causes annotated functions (Apps) to be intercepted by the Parsl execution fabric, which captures and serializes their parameters, analyzes their dependencies, and runs them on selected resources, referred to as sites. The execution fabric brings dependency awareness to Apps by introducing data futures as the inputs and outputs of Apps. Apps that use a data future as an input can be enqueued but will be blocked until that data future has been written. This feature allows Apps to execute in parallel whenever they do not share dependencies or their data dependencies have been resolved.

   Fig. 1 depicts how Parsl interacts with its environment, including code, data, and resources. Parsl provides several advantages to science gateways: it allows a single script to be executed on any computing infrastructure from clouds to supercomputers; it provides fault tolerance, automated elasticity, and support for various execution models; it handles data management by staging local data through its secure message queue and by managing wide-area transfers with Globus [9]; and it can be trivially integrated via its Python interface.

   In this paper we describe Parsl, highlighting how it allows standard Python scripts and science gateways to be augmented to execute complex workflows and facilitate parallel execution. We describe Parsl's unique capabilities and present several example workflows that are common in science gateways, from computational chemistry, materials science, and biology, to highlight the power of the approach.

                       II. WORKFLOW MODELS

   Parsl is designed to support not only traditional many-task workflow models but also new analysis models that are and will be increasingly supported by science gateways (e.g., online and interactive computing). We briefly describe three such workflow models that can be supported by Parsl.

   Workflows have long been applied to a range of many-task applications, for example protein-ligand docking for drug screening [10]. Here, workflows are used to orchestrate a series of external applications to be applied to a large set of input data. For example, in drug screening, dozens of proteins are evaluated against hundreds of thousands of drug candidates to identify the location and orientation of a ligand that binds to a protein receptor. The top candidates are then processed with
detailed molecular dynamics simulations to identify the most likely combinations to be used for further experimentation. Gateways such as MoSGrid [11] and Galaxy [12] support such workflows.

   Discovery science represents a new research methodology based on explorative, interactive analysis. The general model centers around analysis of large volumes of data with the aim of finding unknown patterns. Notebook environments, such as Jupyter, provide an ideal interface in which researchers can discover and explore large data volumes using a variety of analytics approaches. Such methods are used in a wide range of studies, from computing the stopping power of electrons through materials to measuring discursive influence across scholarship [13]. Gateways such as Cloud Kotta [14] and HubZero [15] expose Jupyter notebook interfaces for interactive computing.

   Exploding data acquisition rates from scientific instruments, such as light sources, microscopes, and telescopes, necessitate rapid analysis to avoid data loss and enable online experiment steering. Real-time (or online) computing, such as that conducted at the Advanced Photon Source, allows data streamed from beamline computers to be processed in real time on a large cluster, with the aim of making real-time decisions during experiments [16].

                  Fig. 1: Parsl environment.

                       III. PARSL MODEL

   The Parsl architecture is shown in Fig. 2. Parsl scripts are decomposed into a simple dependency graph by the DataFlow Kernel (DFK). The DFK manages execution of individual Parsl Apps on a variety of sites. Unlike parallel scripting languages like Swift, in which every variable and piece of code is asynchronous, Parsl relies on users to annotate functions that will be run asynchronously based on data dependencies. The DFK provides a lightweight data management layer in which Python objects and files are staged to an execution site via a dedicated communication channel or Globus.

   Dataflow Kernel: The DFK provides a single lightweight abstraction on top of different execution resources. This abstraction is at the heart of Parsl's ability to transparently support different execution fabrics.

   Parsl launches asynchronous Apps and passes futures to other Apps in lieu of computing results synchronously. The DFK is responsible for managing a script's execution, making ordinary functions aware of futures and ensuring that the execution of these functions is conditional on the resolution of all dependent futures. This enables completely asynchronous management of all launched tasks, with the data dependencies alone determining the order of execution.

   Apps: A Parsl script is comprised of standard Python code plus a number of Apps—annotated units of Python code or external applications that specify their input and output characteristics and that may be run in parallel. An App may be defined by wrapping either an existing Python function (@python_app) or the Bash command line of an external application (@bash_app). Listing 1 shows examples of these two types of Parsl Apps.

    @python_app
    def hello():
        return 'Hello World!'

    @bash_app
    def hello(inputs=[], outputs=[],
              stdout=None, stderr=None):
        return 'echo "Hello World"'

                Listing 1: Two examples of Parsl Apps.

   Futures: Parsl Apps are completely asynchronous. When an App is invoked, there is no guarantee of when the result will be returned. Instead of directly returning a result, Parsl returns an AppFuture: a construct that includes the real result as well as the status and exceptions for that asynchronous function invocation. Parsl also supplies methods to examine the future construct, including checking status, blocking on completion, and retrieving results. Parsl leverages Python's concurrent.futures module for this purpose.

   Parsl also introduces a model for managing the asynchronous output files generated by an App invocation as DataFutures. DataFutures extend the AppFuture model by providing support for a range of operations related to files.

A. Execution

   When instantiating the DFK, developers specify the specific execution providers and executors that will be used for executing the parallel components of the script. Execution providers are simple abstractions over computational resources, and executors provide an abstraction layer for executing tasks.

   Parsl's execution interface is called libsubmit [17]—a simple Python library that provides a common interface to execution resources. Libsubmit's interface defines operations such as submission, status, and job management. It currently supports a variety of providers, including the Amazon Web Services, Microsoft Azure, and Jetstream clouds as well as the Cobalt, Slurm, Torque, GridEngine, and HTCondor local resource managers (LRMs). New execution providers can be easily added by implementing libsubmit's execution provider interface.
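The execution provider abstraction can be pictured as a small interface. The sketch below is illustrative only: the class and method names are hypothetical stand-ins, not libsubmit's actual API, but they reflect the submission, status, and job management operations it defines.

```python
from abc import ABC, abstractmethod

class ExecutionProvider(ABC):
    """Illustrative shape of a provider interface offering the
    submission, status, and job management operations described
    above. Names are hypothetical, not libsubmit internals."""

    @abstractmethod
    def submit(self, command, blocks=1):
        """Request resource blocks and run `command`; return a job id."""

    @abstractmethod
    def status(self, job_ids):
        """Return the state of each submitted job."""

    @abstractmethod
    def cancel(self, job_ids):
        """Cancel the given jobs."""

class LocalProvider(ExecutionProvider):
    """A toy provider that 'submits' jobs to an in-memory table."""
    def __init__(self):
        self.jobs = {}

    def submit(self, command, blocks=1):
        job_id = f"job-{len(self.jobs)}"
        self.jobs[job_id] = "RUNNING"
        return job_id

    def status(self, job_ids):
        return [self.jobs.get(j, "UNKNOWN") for j in job_ids]

    def cancel(self, job_ids):
        for j in job_ids:
            self.jobs[j] = "CANCELLED"
```

In this model, supporting a new resource type amounts to subclassing the interface and implementing these three operations, as the toy LocalProvider does here.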


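The AppFuture semantics described in Section III (non-blocking status checks, blocking result retrieval) mirror Python's standard concurrent.futures interface, which Parsl builds on. A stdlib-only sketch of the same pattern, independent of Parsl itself:

```python
from concurrent.futures import ThreadPoolExecutor

def simulate(x):
    # stand-in for the body of an App
    return x * x

# submit() returns a Future immediately; execution proceeds asynchronously.
with ThreadPoolExecutor(max_workers=2) as pool:
    future = pool.submit(simulate, 4)
    future.done()              # non-blocking status check
    result = future.result()   # blocks until the task completes

print(result)                  # 16
```

Parsl's AppFuture adds App-specific state (such as exceptions from the asynchronous invocation) on top of this basic future model, and DataFutures apply the same idea to output files.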

  Fig. 2: Parsl architecture. The DataFlow Kernel maps scripts to Executors that support diverse computational platforms.

   Depending on the selected execution provider, there are a number of ways to submit workload to that resource. For example, for local execution, threads can be used, while for a cluster, pilot jobs or specialized launchers can be used. Parsl supports these different methods via its executor interface. Parsl currently supports three executors:
   • ThreadPoolExecutor for multi-threaded execution on local resources.
   • IPyParallelExecutor for both local and remote execution using a pilot job model. The IPythonParallel controller is deployed locally and IPythonParallel engines are deployed on execution nodes. IPythonParallel then manages the execution of tasks on connected engines.
   • Swift/TurbineExecutor for extreme-scale execution using the Swift/T (Turbine) [18] model to enable distributed task execution across an MPI environment. This executor is typically used on supercomputers.

   It is important to note that Parsl scripts are not tied to a specific executor or execution provider. Furthermore, a single Parsl script may leverage multiple executors and execution providers concurrently—a model we refer to as multi-site execution. This allows Parsl developers to mix and match resources and execution models to meet their needs: for example, enabling a computational simulation to run on specialized HPC nodes, simple data manipulation tasks to be executed locally using threads, and visualizations to be rendered on GPU nodes.

B. Uniform execution model

   Providing a uniform representation of heterogeneous resources is one of the most difficult challenges for parallel execution. Parsl provides an abstraction based on resource units called blocks. A block is a single unit of resources that is obtained from an execution provider. Within a block are a number of nodes. Parsl can then create TaskBlocks within and across (e.g., for MPI jobs) nodes. A TaskBlock is a virtual suballocation in which individual tasks can be launched. Figure 3 shows three different block configurations. The first configuration represents the simplest model, in which a block is comprised of a single node with a single TaskBlock. The second configuration, with several TaskBlocks in a single node, is well suited for executing many single-threaded applications on a multicore node. The final configuration shows a block comprised of several nodes and offering several TaskBlocks. This configuration is generally used by MPI applications that span nodes. It requires specific MPI launchers supported by the target system, such as aprun, srun, mpirun, and mpiexec.

C. Parallelism and elasticity

   Rather than precompile a static representation of the entire workflow, Parsl implements a dynamic dependency graph in which the graph is constructed as tasks are enqueued. As the Parsl script executes the workflow, new tasks are added to a queue for execution; tasks are then executed asynchronously when their dependencies are met. Parsl uses the selected executor(s) to manage task execution on the execution provider(s).

   As Parsl manages a dynamic dependency graph, it does not know the full "width" of a particular workflow a priori. Further, as a workflow executes, the needs of the tasks may change, as too might the capacity available on execution providers. Thus, Parsl must elastically scale the resources it is using. To do so, it includes an extensible flow control system to monitor outstanding tasks and available compute capacity. This monitor, which can be extended or implemented by users, determines when to trigger scaling (in or out) events.

   Parsl provides a simple user-managed model for controlling elasticity. It allows users to prescribe the minimum and maximum number of blocks to be used on a given execution provider and a parameter (p) to control the level of parallelism, where parallelism is expressed as the ratio of TaskBlocks to active tasks. Each TaskBlock is capable of executing a single task at any given time. Therefore, a parallelism value of 1 represents aggressive scaling, in which as many resources as possible will be used; parallelism close to 0 represents the opposite situation, in which few resources (i.e., 1 TaskBlock) will be used.

D. Data management

   Parsl is designed to enable implementation of dataflow patterns in which data passed between Apps manages the flow of execution. Dataflow programming models are popular as they can cleanly express, via implicit parallelism, the concurrency needed by many applications in a simple and intuitive way.

   Parsl aims to abstract not only parallel execution but also execution location, which in turn requires data location abstraction. For Python Apps, Parsl uses a direct channel between



(a) A block comprised of a node with one TaskBlock.
(b) A block comprised of a node with several TaskBlocks.
(c) A block comprised of four nodes with two TaskBlocks.

            Fig. 3: Parsl Block model showing several common block configurations.

the script and executors using Python object serialization. For files, Parsl implements a simple abstraction that can be used to reference data irrespective of its location. At present this model is limited to local and Globus [19] accessible files.

   The Parsl file abstraction is used to pass location-independent references between Apps. It requires that the developer initially define a file's location (e.g., /local/path/file or globus://endpoint/file). The file may then be passed to each App and, when executed, Parsl will translate the location to a locally accessible file path. In the case of Globus, an explicit staging model is supported in which the developer must select the execution site to which the file should be transferred. Parsl uses the Globus SDK and its native App authentication model [20] to authenticate with the Globus service and securely move data between endpoints.

E. Caching

   When developing a workflow, developers often execute the same workflow with incremental changes over and over; this scenario is especially prevalent in interactive computing workflows. Often large fragments of the workflow have not been changed yet are computed again, wasting valuable developer time and computation resources. Caching of Apps (often called memoization) solves this problem by saving results from Apps that have completed so that they can be re-used. Parsl's caching model stores App results in an index alongside the App function, input parameters, and a hash of the function body. If caching is enabled, by an annotation on the App function or globally at the workflow level, the cache is interrogated before each App executes. Caching is supported for Python and Bash Apps. Users must explicitly enable caching to avoid issues with non-deterministic applications.

F. Checkpointing

   Large-scale workflows are prone to errors due to node failures, application or environment errors, and myriad other issues. Parsl provides fault tolerance via an incremental checkpointing model, in which each checkpoint call saves all results that have been updated since the last checkpoint was created. When loading checkpoints, if entries with results from multiple functions (with identical hashes) are encountered, only the last entry read will be considered. Checkpoints are loaded from checkpoint files when the DataFlow Kernel is initialized and written out to checkpoint files when explicitly requested.

                       IV. CASE STUDIES

   We present three workflows implemented using Parsl to illustrate how it can satisfy the needs of different application domains. While these workflows have not yet been implemented in science gateways, they represent use cases that would benefit from gateway models.

   SwiftSeq [21] is a bioinformatics workflow that supports aligning and genotyping gene panels, exomes, and whole genomes. The Parsl-based workflow is comprised of approximately 10 applications that communicate by writing and reading files. While applications must often execute in sequence, there are also opportunities for parallelism. First, the workflow is often executed on many samples, each of which can be analyzed in parallel; second, the large genetic sequences can be divided up and analyzed in parallel; and finally, some of the applications themselves can also be executed in parallel. SwiftSeq benefits not only from Parsl's ability to specify such parallelism, but also from its ability to express a complex workflow, manage the flow of data between Apps, recover from errors, and execute on many computational resources.

   Parsl has been used in computational chemistry to develop molecular dynamics workflows. In one example, PACKMOL [22] is used to assemble initial starting configurations of ionic liquid molecules with a protein (e.g., Trp-cage), before a GPU-accelerated version of Amber [23] is used to energy minimize, heat, equilibrate, and run production molecular dynamics simulations. The workflow relies on three separate applications that are executed iteratively to perform different functions: PACKMOL is used to generate the system configuration, AmberTools are used to create input coordinate and parameter files for simulations, and Amber is used to run various simulations. Parsl allows a wide range of different system configurations to be considered in parallel, and it also allows simple error handling logic to be expressed.

   In materials science, researchers have used Parsl to predict the electronic stopping power of materials. Stopping power is the predominant energy-loss mechanism for charged particles and is important for applications related to radiation protection. Historically, the stopping power for a material is



                       10th International Workshop on Science Gateways (IWSG 2018), 13-15 June 2018


computed using analytical models such as the Lindhard model               execution of external applications. Further, Luigi offers a
or using Time-Dependent Density Functional Theory (TD-                    execution model that deploys workers on a single cluster; it is
DFT). However, these methods are not suitable in all cases                not designed to support multiple sites, provide elastic resource
and are computationally expensive. The Parsl-based workflow               management, or handle wide area data staging.
uses TD-DFT calculations of a proton passing through a ma-                   FireWorks is a Python-based workflow engine designed
terial [24], transforms that data to a representation compatible          for executing high-throughput workflows on supercomputers.
with machine learning, and then executes a number of machine              Workflows are described in Python, JSON, or YAML and
learning algorithms to learn a predictive model. It finally               as a collection of tasks which are connected together into a
applies these models from various directions to calculate a               “FireWork” for execution. The centralized server manages the
three dimensional model of stopping power for a material.                 workflow, using a MongoDB database to provide persistence
Parsl was used as it was able to trivially parallelize the existing       and to support reliable execution on distributed resources. Fire-
Python codebase, support the composition of a sophisticated               Workers are deployed on compute resources to execute tasks,
machine learning pipeline in a Jupyter notebook, and facilitate           they connect to the centralized server to request tasks, execute
scalable execution of the pipeline from within the notebook               them, and return results. Unlike Parsl, FireWorks focuses on
on large-scale computing resources at the Argonne Leadership              the reliable execution of long running jobs and therefore may
Computing Facility.                                                       not be suitable for short running jobs or applications that
                                                                          demand a high submission rate.
                      V. R ELATED W ORK
   Many workflow systems have been developed to facili-                                                VI. S UMMARY
tate the expression and execution of arbitrary, data-oriented                Parsl provides an easy-to-use model that can be easily
workflows, for example, the Swift parallel scripting language.            integrated in science gateways to support the management
Other systems include Pegasus [25] and Galaxy [12]. A                     and execution of workflows composed of Python functions
weakness of these systems, however, is the need to develop the workflow in a separate representation (e.g., a graph). Parsl provides similar capabilities directly in a programming language that is broadly adopted by scientific users and, increasingly, by science gateways.
   There are a number of Python-based workflow tools that better match common research environments, for example, Dask [26], Apache Airflow [27], Luigi [28], and FireWorks [29].
   Dask is a parallel computing library designed for parallel analytics. It allows users to trivially migrate their single-node analyses to a parallel execution environment. Unlike Parsl, Dask scripts use Dask-specific functions in place of common libraries and programming constructs, for example using the Dask DataFrame in place of the Pandas DataFrame. Like Parsl, Dask decomposes a script into a dependent task graph that controls the execution of code blocks. Parsl focuses on a broader problem, including the ability to execute arbitrary applications on heterogeneous computing resources and to manage data dependencies between these executions.
   Apache Airflow is a workflow engine written in Python. Developers express directed acyclic graphs of independent tasks, and the Airflow scheduler is then responsible for executing the tasks on distributed workers according to their dependencies. Unlike Parsl's implicit workflow model, Airflow relies on users expressing their workflows as explicit tasks with explicit relationships between those tasks. Thus, the job of the user is essentially to describe a task dependency graph in Python.
   Luigi scripts are created by writing Python classes that extend the Luigi task model: developers implement functions that manage input and output data, the code that will be run, as well as the explicit dependencies on other tasks. Unlike Parsl, Luigi focuses on Python tasks rather than orchestrating external applications.

                                SUMMARY

   Parsl enables scalable parallel scripting in Python, allowing workflows to be composed from annotated Python functions and external applications. Science gateways benefit from the extensibility, scalability, and robustness of the Parsl model to manage execution of potentially complex workflows on arbitrary computational resources. Parsl is specifically designed to address new workflow modalities, such as interactive computing in Jupyter notebooks, and provides a seamless and transparent way to scale these analyses from within the notebook. Parsl abstracts the complexity of interacting with different resource fabrics and execution models, instead supporting the development of resource-independent Python scripts. It also includes a number of advanced capabilities such as automated elasticity, support for multi-site execution, fault tolerance, and automated direct and wide-area data management.

                            ACKNOWLEDGMENT

   This work was supported in part by NSF award ACI-1550588 and DOE contract DE-AC02-06CH11357.

                              REFERENCES

 [1] A. W. Toga, I. Foster, C. Kesselman, R. Madduri, K. Chard, E. W. Deutsch, N. D. Price, G. Glusman, B. D. Heavner, I. D. Dinov, J. Ames, J. Van Horn, R. Kramer, and L. Hood, "Big biomedical data as the key resource for discovery science," Journal of the American Medical Informatics Association, vol. 22, no. 6, pp. 1126–1131, 2015.
 [2] N. P. Tatonetti, P. P. Ye, R. Daneshjou, and R. B. Altman, "Data-driven prediction of drug effects and interactions," Science Translational Medicine, vol. 4, no. 125, p. 125ra31, 2012.
 [3] L. Ward and C. Wolverton, "Atomistic calculations and materials informatics: A review," Current Opinion in Solid State and Materials Science, vol. 21, no. 3, pp. 167–176, 2017.
 [4] N. Wilkins-Diehr, "Special issue: Science gateways – common community interfaces to grid resources," Concurrency and Computation: Practice and Experience, vol. 19, no. 6, pp. 743–749, 2007.
 [5] S. Marru, L. Gunathilake, C. Herath, P. Tangchaisin, M. Pierce, C. Mattmann, R. Singh, T. Gunarathne, E. Chinthaka, R. Gardler, A. Slominski, A. Douma, S. Perera, and S. Weerawarana, "Apache Airavata: A framework for distributed applications and computational workflows," in Proceedings of the 2011 ACM Workshop on Gateway Computing Environments, 2011, pp. 21–28.
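The task-graph decomposition described above can be illustrated without Dask itself. The following standard-library sketch (all names hypothetical; this is not Dask's API) shows how a `delayed`-style wrapper records a function call and its inputs instead of executing immediately, so that a dependency graph is built first and only evaluated on request:

```python
from concurrent.futures import ThreadPoolExecutor

class Delayed:
    """A lazily evaluated task: records the function and its inputs
    instead of running immediately, forming a dependency graph."""
    def __init__(self, fn, *args):
        self.fn, self.args = fn, args

    def compute(self, pool):
        # Resolve upstream Delayed inputs first (this task's dependencies),
        # then run this task in the worker pool.
        resolved = [a.compute(pool) if isinstance(a, Delayed) else a
                    for a in self.args]
        return pool.submit(self.fn, *resolved).result()

def delayed(fn):
    return lambda *args: Delayed(fn, *args)

@delayed
def load(n):          # stand-in for reading one partition of data
    return list(range(n))

@delayed
def total(a, b):      # combines two upstream results
    return sum(a) + sum(b)

with ThreadPoolExecutor() as pool:
    graph = total(load(4), load(3))   # builds the graph; nothing runs yet
    print(graph.compute(pool))        # runs dependencies, then total -> 9
```

Real Dask additionally partitions data structures (e.g., the Dask DataFrame) and schedules the resulting graph across processes or cluster nodes; the sketch only captures the deferred-execution model the text describes.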
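This explicit model (the user enumerates tasks and dependency edges, and a scheduler orders execution) can be sketched with the standard library alone. The task names and callables below are hypothetical stand-ins, not Airflow operators:

```python
from graphlib import TopologicalSorter

results = {}

# Explicitly declared tasks, as an Airflow user would write out operators.
tasks = {
    "extract":   lambda: results.update(extract=[3, 1, 2]),
    "transform": lambda: results.update(transform=sorted(results["extract"])),
    "load":      lambda: results.update(load=len(results["transform"])),
}

# Explicit dependency edges: each task maps to the tasks it depends on.
deps = {"transform": {"extract"}, "load": {"transform"}}

# The scheduler's job: run tasks in a dependency-respecting order.
for name in TopologicalSorter(deps).static_order():
    tasks[name]()

print(results["load"])  # -> 3
```

The contrast with Parsl is that here the graph is written down by hand, whereas Parsl infers an equivalent graph implicitly from the data passed between annotated functions.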
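The class-based pattern described here can be sketched minimally as follows. This is an illustrative stand-in, not Luigi's actual base class (real Luigi tasks also declare parameters and file-based output targets):

```python
class Task:
    """Minimal stand-in for a Luigi-style task base class."""
    def requires(self):      # upstream tasks this task depends on
        return []
    def run(self, inputs):   # the code to execute; returns this task's output
        raise NotImplementedError

def build(task):
    """Tiny scheduler: recursively satisfy dependencies, then run the task."""
    inputs = [build(t) for t in task.requires()]
    return task.run(inputs)

class Fetch(Task):
    def run(self, inputs):
        return [4, 2, 5]      # stand-in for reading input data

class Report(Task):
    def requires(self):
        return [Fetch()]      # explicit dependency on Fetch
    def run(self, inputs):
        (data,) = inputs
        return max(data)

print(build(Report()))        # -> 5
```

As with Airflow, the dependencies are declared explicitly (via `requires()`), rather than inferred from dataflow as in Parsl.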
 [6] P. Kacsuk, Z. Farkas, M. Kozlovszky, G. Hermann, A. Balasko, K. Karoczkai, and I. Marton, "WS-PGRADE/gUSE generic DCI gateway framework for a large variety of user communities," Journal of Grid Computing, vol. 10, no. 4, pp. 601–630, Dec 2012.
 [7] T. Glatard, M.-É. Rousseau, S. Camarasu-Pop, R. Adalat, N. Beck, S. Das, R. F. da Silva, N. Khalili-Mahani, V. Korkhov, P.-O. Quirion, P. Rioux, S. D. Olabarriaga, P. Bellec, and A. C. Evans, "Software architectures to integrate workflow engines in science gateways," Future Generation Computer Systems, vol. 75, pp. 239–255, 2017.
 [8] M. Wilde, M. Hategan, J. M. Wozniak, B. Clifford, D. S. Katz, and I. Foster, "Swift: A language for distributed parallel scripting," Parallel Computing, vol. 37, no. 9, pp. 633–652, Sep. 2011.
 [9] K. Chard, S. Tuecke, and I. Foster, "Efficient and secure transfer, synchronization, and sharing of big data," IEEE Cloud Computing, vol. 1, no. 3, pp. 46–55, Sept 2014.
[10] A. N. Adhikari, J. Peng, M. Wilde, J. Xu, K. F. Freed, and T. R. Sosnick, "Modeling large regions in proteins: Applications to loops, termini, and folding," Protein Science, vol. 21, no. 1, pp. 107–121, 2012.
[11] J. Krüger, R. Grunzke, S. Gesing, S. Breuers, A. Brinkmann, L. de la Garza, O. Kohlbacher, M. Kruse, W. E. Nagel, L. Packschies, R. Müller-Pfefferkorn, P. Schäfer, C. Schärfe, T. Steinke, T. Schlemmer, K. D. Warzecha, A. Zink, and S. Herres-Pawlis, "The MoSGrid science gateway – a complete solution for molecular simulations," Journal of Chemical Theory and Computation, vol. 10, no. 6, pp. 2232–2245, 2014.
[12] E. Afgan, D. Baker, M. van den Beek et al., "The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2016 update," Nucleic Acids Res., vol. 44, no. W1, p. W3, 2016.
[13] A. Gerow, Y. Hu, J. Boyd-Graber, D. M. Blei, and J. A. Evans, "Measuring discursive influence across scholarship," Proceedings of the National Academy of Sciences, 2018.
[14] Y. N. Babuji, K. Chard, and E. Duede, "Enabling interactive analytics of secure data using Cloud Kotta," in 8th Workshop on Scientific Cloud Computing, ser. ScienceCloud '17, 2017, pp. 9–15.
[15] M. McLennan and R. Kennell, "HUBzero: A platform for dissemination and collaboration in computational science and engineering," IEEE Des. Test, vol. 12, no. 2, pp. 48–53, Mar. 2010.
[16] T. Bicer, D. Gursoy, R. Kettimuthu, I. T. Foster, B. Ren, V. D. Andrede, and F. D. Carlo, "Real-time data analysis and autonomous steering of synchrotron light source experiments," in 13th IEEE International Conference on e-Science (e-Science), Oct 2017, pp. 59–68.
[17] "Libsubmit," https://github.com/Parsl/libsubmit.
[18] T. G. Armstrong, J. M. Wozniak, M. Wilde, and I. T. Foster, "Compiler techniques for massively scalable implicit task parallelism," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '14, 2014, pp. 299–310.
[19] K. Chard, S. Tuecke, and I. Foster, "Efficient and secure transfer, synchronization, and sharing of big data," IEEE Cloud Computing, vol. 1, no. 3, pp. 46–55, Sept 2014.
[20] S. Tuecke, R. Ananthakrishnan, K. Chard, M. Lidman, B. McCollam, S. Rosen, and I. Foster, "Globus Auth: A research identity and access management platform," in 12th IEEE International Conference on e-Science (e-Science), Oct 2016, pp. 203–212.
[21] J. Pitt, "SwiftSeq," http://www.igsb.org/software/swiftseq.
[22] L. Martínez, R. Andrade, E. G. Birgin, and J. M. Martínez, "PACKMOL: A package for building initial configurations for molecular dynamics simulations," J. Comp. Chemistry, vol. 30, no. 13, pp. 2157–2164, 2009.
[23] D. A. Case, T. E. Cheatham, T. Darden, H. Gohlke, R. Luo, K. M. Merz et al., "The Amber biomolecular simulation programs," J. Comp. Chemistry, vol. 26, no. 16, pp. 1668–1688, 2005.
[24] A. Schleife, Y. Kanai, and A. A. Correa, "Accurate atomistic first-principles calculations of electronic stopping," Phys. Rev. B, vol. 91, p. 014306, Jan 2015.
[25] E. Deelman, G. Singh, M.-H. Su, J. Blythe, Y. Gil et al., "Pegasus: A framework for mapping complex scientific workflows onto distributed systems," Scientific Programming, vol. 13, no. 3, pp. 219–237, 2005.
[26] M. Rocklin, "Dask: Parallel computation with blocked algorithms and task scheduling," in Proc. 14th Python in Sci. Conf., 2015, pp. 130–136.
[27] Apache Airflow Project, "Apache Airflow," https://airflow.incubator.apache.org/.
[28] Spotify, "Luigi," https://github.com/spotify/luigi.
[29] A. Jain, S. P. Ong, W. Chen, B. Medasani, X. Qu, M. Kocher, M. Brafman, G. Petretto, G. Rignanese, G. Hautier, D. Gunter, and K. A. Persson, "FireWorks: a dynamic workflow system designed for high-throughput applications," Concurrency and Computation: Practice and Experience, vol. 27, no. 17, pp. 5037–5059, 2015.