=Paper=
{{Paper
|id=Vol-2357/paper11
|storemode=property
|title=Parsl: Scalable Parallel Scripting in Python
|pdfUrl=https://ceur-ws.org/Vol-2357/paper11.pdf
|volume=Vol-2357
|authors=Yadu Babuji,Kyle Chard,Ian Foster,Daniel S. Katz,Mike Wilde,Anna Woodard,Justin Wozniak
|dblpUrl=https://dblp.org/rec/conf/iwsg/BabujiCFKWWW18
}}
==Parsl: Scalable Parallel Scripting in Python==
10th International Workshop on Science Gateways (IWSG 2018), 13-15 June 2018

Yadu Babuji∗, Kyle Chard∗, Ian Foster∗, Daniel S. Katz§, Michael Wilde∗, Anna Woodard∗, and Justin Wozniak∗

∗Computation Institute, University of Chicago & Argonne National Laboratory, Chicago, IL, USA
§National Center for Supercomputing Applications, University of Illinois Urbana-Champaign, Urbana, IL, USA
Abstract—Computational and data-driven research practices have significantly changed over the past decade to encompass new analysis models such as interactive and online computing. Science gateways are simultaneously evolving to support this transforming landscape with the aim of enabling transparent, scalable execution of a variety of analyses. Science gateways often rely on workflow management systems to represent and execute analyses efficiently and reliably. However, integrating workflow systems in science gateways can be challenging, especially as analyses become more interactive and dynamic, requiring sophisticated orchestration and management of applications and data, and customization for specific execution environments. Parsl (Parallel Scripting Library), a Python library for programming and executing data-oriented workflows in parallel, addresses these problems. Developers simply annotate a Python script with Parsl directives wrapping either Python functions or calls to external applications. Parsl manages the execution of the script on clusters, clouds, grids, and other resources; orchestrates required data movement; and manages the execution of Python functions and external applications in parallel. The Parsl library can be easily integrated into Python-based gateways, allowing for simple management and scaling of workflows.

Index Terms—Parsl, parallel scripting, Python, scientific workflows

I. INTRODUCTION

Data-driven research methodologies have had a disruptive impact on science, enabling new types of exploration and facilitating new discoveries [1], [2], [3]. Underlying these methodologies are new tools and technologies such as Jupyter notebooks for interactive analysis, scripting languages for flexible exploration, and a suite of libraries like Pandas and scikit-learn that facilitate cutting-edge analyses.

Science gateways [4] have long supported the varied needs of users, providing intuitive interfaces for end users to access both data and computing capabilities. Science gateway frameworks, such as Apache Airavata [5] and WS-PGRADE/gUSE [6], often rely on workflow frameworks to represent and execute analyses that benefit from extensibility, scalability, and robustness [7]. However, there are two significant challenges associated with current approaches: 1) many workflow engines are focused on many-task applications rather than interactive, online, or machine learning analyses; and 2) workflow engines are not easily integrated into external services (e.g., gateways) due to issues such as language mismatch and the need for intermediate workflow representations.

Here we present Parsl, a Python parallel scripting library that supports the development and execution of asynchronous and implicitly parallel data-oriented workflows. Building on the model used by the Swift workflow language [8], Parsl brings parallel workflow capabilities to scripts, applications, and gateways implemented in Python. Parsl scripts allow selected Python functions and external applications (called Apps) to be connected by shared input/output data objects into flexible parallel workflows. Parsl abstracts the specific execution environment, allowing the same script to be executed on arbitrary multicore processors, clusters, clouds, and supercomputers.

When a Parsl script is executed, the Parsl library causes annotated functions (Apps) to be intercepted by the Parsl execution fabric, which captures and serializes their parameters, analyzes their dependencies, and runs them on selected resources, referred to as sites. The execution fabric brings dependency awareness to Apps by introducing data futures as the inputs and outputs of Apps. Apps that use a data future as an input can be enqueued but will be blocked until that data future has been written. This feature allows Apps to execute in parallel whenever they do not share dependencies or their data dependencies have been resolved.

Fig. 1 depicts how Parsl interacts with its environment, including code, data, and resources. Parsl provides several advantages to science gateways: it allows a single script to be executed on any computing infrastructure from clouds to supercomputers; it provides fault tolerance, automated elasticity, and support for various execution models; it handles data management by staging local data through its secure message queue and by managing wide-area transfers with Globus [9]; and it can be trivially integrated via its Python interface.

Fig. 1: Parsl environment.

In this paper we describe Parsl, highlighting how it allows standard Python scripts and science gateways to be augmented to execute complex workflows and facilitate parallel execution. We describe Parsl's unique capabilities and present several example workflows that are common in science gateways, drawn from computational chemistry, materials science, and biology, to highlight the power of the approach.

II. WORKFLOW MODELS

Parsl is designed to support not only traditional many-task workflow models but also new analysis models that science gateways increasingly support (e.g., online and interactive computing). We briefly describe three such workflow models that can be supported by Parsl.

Workflows have long been applied to a range of many-task applications, for example protein-ligand docking for drug screening [10]. Here, workflows are used to orchestrate a series of external applications to be applied to a large set of input data. For example, in drug screening, dozens of proteins are evaluated against hundreds of thousands of drug candidates to identify the location and orientation of a ligand that binds to a protein receptor. The top candidates are then processed with detailed molecular dynamics simulations to identify the most likely combinations to be used for further experimentation. Gateways such as MoSGrid [11] and Galaxy [12] support such workflows.

Discovery science represents a new research methodology based on explorative, interactive analysis. The general model centers around analysis of large volumes of data with the aim of finding unknown patterns. Notebook environments, such as Jupyter, provide an ideal interface in which researchers can discover and explore large data volumes using a variety of analytics approaches. Such methods are used in a wide range of studies, from computing the stopping power of electrons through materials to measuring discursive influence across scholarship [13]. Gateways such as Cloud Kotta [14] and HUBzero [15] expose Jupyter notebook interfaces for interactive computing.

Exploding data acquisition rates from scientific instruments, such as light sources, microscopes, and telescopes, necessitate rapid analysis to avoid data loss and enable online experiment steering. Real-time (or online) computing, such as that conducted at the Advanced Photon Source, allows data streamed from beamline computers to be processed in real time on a large cluster, with the aim of making real-time decisions during experiments [16].

III. PARSL MODEL

The Parsl architecture is shown in Fig. 2. Parsl scripts are decomposed into a simple dependency graph by the DataFlow Kernel (DFK). The DFK manages execution of individual Parsl Apps on a variety of sites. Unlike parallel scripting languages such as Swift, in which every variable and piece of code is asynchronous, Parsl relies on users to annotate functions that will be run asynchronously based on data dependencies. The DFK provides a lightweight data management layer in which Python objects and files are staged to an execution site via a dedicated communication channel or Globus.

Fig. 2: Parsl architecture. The DataFlow Kernel maps scripts to Executors that support diverse computational platforms.

Dataflow Kernel: The DFK provides a single lightweight abstraction on top of different execution resources. This abstraction is at the heart of Parsl's ability to transparently support different execution fabrics. Parsl launches asynchronous Apps and passes futures to other Apps in lieu of computing results synchronously. The DFK is responsible for managing a script's execution, making ordinary functions aware of futures and ensuring that the execution of these functions is conditional on the resolution of all dependent futures. This enables completely asynchronous management of all launched tasks, with the data dependencies alone determining the order of execution.

Apps: A Parsl script is composed of standard Python code plus a number of Apps: annotated units of Python code or external applications that specify their input and output characteristics and that may be run in parallel. An App may be defined by wrapping either an existing Python function or the invocation of an external command-line application expressed in Bash, using the corresponding App decorator. Listing 1 shows examples of these two types of Parsl Apps.

@python_app
def hello():
    return 'Hello World!'

@bash_app
def hello(inputs=[], outputs=[],
          stdout=None, stderr=None):
    return 'echo "Hello World"'

Listing 1: Two examples of Parsl Apps.

Futures: Parsl Apps are completely asynchronous. When an App is invoked, there is no guarantee of when the result will be returned. Instead of directly returning a result, Parsl returns an AppFuture: a construct that includes the real result as well as the status and exceptions for that asynchronous function invocation. Parsl also supplies methods to examine the future construct, including checking status, blocking on completion, and retrieving results. Parsl leverages Python's concurrent.futures module for this purpose.
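To make these semantics concrete, the following sketch (ours, not from the paper) invokes the Python App of Listing 1 and inspects its AppFuture. The local-threads configuration import and the parsl.load call follow later Parsl releases, so the exact initialization API is an assumption for the version described here.

import parsl
from parsl import python_app
from parsl.configs.local_threads import config  # assumed helper config

parsl.load(config)  # initialize the DataFlow Kernel for local execution

@python_app
def hello():
    return 'Hello World!'

fut = hello()        # returns an AppFuture immediately, without blocking
print(fut.done())    # non-blocking status check
print(fut.result())  # blocks until the App completes; prints 'Hello World!'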
Parsl also introduces a model for managing the asynchronous output files generated by an App invocation as DataFutures. DataFutures extend the AppFuture model by providing support for a range of operations related to files.
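The following sketch (ours) chains two Bash Apps through a shared file using DataFutures. The File import path and the outputs attribute on the returned AppFuture follow later Parsl documentation and should be treated as assumptions for the version described here.

import parsl
from parsl import bash_app
from parsl.configs.local_threads import config
from parsl.data_provider.files import File  # assumed import path

parsl.load(config)

@bash_app
def make_greeting(outputs=[]):
    return 'echo "Hello World" > {}'.format(outputs[0])

@bash_app
def count_words(inputs=[], stdout='count.txt'):
    return 'wc -w {}'.format(inputs[0])

# The outputs attribute of the AppFuture holds DataFutures for the
# files the App will produce.
greeting = make_greeting(outputs=[File('hello.txt')])

# count_words is enqueued immediately but runs only once the DataFuture
# for hello.txt has resolved, so the Apps execute in dependency order.
count = count_words(inputs=[greeting.outputs[0]])
count.result()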
A. Execution

When instantiating the DFK, developers specify the specific execution providers and executors that will be used for executing the parallel components of the script. Execution providers are simple abstractions over computational resources, and executors provide an abstraction layer for executing tasks.

Parsl's execution interface is called libsubmit [17], a simple Python library that provides a common interface to execution resources. Libsubmit's interface defines operations such as submission, status, and job management. It currently supports a variety of providers, including the Amazon Web Services, Microsoft Azure, and Jetstream clouds, as well as the Cobalt, Slurm, Torque, GridEngine, and HTCondor local resource managers (LRMs). New execution providers can be easily added by implementing libsubmit's execution provider interface.
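As a concrete illustration, the sketch below configures a single site that executes Apps in a local thread pool. The Config and ThreadPoolExecutor names follow later Parsl releases; the version described here expressed the same information through the DataFlowKernel's configuration, so the exact class names are assumptions.

import parsl
from parsl.config import Config
from parsl.executors.threads import ThreadPoolExecutor

# One site: a thread-pool executor on the local machine. A cluster site
# would instead pair an executor with a libsubmit execution provider
# (e.g., Slurm or HTCondor) as described above.
parsl.load(Config(executors=[
    ThreadPoolExecutor(label='local_threads', max_threads=4)
]))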
Depending on the selected execution provider, there are a number of ways to submit workload to that resource. For example, for local execution, threads can be used, while for a cluster, pilot jobs or specialized launchers can be used. Parsl supports these different methods via its executor interface. Parsl currently supports three executors:

• ThreadPoolExecutor for multi-thread execution on local resources.
• IPyParallelExecutor for both local and remote execution using a pilot job model. The IPythonParallel controller is deployed locally and IPythonParallel engines are deployed on execution nodes. IPythonParallel then manages the execution of tasks on connected engines.
• Swift/TurbineExecutor for extreme-scale execution using the Swift/T (Turbine) [18] model to enable distributed task execution across an MPI environment. This executor is typically used on supercomputers.

It is important to note that Parsl scripts are not tied to a specific executor or execution provider. Furthermore, a single Parsl script may leverage multiple executors and execution providers concurrently, a model we refer to as multi-site execution. This allows Parsl developers to mix and match resources and execution models to meet their needs: for example, running a computational simulation on specialized HPC nodes, executing simple data manipulation tasks locally using threads, and rendering visualizations on GPU nodes.

B. Uniform execution model

Providing a uniform representation of heterogeneous resources is one of the most difficult challenges for parallel execution. Parsl provides an abstraction based on resource units called blocks. A block is a single unit of resources that is obtained from an execution provider. Within a block are a number of nodes. Parsl can then create TaskBlocks within and across (e.g., for MPI jobs) nodes. A TaskBlock is a virtual suballocation in which individual tasks can be launched. Figure 3 shows three different block configurations. The first configuration represents the simplest model, in which a block is comprised of a single node with a single TaskBlock. The second configuration, with several TaskBlocks in a single node, is well suited for executing many single-threaded applications on a multicore node. The final configuration shows a block comprised of several nodes and offering several TaskBlocks. This configuration is generally used by MPI applications that span nodes. It requires specific MPI launchers supported by the target system, such as aprun, srun, mpirun, and mpiexec.

Fig. 3: Parsl block model showing common configurations: (a) a block comprised of a node with one TaskBlock; (b) a block comprised of a node with several TaskBlocks; (c) a block comprised of four nodes with two TaskBlocks.

C. Parallelism and elasticity

Rather than precompile a static representation of the entire workflow, Parsl implements a dynamic dependency graph in which the graph is constructed as tasks are enqueued. As the Parsl script executes the workflow, new tasks are added to a queue for execution; tasks are then executed asynchronously when their dependencies are met. Parsl uses the selected executor(s) to manage task execution on the execution provider(s).

As Parsl manages a dynamic dependency graph, it does not know the full "width" of a particular workflow a priori. Further, as a workflow executes, the needs of the tasks may change, as might the capacity available on execution providers. Thus, Parsl must elastically scale the resources it is using. To do so, it includes an extensible flow control system to monitor outstanding tasks and available compute capacity. This monitor, which can be extended or implemented by users, determines when to trigger scaling (in or out) events.

Parsl provides a simple user-managed model for controlling elasticity. It allows users to prescribe the minimum and maximum number of blocks to be used on a given execution provider and a parameter (p) to control the level of parallelism, where parallelism is expressed as the ratio of TaskBlocks to active tasks. Each TaskBlock is capable of executing a single task at any given time. Therefore, a parallelism value of 1 represents aggressive scaling, in which as many resources as possible will be used; parallelism close to 0 represents the opposite situation, in which as few resources as possible (i.e., 1 TaskBlock) will be used.
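To illustrate the scaling rule, the following schematic (ours, not Parsl's actual flow-control code) computes the number of TaskBlocks that the strategy described above would request.

import math

def target_task_blocks(active_tasks, parallelism, min_blocks, max_blocks):
    # Aim for parallelism * active_tasks TaskBlocks, clamped to the
    # user-prescribed bounds. Illustrative only.
    desired = math.ceil(active_tasks * parallelism)
    return max(min_blocks, min(max_blocks, desired))

# With 100 active tasks, parallelism 1.0 requests one TaskBlock per task
# (capped by the maximum), while parallelism 0.01 consolidates all tasks
# onto a single TaskBlock.
print(target_task_blocks(100, 1.0, 1, 64))   # -> 64
print(target_task_blocks(100, 0.01, 1, 64))  # -> 1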
D. Data management

Parsl is designed to enable implementation of dataflow patterns in which the data passed between Apps manages the flow of execution. Dataflow programming models are popular as they can cleanly express, via implicit parallelism, the concurrency needed by many applications in a simple and intuitive way.

Parsl aims to abstract not only parallel execution but also execution location, which in turn requires data location abstraction. For Python Apps, Parsl uses a direct channel between the script and executors, based on Python object serialization. For files, Parsl implements a simple abstraction that can be used to reference data irrespective of its location. At present this model is limited to local files and files accessible via Globus [19].

The Parsl file abstraction is used to pass location-independent references between Apps. It requires that the developer initially define a file's location (e.g., /local/path/file or globus://endpoint/file). The file may then be passed to each App and, when executed, Parsl will translate the location to a locally accessible file path. In the case of Globus, an explicit staging model is supported in which the developer must select the execution site to which the file should be transferred. Parsl uses the Globus SDK and its native app authentication model [20] to authenticate with the Globus service and securely move data between endpoints.

E. Caching

When developing a workflow, developers often execute the same workflow repeatedly with incremental changes; this scenario is especially prevalent in interactive computing workflows. Often, large fragments of the workflow have not changed yet are computed again, wasting valuable developer time and computation resources. Caching of Apps (often called memoization) solves this problem by saving results from Apps that have completed so that they can be re-used. Parsl's caching model stores App results in an index alongside the App function, input parameters, and a hash of the function body. If caching is enabled, by an annotation on the App function or globally at the workflow level, the cache is interrogated before each App executes. Caching is supported for Python and Bash Apps. Users must explicitly enable caching to avoid issues with non-deterministic applications.
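A sketch (ours) of opting an App into caching follows; the cache flag on the decorator is taken from later Parsl documentation, so its exact spelling in the version described here is an assumption.

import parsl
from parsl import python_app
from parsl.configs.local_threads import config

parsl.load(config)

@python_app(cache=True)  # opt this App into memoization
def square(x):
    return x * x

first = square(5).result()   # executed and recorded in the cache
second = square(5).result()  # identical call, served from the cache
assert first == second == 25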
F. Checkpointing

Large-scale workflows are prone to errors due to node failures, application or environment errors, and myriad other issues. Parsl provides fault tolerance via an incremental checkpointing model, where each checkpoint call saves all results that have been updated since the last checkpoint was created. When loading checkpoints, if entries with results from multiple functions (with identical hashes) are encountered, only the last entry read will be considered. Checkpoints are loaded from checkpoint files when the DataFlow Kernel is initialized and written out to checkpoint files when explicitly requested.

IV. CASE STUDIES

We present three workflows implemented using Parsl to illustrate how it can satisfy the needs of different application domains. While these workflows have not yet been implemented in science gateways, they represent use cases that would benefit from gateway models.

SwiftSeq [21] is a bioinformatics workflow that supports aligning and genotyping gene panels, exomes, and whole genomes. The Parsl-based workflow is composed of approximately 10 applications that communicate by writing and reading files. While applications must often execute in sequence, there are also opportunities for parallelism. First, the workflow is often executed on many samples, each of which can be analyzed in parallel; second, the large genetic sequences can be divided up and analyzed in parallel; and finally, some of the applications themselves can also be executed in parallel. SwiftSeq benefits not only from Parsl's ability to specify such parallelism, but also from its ability to express a complex workflow, manage the flow of data between Apps, recover from errors, and execute on many computational resources.

Parsl has been used in computational chemistry to develop molecular dynamics workflows. In one example, PACKMOL [22] is used to assemble initial starting configurations of ionic liquid molecules with a protein (e.g., Trp-cage), before a GPU-accelerated version of Amber [23] is used to energy minimize, heat, equilibrate, and run production molecular dynamics simulations. The workflow relies on three separate applications that are executed iteratively to perform different functions: PACKMOL is used to generate the system configuration, AmberTools is used to create input coordinate and parameter files for the simulations, and Amber is used to run the various simulations. Parsl allows a wide range of different system configurations to be considered in parallel, and it also allows simple error handling logic to be expressed.

In materials science, researchers have used Parsl to predict the electronic stopping power of materials. Stopping power is the predominant energy-loss mechanism for charged particles and is important for applications related to radiation protection. Historically, the stopping power for a material is computed using analytical models such as the Lindhard model or using time-dependent density functional theory (TD-DFT). However, these methods are not suitable in all cases and are computationally expensive. The Parsl-based workflow uses TD-DFT calculations of a proton passing through a material [24], transforms that data to a representation compatible with machine learning, and then executes a number of machine learning algorithms to learn a predictive model. It finally applies these models from various directions to calculate a three-dimensional model of stopping power for a material. Parsl was used because it was able to trivially parallelize the existing Python codebase, support the composition of a sophisticated machine learning pipeline in a Jupyter notebook, and facilitate scalable execution of the pipeline from within the notebook on large-scale computing resources at the Argonne Leadership Computing Facility.

V. RELATED WORK

Many workflow systems have been developed to facilitate the expression and execution of arbitrary, data-oriented workflows, for example, the Swift parallel scripting language. Other systems include Pegasus [25] and Galaxy [12]. A weakness of these systems, however, is the need to develop the workflow in a separate representation (e.g., a graph). Parsl provides similar capabilities directly in a programming language that is broadly adopted by scientific users and, increasingly, science gateways.

There are a number of Python-based workflow tools that better match common research environments, for example, Dask [26], Apache Airflow [27], Luigi [28], and FireWorks [29].

Dask is a parallel computing library designed for parallel analytics. It allows users to trivially migrate their single-node analyses to a parallel execution environment. Unlike Parsl, Dask scripts use Dask-specific functions in place of common libraries and programming constructs, for example using the Dask DataFrame in place of the Pandas DataFrame. Like Parsl, Dask decomposes a script into a dependent task graph that controls the execution of code blocks. Parsl focuses on a broader problem, including the ability to execute arbitrary applications on heterogeneous computing resources and providing support for managing data dependencies between these executions.

Apache Airflow is a workflow engine written in Python. Developers can express directed acyclic graphs of independent tasks. The Airflow scheduler is then responsible for executing the tasks on distributed workers according to their dependencies. Unlike Parsl's implicit workflow model, Airflow relies on users expressing their workflows as explicit tasks with explicit relationships between those tasks. Thus, the job of the user is essentially to describe a task dependency graph in Python.

Luigi scripts are created by writing Python classes that extend the Luigi task model: developers implement functions that manage input and output data, the code that will be run, as well as the explicit dependencies on other tasks. Unlike Parsl, Luigi focuses on Python tasks rather than orchestrating the execution of external applications. Further, Luigi offers an execution model that deploys workers on a single cluster; it is not designed to support multiple sites, provide elastic resource management, or handle wide-area data staging.

FireWorks is a Python-based workflow engine designed for executing high-throughput workflows on supercomputers. Workflows are described in Python, JSON, or YAML as a collection of tasks that are connected together into a “FireWork” for execution. A centralized server manages the workflow, using a MongoDB database to provide persistence and to support reliable execution on distributed resources. FireWorkers are deployed on compute resources to execute tasks: they connect to the centralized server to request tasks, execute them, and return results. Unlike Parsl, FireWorks focuses on the reliable execution of long-running jobs and therefore may not be suitable for short-running jobs or applications that demand a high submission rate.

VI. SUMMARY

Parsl provides an easy-to-use model that can be readily integrated in science gateways to support the management and execution of workflows composed of Python functions and external applications. Science gateways benefit from the extensibility, scalability, and robustness of the Parsl model to manage execution of potentially complex workflows on arbitrary computational resources. Parsl is specifically designed to address new workflow modalities, such as interactive computing in Jupyter notebooks, and provides a seamless and transparent way to scale these analyses from within the notebook. Parsl abstracts the complexity of interacting with different resource fabrics and execution models, instead supporting the development of resource-independent Python scripts. It also includes a number of advanced capabilities such as automated elasticity, support for multi-site execution, fault tolerance, and automated direct and wide-area data management.

ACKNOWLEDGMENT

This work was supported in part by NSF award ACI-1550588 and DOE contract DE-AC02-06CH11357.

REFERENCES

[1] A. W. Toga, I. Foster, C. Kesselman, R. Madduri, K. Chard, E. W. Deutsch, N. D. Price, G. Glusman, B. D. Heavner, I. D. Dinov, J. Ames, J. Van Horn, R. Kramer, and L. Hood, “Big biomedical data as the key resource for discovery science,” Journal of the American Medical Informatics Association, vol. 22, no. 6, pp. 1126–1131, 2015.
[2] N. P. Tatonetti, P. P. Ye, R. Daneshjou, and R. B. Altman, “Data-driven prediction of drug effects and interactions,” Science Translational Medicine, vol. 4, no. 125, pp. 125ra31–125ra31, 2012.
[3] L. Ward and C. Wolverton, “Atomistic calculations and materials informatics: A review,” Current Opinion in Solid State and Materials Science, vol. 21, no. 3, pp. 167–176, 2017.
[4] N. Wilkins-Diehr, “Special issue: Science gateways - common community interfaces to grid resources,” Concurrency and Computation: Practice and Experience, vol. 19, no. 6, pp. 743–749, 2007.
[5] S. Marru, L. Gunathilake, C. Herath, P. Tangchaisin, M. Pierce, C. Mattmann, R. Singh, T. Gunarathne, E. Chinthaka, R. Gardler, A. Slominski, A. Douma, S. Perera, and S. Weerawarana, “Apache Airavata: A framework for distributed applications and computational workflows,” in Proceedings of the 2011 ACM Workshop on Gateway Computing Environments, 2011, pp. 21–28.
[6] P. Kacsuk, Z. Farkas, M. Kozlovszky, G. Hermann, A. Balasko, K. Karoczkai, and I. Marton, “WS-PGRADE/gUSE generic DCI gateway framework for a large variety of user communities,” Journal of Grid Computing, vol. 10, no. 4, pp. 601–630, Dec 2012.
[7] T. Glatard, M.-É. Rousseau, S. Camarasu-Pop, R. Adalat, N. Beck, S. Das, R. F. da Silva, N. Khalili-Mahani, V. Korkhov, P.-O. Quirion, P. Rioux, S. D. Olabarriaga, P. Bellec, and A. C. Evans, “Software architectures to integrate workflow engines in science gateways,” Future Generation Computer Systems, vol. 75, pp. 239–255, 2017.
[8] M. Wilde, M. Hategan, J. M. Wozniak, B. Clifford, D. S. Katz, and I. Foster, “Swift: A language for distributed parallel scripting,” Parallel Computing, vol. 37, no. 9, pp. 633–652, Sep. 2011.
[9] K. Chard, S. Tuecke, and I. Foster, “Efficient and secure transfer, synchronization, and sharing of big data,” IEEE Cloud Computing, vol. 1, no. 3, pp. 46–55, Sept 2014.
[10] A. N. Adhikari, J. Peng, M. Wilde, J. Xu, K. F. Freed, and T. R. Sosnick, “Modeling large regions in proteins: Applications to loops, termini, and folding,” Protein Science, vol. 21, no. 1, pp. 107–121, 2012.
[11] J. Krüger, R. Grunzke, S. Gesing, S. Breuers, A. Brinkmann, L. de la Garza, O. Kohlbacher, M. Kruse, W. E. Nagel, L. Packschies, R. Müller-Pfefferkorn, P. Schäfer, C. Schärfe, T. Steinke, T. Schlemmer, K. D. Warzecha, A. Zink, and S. Herres-Pawlis, “The MoSGrid science gateway – a complete solution for molecular simulations,” Journal of Chemical Theory and Computation, vol. 10, no. 6, pp. 2232–2245, 2014.
[12] E. Afgan, D. Baker, M. van den Beek et al., “The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2016 update,” Nucleic Acids Res., vol. 44, no. W1, p. W3, 2016.
[13] A. Gerow, Y. Hu, J. Boyd-Graber, D. M. Blei, and J. A. Evans, “Measuring discursive influence across scholarship,” Proceedings of the National Academy of Sciences, 2018.
[14] Y. N. Babuji, K. Chard, and E. Duede, “Enabling interactive analytics of secure data using Cloud Kotta,” in 8th Workshop on Scientific Cloud Computing, ser. ScienceCloud ’17, 2017, pp. 9–15.
[15] M. McLennan and R. Kennell, “HUBzero: A platform for dissemination and collaboration in computational science and engineering,” IEEE Des. Test, vol. 12, no. 2, pp. 48–53, Mar. 2010.
[16] T. Bicer, D. Gursoy, R. Kettimuthu, I. T. Foster, B. Ren, V. D. Andrede, and F. D. Carlo, “Real-time data analysis and autonomous steering of synchrotron light source experiments,” in 13th IEEE International Conference on e-Science (e-Science), Oct 2017, pp. 59–68.
[17] “Libsubmit,” https://github.com/Parsl/libsubmit.
[18] T. G. Armstrong, J. M. Wozniak, M. Wilde, and I. T. Foster, “Compiler techniques for massively scalable implicit task parallelism,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’14, 2014, pp. 299–310.
[19] K. Chard, S. Tuecke, and I. Foster, “Efficient and secure transfer, synchronization, and sharing of big data,” IEEE Cloud Computing, vol. 1, no. 3, pp. 46–55, Sept 2014.
[20] S. Tuecke, R. Ananthakrishnan, K. Chard, M. Lidman, B. McCollam, S. Rosen, and I. Foster, “Globus Auth: A research identity and access management platform,” in 12th IEEE International Conference on e-Science (e-Science), Oct 2016, pp. 203–212.
[21] J. Pitt, “SwiftSeq,” http://www.igsb.org/software/swiftseq.
[22] L. Martínez, R. Andrade, E. G. Birgin, and J. M. Martínez, “PACKMOL: A package for building initial configurations for molecular dynamics simulations,” J. Comp. Chemistry, vol. 30, no. 13, pp. 2157–2164, 2009.
[23] D. A. Case, T. E. Cheatham, T. Darden, H. Gohlke, R. Luo, K. M. Merz et al., “The Amber biomolecular simulation programs,” J. Comp. Chemistry, vol. 26, no. 16, pp. 1668–1688, 2005.
[24] A. Schleife, Y. Kanai, and A. A. Correa, “Accurate atomistic first-principles calculations of electronic stopping,” Phys. Rev. B, vol. 91, p. 014306, Jan 2015.
[25] E. Deelman, G. Singh, M.-H. Su, J. Blythe, Y. Gil et al., “Pegasus: A framework for mapping complex scientific workflows onto distributed systems,” Scientific Programming, vol. 13, no. 3, pp. 219–237, 2005.
[26] M. Rocklin, “Dask: Parallel computation with blocked algorithms and task scheduling,” in Proc. 14th Python in Sci. Conf., 2015, pp. 130–136.
[27] Apache Airflow Project, “Apache Airflow,” https://airflow.incubator.apache.org/.
[28] Spotify, “Luigi,” https://github.com/spotify/luigi.
[29] A. Jain, S. P. Ong, W. Chen, B. Medasani, X. Qu, M. Kocher, M. Brafman, G. Petretto, G. Rignanese, G. Hautier, D. Gunter, and K. A. Persson, “FireWorks: A dynamic workflow system designed for high-throughput applications,” Concurrency and Computation: Practice and Experience, vol. 27, no. 17, pp. 5037–5059, 2015.