Bringing Scientific Workflow to the Masses via Pegasus and HUBzero

Michael McLennan, Steven Clark
HUBzero Platform for Scientific Collaboration, Purdue University, West Lafayette, IN, USA
mmclennan@purdue.edu, clarks@purdue.edu

Derrick Kearney, Carol Song
Purdue University, West Lafayette, IN, USA
dsk@purdue.edu, carolxsong@purdue.edu

Frank McKenna
University of California, Berkeley, Berkeley, CA, USA
fmckenna@ce.berkeley.edu

Ewa Deelman, Mats Rynge, Karan Vahi
Information Sciences Institute, University of Southern California, Marina del Rey, CA, USA
deelman@isi.edu, rynge@isi.edu, vahi@isi.edu

Abstract—Scientific workflow managers are powerful tools for handling large computational tasks. Domain scientists find it difficult to create new workflows, so many tasks that could benefit from workflow automation are often avoided or done by hand. Two technologies have come together to bring the benefits of workflow to the masses. The Pegasus Workflow Management System can manage workflows comprising millions of tasks, all the while recording data about the execution and intermediate results so that the provenance of the final result is clear. The HUBzero platform for scientific collaboration provides a venue for building and delivering tools to researchers and educators. With the press of a button, these tools can launch Pegasus workflows on national computing infrastructures and bring results back for plotting and visualization. As a result, the combination of Pegasus and HUBzero is bringing high-throughput computing to a much wider audience.

Keywords—computation; workflow; collaboratories; user interfaces; automation

This work was supported by the US National Science Foundation through grants CBET-0941302, CMMI-0927178, OCI-1148515, and OCI-0943705.

I. INTRODUCTION

Computation has brought about a revolution in science. As software was created to analyze large amounts of data, researchers found they could gather even more data, which necessitated even more software and more computing power. Today, many analysis tasks require not just a single program or computer, but dozens of different programs running across thousands of computational nodes. Handling any one of these tasks requires a complex orchestration of moving data to appropriate nodes, finding or staging the executables, starting jobs, handling data flow dependencies, and overcoming job failures.

Over the past decade, several frameworks have been created to support the execution of such large computational tasks. The Pegasus Workflow Management System (Pegasus WMS) [1] and its workflow engine, the directed acyclic graph manager (DAGMan) within HTCondor [2], were built to manage thousands of jobs in a high-throughput computing environment. Taverna [3] graphically connects bioinformatics web services together into a coherent flow. Kepler [4] also provides graphical user interfaces for workflow composition and supports different workflow execution models. The SHIWA simulation platform uses the WS-PGRADE/gUSE technology to provide workflow interoperability across a number of systems [5]. More recently, Galaxy [6] has gained popularity in the bioinformatics domain, particularly because it tailors its environment to that community.

Once a particular workflow has been established, it is relatively easy for others to execute it. However, creating new workflows has proven to be an enormous challenge—especially for domain scientists with very little background in computer science. Creating a workflow is a lot like writing a program that calls standard library functions. The author not only must have a deep understanding of the science but also must understand the interface for each component of the workflow, including the data files required and the output files produced. The output of one component may need format conversions or additional processing before it can be used by another component. The author must know what programs are available to perform conversions, or in some cases, the author must write small conversion programs (often called "shims") to complete the workflow. Components of the workflow may have limitations or may fail when given certain combinations of input parameters. The author must understand that too, and work around such issues.

Several projects have tried to simplify the creation of workflows by building drag-and-drop editors. Taverna [3], Kepler [4], Galaxy [6], and GridNexus [7] all allow users to drag blocks onto a canvas and connect inputs and outputs together in a flow. But while these applications simplify the task of expressing a workflow, they do not solve the inherent problems of understanding components, building shims, or working around limitations. The Taverna project has taken one step further and created a web site at MyExperiment.org for publishing and sharing workflows [8].

Our approach takes workflow yet another step further by delivering live simulation tools embodying workflows to the masses via a web browser. These tools are powered by Pegasus WMS running on Open Science Grid [9], DiaGrid [10], and XSEDE resources [11]. The tools are created by the few skilled users within a community, but wrapped in graphical user interfaces with integrated plotting and visualization, and delivered to the masses via the HUBzero software platform.

There are now more than 40 hubs covering a wide range of scientific disciplines, including nanotechnology, earthquake mitigation, manufacturing, healthcare, pharmaceuticals, volcanoes, and climate change. There is also a hub for the HUBzero community at http://hubzero.org, which includes a list of known hubs. All together, these hubs have served more than 850,000 visitors from 172 countries worldwide—not over all time, but during the calendar year 2012 alone. Pegasus and HUBzero act as a conduit for these users, giving them fingertip access to national Grid computing infrastructures.
II. THE PEGASUS WORKFLOW MANAGEMENT SYSTEM

The Pegasus Workflow Management System (or Pegasus) manages the execution of scientific workflows on desktops, private clusters, campus clusters, grids, and academic and commercial clouds. It automatically locates the input data and computational resources needed for workflow execution; thus, it allows scientists to specify their workflows at a high level of abstraction (devoid of resource information), maps this description onto the available execution resources, and then reliably executes the resulting plan. As a result, Pegasus workflows are easy to compose and are portable across heterogeneous cyberinfrastructure.

Pegasus is capable of executing workflows on a single resource or across resources. Even a single workflow description can be distributed and executed across the cyberinfrastructure. Pegasus has been used to run workflows ranging from just a few computational tasks up to millions. When errors occur, Pegasus tries to recover when possible by retrying tasks, by retrying the entire workflow, by providing workflow-level checkpointing, by re-mapping portions of the workflow, by trying alternative data sources for staging data, and, when all else fails, by providing a rescue workflow containing a description of only the work that remains to be done [1]. Thus, it does what it can to shield the user from errors and intermittent issues with the computing resources. It cleans up storage as the workflow is executed so that data-intensive workflows have enough space to execute on storage-constrained resources [12]. Pegasus keeps track of what has been done (provenance), including the locations of data used and produced, and which software was used with which parameters [13],[14].

In order to support the abstract workflow specifications, which let scientists concentrate on their science rather than on the operational aspects of the cyberinfrastructure, mapping technologies are needed to automatically interpret and map user-defined workflows onto the available resources. The workflow mapping process involves finding the appropriate software, data, and computational resources required for workflow execution. The mapping process can also involve restructuring the workflow to optimize performance and adding transformations for data management and provenance information generation. DAGMan, Pegasus's workflow engine, relies on the resources (compute, storage, and network) defined in the executable workflow to perform the necessary actions. Individual workflow tasks are managed by a task scheduler (Condor), which supervises their execution on local and remote resources.
III. HUBS FOR SCIENTIFIC COMMUNITIES

A. Building a Community on nanoHUB.org

In 2002, the US National Science Foundation created the Network for Computational Nanotechnology (NCN), a collection of universities engaged in simulation and modeling for the nanotechnology community [15]. NCN established a web presence at nanoHUB.org and offered an array of simulation and modeling tools, along with seminars, tutorials, courses, and other supporting materials. Over the years, nanoHUB.org evolved into a software platform supporting collaboration in private group areas, discussion forums, ratings and reviews, and most importantly, deployment of new simulation tools via its unique middleware. In 2007, the underlying software for nanoHUB.org was spun off as a separate project, the HUBzero® Platform for Scientific Collaboration [16]. Since then, it has been used to create similar hubs for many other scientific areas—all supported by the same underlying middleware and content management system.

nanoHUB grew from 10 tools and 1,000 users in 2002 to more than 260 tools and 250,000 users today [17]. User contributions were key to that growth. Each new tool, seminar, or tutorial brought new users to the site, some of whom contributed their own new content. This virtuous cycle was fueled by an open content management system that enabled users to upload and deploy their own content according to their own schedule, with minimal intervention from the nanoHUB team.

Uploading and publishing a single document or PowerPoint presentation is one thing; uploading and deploying a simulation tool is quite another. Tools have hundreds of individual source code files; they must be compiled and tested within the hub environment; and all of this must be done by untrusted users with minimal intervention. The HUBzero infrastructure was designed to support this activity. Researchers upload their code into private project areas with a Subversion [18] repository for source code control; they compile, test, and modify their code within a secure execution container managed by OpenVZ [19] running on a cloud of execution hosts; and they approve tools for deployment via a Web-based content management system. All of this enables their code to be separated from the system and from other users, and yet deployed in a secure and scalable manner.

End users find a tool on a hub and press a button to launch it. The tool is launched in a Linux/X11 environment within a secure container on the "cloud of execution hosts," and the graphical output from the session is piped through the Web to the user's browser via VNC [20]. The result is a live, interactive simulation tool running in a secure environment, capable of running workflows on national computing infrastructure, and accessible to any user via a Web browser without any special software installation.

There are many other gateway frameworks, including the Distributed Application Runtime Environment [21], Open Grid Computing Environments [22], and WS-PGRADE/gUSE [5], but none provide direct access to interactive tools and scientific visualization. This is a distinct advantage of the HUBzero platform.

Many different users work together within each hub community, and their interactions are mediated by the HUBzero platform. A hub owner creates a new hub and engages others in the community to seed the site with initial content. Users browse the site and take advantage of interesting content, perhaps by watching a seminar or running a simulation tool. Users also contribute their own content to the site, including new workflows and simulation tools. Hub administrators help to compile and stage new simulation tools and ultimately approve all content before it is released.
B. NEES.org and OpenSees Laboratory

Other hubs leverage the same HUBzero infrastructure to support different toolsets for their own communities. In 2009, the NSF George E. Brown Network for Earthquake Engineering Simulation (NEES) moved operations to Purdue and created a hub for the civil engineering community. Today, NEES.org offers more than 65 simulation and data analysis tools used to understand the damage caused by earthquakes and improve building design. One of these tools leverages an open source code, the Open System for Earthquake Engineering Simulation (OpenSees) [23], to provide a collection of utilities for structural and geotechnical engineers.

One of the utilities within this OpenSees Laboratory tool [24] is the Moment Frame Earthquake Reliability Analysis. This tool is intended to demonstrate to engineers the importance of uncertainty in their models and the need to use probabilistic methods when determining important response measures. The tool prompts the user for the material properties (including probability distributions), the building layout, and a list of earthquake ground motion records. On submission, the tool automatically builds and executes a Pegasus workflow. The workflow manages several hundred to tens of thousands of separate OpenSees jobs on the Open Science Grid (OSG), which together perform a reliability analysis of the building frame under seismic loads, as shown in Fig. 1. The resulting plots show the variation in roof displacement, interstory drift, and base shear given the uncertainty in the materials for each of the earthquake motions specified. A single red dot for each earthquake shows the engineer what would happen if only the mean value had been used for the material properties for that earthquake. The advantage of using Pegasus and the OSG for such a workflow is that Pegasus recognizes that most of the computation can be performed in parallel, and the OSG permits concurrent job submission and execution. As a consequence, results can be obtained in minutes or hours instead of the hours, weeks, or even months that a series of sequential analyses would require.

Fig. 1. The OpenSees Laboratory on NEES.org, like all tools managed by HUBzero, uses a graphical user interface to prompt for input values and plot results. Tool sessions run within secure containers on an execution host, and can submit individual runs or Pegasus workflows out to remote sites.

To date, OpenSees Laboratory has served 950 simulation users from the US, China, Canada, Italy, Japan, and other countries all over the world. About a third of these users come from academic institutions. All of them are driving OpenSees through a graphical user interface, perhaps without even knowing if and when they are launching complex workflows.

C. DiaGrid.org and BLASTer, SubmitR

DiaGrid is a high-throughput computing resource utilizing the Condor system, with more than 50,000 cores from machines at Purdue University and 9 other campuses. Since its inception in 2005, DiaGrid has handled more than 90,000,000 simulation jobs.

In 2011, Purdue established a hub at DiaGrid.org to provide more widespread access to the underlying DiaGrid execution pool. DiaGrid.org offers a tool called BLASTer [25], which can be used to run BLAST (Basic Local Alignment Search Tool) [26], a bioinformatics tool for analyzing DNA sequence data. Users upload a query file containing one or more nucleotide sequences and search against standard databases from the National Center for Biotechnology Information (NCBI), or against their own custom databases. The run for a single sequence may take anywhere from 10 seconds to 30 minutes, but many researchers bundle thousands of searches into a single request. Performed serially, such a search might take several days or weeks of computation. BLASTer automatically divides the query file into chunks and creates a Pegasus workflow to distribute the search over hundreds of DiaGrid nodes, so the end-to-end execution time can be shortened from weeks to a matter of hours. Shared storage allows the NCBI standard databases, updated regularly, to be available to all users without requiring file transfer, as shown in Fig. 2, thereby further reducing the time to results.

Fig. 2. The BLASTer tool on DiaGrid.org manages the execution of BLAST programs on DiaGrid. NCBI sequence databases are kept up to date on shared storage, so they do not need to be transferred for each run.
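The chunking step at the heart of BLASTer is straightforward. The Python sketch below is an illustration of the technique, not BLASTer's actual code: it splits a multi-sequence FASTA query file into chunks of a fixed number of sequences, and each chunk file would then become the input of one BLAST job in the generated Pegasus workflow. The chunk size and file names are hypothetical.

# Illustrative sketch only -- not BLASTer's actual code.
# Split a FASTA query file into chunks of N sequences; each chunk
# becomes the input file for one BLAST job in the Pegasus workflow.

def write_chunk(lines, index):
    with open('query.%03d.fa' % index, 'w') as f:   # hypothetical naming
        f.writelines(lines)

def split_fasta(path, seqs_per_chunk=100):
    chunk, nseq, nchunk = [], 0, 0
    for line in open(path):
        if line.startswith('>'):                    # '>' starts a new sequence
            if nseq > 0 and nseq % seqs_per_chunk == 0:
                write_chunk(chunk, nchunk)          # flush a full chunk
                chunk, nchunk = [], nchunk + 1
            nseq += 1
        chunk.append(line)
    if chunk:
        write_chunk(chunk, nchunk)                  # flush the remainder
    return nchunk + 1                               # number of chunk files

if __name__ == '__main__':
    n = split_fasta('query.fa')
    print('%d chunks ready for one BLAST job each' % n)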
DiaGrid.org offers a similar tool for statistical analysis scripts written in the R programming language [27]. It is quite common for researchers to run a particular script over a wide range of input parameters to explore the parameter space or to perform a sensitivity analysis. The SubmitR tool [28] takes a script, data files, and a specification of parameter values, and automatically builds and executes a Pegasus workflow to run a large number of R jobs in parallel on DiaGrid. Again, this appeals to the domain scientist who is quite familiar with R but knows nothing about workflows, Condor, or DiaGrid.

IV. CREATING SCIENTIFIC WORKFLOWS

A. Job Submission via "submit"

The integration of Pegasus and HUBzero certainly makes it easier for end users to launch workflows, but it also makes it easier for tool developers to build and test new workflows. Each hub comes with a tool called a Workspace, which is a full-featured Linux desktop accessible via the Web. Tool developers use this to upload, compile, and test their code before deploying a tool. Experienced researchers also use this to get past the constraints of a graphical interface—to edit files, write simple scripts, and dispatch computational jobs.

HUBzero provides a command called "submit," which is used within the Workspace to dispatch jobs, as shown in Fig. 3. In this example, the command line "spice3 -b circuit" represents the research code that is being executed. Electrical engineers will recognize this as the SPICE3F4 program created back in the 1970s at UC Berkeley and still in use today [29]. But the approach we describe would apply to any such research code amenable to command-line execution. The command spice3 is the executable, the flag -b means to execute in batch mode, and the file circuit contains the netlist representing the electrical circuit being simulated.

(a) spice3 -b circuit
(b) submit -v DiaGrid spice3 -b circuit
(c) submit -p @@res=100,1k,10k spice3 -b @:circuit
(d) submit -p @@res=100,1k,10k -p @@cap=1u,10u,100u spice3 -b @:circuit
(e) submit -d indata.csv spice3 -b @:circuit
(f) submit -p @@num=1:100 spice3 -b circuit@@num
(g) submit -p @@file=glob:circuit* spice3 -b @@file
(h) submit pegasus-plan --dax myworkflow.dax

Fig. 3. HUBzero's submit command makes it easy to send jobs off to remote sites, and includes support for parameter sweeps and Pegasus workflows.

When invoked directly on the command line as in Fig. 3(a), the spice3 program runs locally within the execution host managing the tool session (see Fig. 1). However, prefixing the usual command line with submit will send the job off to remote execution venues, such as DiaGrid or the Open Science Grid. The "-v DiaGrid" arguments shown in Fig. 3(b) request that the job be sent to DiaGrid. The submit command will automatically transfer files such as circuit that are required for simulation. It will speak the appropriate protocol—currently Condor, Condor-G, LoadLeveler, LSF, SGE, PBS, or SLURM—to queue the job. And it will monitor progress and transfer results back to the execution host once the run is complete. It runs all jobs with a common credential owned by the hub, so individual users need not have an account or any credentials for the remote venue. The hub tracks all jobs, including those run on remote venues, so that if a security incident arises, a complete report can be generated for any affected venue. The same accounting is used to produce monthly reports of the overall simulation usage for each hub.
B. Simple Workflows: Parameter Sweeps

The submit command can also manage parameter sweeps with a very simple command line, as shown in Fig. 3(c). The -p flag indicates that the next argument is a parameter specification. In this case, the parameter @@res takes three values: 100, 1k, and 10k. The @: prefix before the circuit file indicates that this file should be processed with parameter substitutions. In other words, the submit command treats circuit as a template file and substitutes all occurrences of @@res with a particular parameter value. In this case, it produces three separate input files—one for each of the parameter values. Then, it automatically builds a Pegasus workflow to manage the jobs and dispatches the workflow off to an appropriate venue. Since the optional -v argument was not specified in this case, the submit command will automatically select a venue that supports Pegasus job submission and data transfer interfaces, such as DiaGrid or Open Science Grid.

A simulation run may have multiple parameters. In Fig. 3(d), the parameter @@res is swept through three values, and the parameter @@cap is swept through three values. The submit command creates 9 different input files representing all possible combinations of these two parameters, builds a Pegasus workflow to manage the jobs, and then dispatches the workflow to an appropriate venue.

Instead of specifying input parameters on the command line, the values for all combinations of parameters can be stored in a comma-separated value (CSV) file. In Fig. 3(e), the arguments "-d indata.csv" indicate that parameter values should be loaded from the file indata.csv. The first line should contain the parameter names for the columns, such as "@@res,@@cap" in this example. Each of the remaining lines represents the parameter values for a single job, such as "100,1u" or "10k,100u". Passing parameters in this way is useful when the desired values are not a simple combination of values, but perhaps a series of values generated by a genetic optimization algorithm or by a Latin hypercube sampler.
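For concreteness, consider a hypothetical circuit template containing the line

    R1 1 2 @@res

The command in Fig. 3(c) would generate three input files in which this line reads "R1 1 2 100", "R1 1 2 1k", and "R1 1 2 10k", one per job. Similarly, an indata.csv for Fig. 3(e) might contain (the values shown are illustrative):

    @@res,@@cap
    100,1u
    10k,100u

yielding one job per row, with both parameters substituted into the template at once.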
Instead of relying on template files and parameter substitutions, the user may generate a series of separate input files. For example, suppose a user has 100 different circuit files to evaluate. The syntax shown in Fig. 3(f) supports this scenario. The arguments "-p @@num=1:100" define a parameter @@num that takes on integer values from 1 to 100. In this case, the circuit input file is specified as circuit@@num, so the submit command builds a Pegasus workflow with 100 different command lines, substituting the value of @@num for each job to produce commands like "spice3 -b circuit1", "spice3 -b circuit2", and so forth, referencing the series of numbered files that the user created by hand for this run.

Input files can also be recognized using glob-style pattern matching. The arguments "-p @@file=glob:circuit*" in Fig. 3(g) define a parameter named @@file that takes on all values matching the glob-style pattern circuit*. With this syntax, the files could be distinguished by different numbers, different letters, or other naming conventions, but all files that submit can find matching circuit* will be simulated as separate cases, with @@file representing the whole name of each matching file, such as circuit1, circuit-2b, circuitXYZ, etc.

C. Customized Workflows in Pegasus

More experienced users can learn how to create their own customized Pegasus workflows. Each workflow is expressed as a high-level directed acyclic graph (DAG) called a DAX (for DAG XML description). Each node in the DAX represents a job in the workflow, and the edges between nodes represent data files that flow from one job to the next. The DAX is similar to the Condor DAGMan file [30] but at a higher level of abstraction, so that it can be targeted to different execution environments.

Each DAX is ultimately expressed as an XML file. Writing that file by hand, however, can be tedious and impractical. Many developers prefer to express their workflow as a small program written in Python, Java, or Perl. Pegasus provides Application Programming Interfaces (APIs) in these languages to help construct the needed XML. The workflow designer builds up the nodes and edges of a DAX object, and the API is used to write out the object in XML format. Running the program, therefore, produces a DAX file expressing the entire workflow, suitable for submission.
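As a minimal sketch of such a program, assuming the DAX3 Python API distributed with Pegasus (the job and file names here are hypothetical), a 100-case study with a final collection step could be expressed as follows:

#!/usr/bin/env python
# Minimal sketch using the Pegasus DAX3 Python API; job and file
# names are hypothetical. Writes the workflow XML to stdout.
import sys
from Pegasus.DAX3 import ADAG, Job, File, Link

dax = ADAG("example-sweep")              # top-level workflow object

post = Job(name="collect")               # one job to gather all results
dax.addJob(post)

for i in range(100):
    inp = File("circuit%d" % i)          # input deck for this case
    out = File("results%d.out" % i)      # raw output of this case
    job = Job(name="spice3")
    job.addArguments("-b", inp)
    job.uses(inp, link=Link.INPUT)       # declare data flow so Pegasus
    job.uses(out, link=Link.OUTPUT)      # can stage files in and out
    dax.addJob(job)
    post.uses(out, link=Link.INPUT)
    dax.depends(parent=job, child=post)  # collect runs after each case

dax.writeXML(sys.stdout)                 # emit the DAX for pegasus-plan

Redirecting this script's output to myworkflow.dax produces exactly the kind of file that the command in Fig. 3(h) submits.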
The submit command described earlier is also used to launch Pegasus workflows, as shown in Fig. 3(h). This time the programs being launched (such as spice3) do not appear on the command line; instead, they are encoded within the DAX description of the workflow, which in this case is a file named myworkflow.dax. The submit command invokes another program called pegasus-plan, which reads the DAX, "compiles" the workflow for a specific execution venue, then launches the workflow and manages execution.

For example, the Moment Frame Earthquake Reliability Analysis tool, discussed in Section III.B, uses both the Python API provided by Pegasus and the submit command. When the user launches a simulation, the tool invokes a shell script that first runs a Python script (opensees-dax.py) to create the XML file and then invokes the submit command with the pegasus-plan program and the XML file to run the workflow on the OSG, as follows:

#!/bin/sh
python opensees-dax.py --nMat=$1 --nMotion=$2 > dax.xml
submit pegasus-plan --dax dax.xml

The workflow compilation or "planning" step is the strength of the Pegasus approach. It optimizes the workflow by removing any branches where partial results are in place and where dependencies are already satisfied. It also augments the workflow, adding tasks to stage data files in and out of remote resources and to clean up unnecessary results after execution. And it can partition the workflow and cluster short-running jobs together, so that they run more efficiently on a single compute node. It does all of this at a high level, freeing the user from having to request or identify specific resources. During execution, Pegasus dispatches jobs, monitors progress, and automatically retries failing jobs so that sporadic failures do not derail the entire workflow. If a workflow does fail (for example, if the venue goes down, or if a particular job triggers a bug in the code), it can be corrected and restarted by the user. Pegasus will recognize partial results and pick up where it left off to complete the workflow.

V. BUILDING AND DEPLOYING APPLICATIONS

A. Graphical User Interfaces and the Rappture Toolkit

Researchers can build their own graphical user interfaces on top of the basic submit capability, so that other (perhaps less experienced) users can access a simplified interface and launch powerful workflows for targeted applications, as described earlier in Section III. Such tools can be created using MATLAB®, Java, Qt, GTK+, wxWidgets, or any other toolkit that runs under Linux/X11. For example, the interfaces for BLASTer and SubmitR were created using Java and Python, respectively.

HUBzero's Rappture toolkit provides an easy way to create such graphical interfaces. Rappture comes pre-installed within the Workspace tool and includes an interactive "builder" application for new tools. Researchers specify the inputs and outputs for each new tool by dragging objects from a palette of available controls and dropping them into an interface specification tree. Each object is given a label, a description, a default value, and other attributes, such as units of measure for numeric values. The builder provides a preview of the resulting interface and generates errors and warnings if any elements are missing or incomplete. The builder saves each interface in an XML description file, and it can also generate a skeleton for the main program of the tool in a variety of programming languages, including C/C++, Fortran, MATLAB, Octave, Java, Python, Perl, R, Ruby, and Tcl/Tk. The researcher modifies the body of the skeleton code to include the core of the tool, then runs the rappture command to produce the graphical interface.

However a graphical interface is created, it acts as an intermediary for the user, gathering input values, launching simulation runs, and plotting results. Remote jobs and workflow submissions are handled via something like the C language system() function, which forks and executes a separate shell process to handle a command line string with the submit command shown earlier in Fig. 3. Scripting languages are particularly well suited for this task, so it is quite common to build an interface tool in Python, for example, which calls submit to dispatch a complex workflow built from a series of C/C++ and Fortran programs.
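The core of such a Python tool might look like the following sketch. It assumes the Rappture Python bindings available within a hub tool session; the Rappture element paths, file names, and the generated netlist are hypothetical, and it assumes the remote job's output is staged back as a local file.

# Sketch of a Python interface tool; Rappture element paths, file
# names, and the generated netlist are hypothetical.
import subprocess
import sys
import Rappture

io = Rappture.library(sys.argv[1])            # driver file passed in by the GUI
res = io.get('input.number(res).current')     # resistance entered by the user

with open('circuit', 'w') as f:               # render a trivial input deck
    f.write('* generated deck\nV1 1 0 1\nR1 1 2 %s\nC1 2 0 1u\n.end\n' % res)

# Dispatch to a remote venue. submit transfers the input file, queues
# the job, and blocks until results are staged back to the session.
with open('spice.out', 'w') as out:
    status = subprocess.call(['submit', '-v', 'DiaGrid',
                              'spice3', '-b', 'circuit'], stdout=out)

if status == 0:                               # relay the output to the GUI
    io.put('output.string(log).current', open('spice.out').read())
Rappture.result(io)                           # signal completion to Rappture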
B. Tool Publication Process

HUBzero includes a powerful content management system for uploading and deploying many different types of content, including seminars, tutorials, teaching materials, and most importantly, computational tools. Any researcher within a particular hub community can click on the Upload link, fill out a form describing their new tool, and get immediate access to a private project area complete with a Subversion source code repository and a wiki area for project notes. The researcher can launch the Workspace tool, check out the latest source code, and compile and test within the workspace environment. The researcher can build Pegasus workflows and test them right at the command line by typing the command shown earlier in Fig. 3(h). Once that part is working, the researcher can create a graphical interface using Rappture or any other toolkit, and embed the same submit command within the code of their new tool.

Once the tool is working properly, the researcher can visit the page representing their tool contribution and click a link saying, "My code is committed, working, and ready to be installed." This signals the hub administrators to check out the latest code, compile it, and install it in the official /apps directory under a subdirectory with the tool name and its revision number from the source code control system.

Once a tool is installed, the researcher receives a message to begin testing. Clicking a Launch Tool button brings up a preview of the final tool, exactly as other users of the hub would see it. Once the tool has been verified, the researcher can click a link saying, "My tool is working properly. I approve it."

That prompts the hub administrators to take one last look, verify that the tool is indeed functional and has an adequate description page, and then move the tool to the "published" state, where it becomes available to other users according to the restrictions set by the tool authors. A tool can be open to the world, protected by export control, or accessible only by a certain group of users. If a tool is published as Open Source, an archive of the source code is presented for download alongside the Launch Tool button, and the terms of the Open Source license are clearly displayed on the tool page. All tools have a unique Digital Object Identifier (DOI), so they can be cited within academic publications.

C. Ongoing Tool Support

The publication of a tool is not the end of the story. Other users may encounter a bug in the tool and file a support ticket on the hub. Any ticket filed on a live tool session page is automatically routed to the tool authors and becomes their responsibility to fix. The authors receive email about the ticket, and they can see it listed on the My Tickets module within their My Account dashboard page. They can update the ticket to communicate with the user, check out the code within a Workspace, commit a fix, and close the ticket. When the code is ready for another release, the authors click a link saying, "I've made changes. Please install the latest code for testing and approval." This takes the tool back to the "updated" state, causing hub administrators to stage the latest version and letting the tool authors test and approve it, leading to an updated tool publication.

Each tool also has its own question/answer forum for community discussion. Another user might not understand the physics within the tool and post a question. One of the tool authors, or perhaps another user, might follow up with an answer. Researchers can have threaded discussions within the question/answer area, and they earn points for participation. The points can be redeemed for merchandise on some hubs, traded for other favors, or used as bragging rights.

End users may like a tool but have requests for improvement, such as support for extra model parameters or material types. They can post such ideas on the wish list for each tool. Other users can vote the ideas up or down. The tool authors can evaluate each wish on a scale of importance and effort, so that wishes can be sorted in order of priority. Ideas that are deemed important and low effort bubble up to the top of the list, while those that are unimportant and high effort sink to the bottom. When any wish is granted, the original user is notified, and the tool author may earn points for fulfilling the request.
VI. INTEGRATION CHALLENGES

During the integration work, we encountered several challenges, including data management, feedback on workflow progress in the user interface, and providing workflow metrics from Pegasus to the HUBzero framework.

When the integration effort started, the most recent Pegasus release was version 3. The data management in that version was still based on a model of high-performance computing (HPC) systems, with assumptions such as a shared parallel filesystem being mounted on the compute nodes used for the workflow. Under this data management model, when a workflow was submitted, a particular compute resource had to be chosen, the data had to be transferred in to the shared filesystem, and after the jobs were run, the output data had to be staged out. These steps were handled by Pegasus, but the fact that a resource had to be chosen up front meant that either the hub or the end user would have to select a resource to run on. In the case of high-throughput workloads, for which it does not matter where the jobs run, selecting a resource was not a very user-friendly solution, since at any given moment a particular resource might be heavily loaded or broken, while other venues might be working fine. This particular integration problem went away with the release of Pegasus 4.0, which added two new data management approaches: workloads can either use Condor IO for file transfers, or a shared staging storage element (such as a GridFTP or S3 server) for storing intermediate data products. Workflows could then span multiple compute resources. Today, most of the hub workflows use a simple site catalog (a catalog describing the various available resources and the services they provide) listing either DiaGrid or Open Science Grid as the target compute resource, and no jobs are bound to particular resources within those grids. The result of this simpler site catalog and the improved data management in Pegasus 4.0 is that the researcher or workflow developer can more easily hand off workflows to the underlying infrastructure and worry less about where the jobs will be running.
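For reference, such a site catalog is a short XML file along the following lines. This is an abridged sketch from the Pegasus 4.x format; the exact attributes vary with the Pegasus version, and the site handles and paths shown here are hypothetical.

<sitecatalog xmlns="http://pegasus.isi.edu/schema/sitecatalog" version="4.0">
  <site handle="local" arch="x86_64" os="LINUX">
    <directory type="shared-scratch" path="/data/scratch">
      <file-server operation="all" url="file:///data/scratch"/>
    </directory>
  </site>
  <site handle="condorpool" arch="x86_64" os="LINUX">
    <!-- jobs go to the Condor pool; no particular node is named -->
    <profile namespace="pegasus" key="style">condor</profile>
    <profile namespace="condor" key="universe">vanilla</profile>
  </site>
</sitecatalog>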
Another integration challenge was how to provide better feedback to the user about the execution of the workflow. The interface to Pegasus is mostly command-line tools, and even though there are tools like pegasus-status (which provides a progress report), the question was how to provide that information back through the tool to the end user. In particular, we ran into a problem where planning a workflow was taking a long time—on the order of 30 minutes! During this time, there was no feedback to the user that the system was actually doing anything. We are still working on a solution to provide better output from the pegasus-plan operation, but once the workflow is planned and running, the hub regularly runs pegasus-status and provides feedback to the user. Once the workflow has finished, the pegasus-statistics and pegasus-analyzer commands are run automatically to provide the user with summary statistics and, if something failed, a report on what failed. Improving these integration points is an ongoing effort for the HUBzero and Pegasus developers.

Some integration issues required just simple code changes. One example is how a report generated by the pegasus-statistics tool was modified to better match the data needed inside HUBzero. The changes included adding a few columns to the job/host breakdown and providing the data in comma-separated value (CSV) format, so that the HUBzero framework could easily import the data at the end of a workflow run.
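A hub-side monitor for this purpose can be quite small. The sketch below is illustrative rather than the hub's actual code; the submit directory path and polling interval are hypothetical, and the textual output of these Pegasus commands varies across versions.

# Illustrative monitor, not the hub's actual code. Assumes the Pegasus
# command-line tools are on PATH; the paths shown are hypothetical.
import subprocess
import time

submit_dir = '/data/session/work/submit'       # where pegasus-plan put the run

while True:
    out = subprocess.check_output(['pegasus-status', submit_dir]).decode()
    print(out)                                 # relayed to the tool's GUI
    if 'Running' not in out:                   # state text varies by version
        break
    time.sleep(60)                             # poll once a minute

# After completion, produce summary statistics and a failure report.
subprocess.call(['pegasus-statistics', '-s', 'all', submit_dir])
subprocess.call(['pegasus-analyzer', submit_dir])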
VII. CONCLUSION

The integration of Pegasus into the HUBzero framework has brought the power of automated workflows to many more users. Expert users can launch a Workspace within a hub and gain immediate access to Pegasus and computational resources—without having to install or set up the Pegasus software, without having to apply for Grid credentials, and without having to learn complex Grid protocols. A simple submit command dispatches any single job or complex workflow off to remote facilities for execution. Expert users can experiment with such commands, use them in research activities, and wrap them into graphical user interfaces that are deployed as new tools for the community.

Without knowing a thing about workflows or high-throughput computing, others in the community can access such tools, enter their own input parameters, and launch a complex workflow with the press of a button. If users have questions, they can post them for the tool authors and the rest of the community. They can post ideas about improving the tool and submit bug reports when a tool fails or produces incorrect results. All of these capabilities are provided not just for one community or one site, but for the 40+ sites built on the HUBzero platform. This is bringing workflow automation and high-throughput computing to thousands of users around the world who benefit from it without ever realizing the complexity.

REFERENCES

[1] E. Deelman, G. Singh, M.-H. Su, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, K. Vahi, G.B. Berriman, J. Good, A. Laity, J.C. Jacob, D.S. Katz, "Pegasus: a framework for mapping complex scientific workflows onto distributed systems," Scientific Programming Journal, vol. 13, no. 3, pp. 219-237, 2005.
[2] D. Thain, T. Tannenbaum, M. Livny, "Distributed computing in practice: the Condor experience," Concurrency and Computation: Practice and Experience, vol. 17, no. 2-4, pp. 323-356, February-April 2005.
[3] T. Oinn, M. Greenwood, M. Addis, et al., "Taverna: lessons in creating a workflow environment for the life sciences," Concurrency Computat.: Pract. Exper., 18: 1067-1100, 2006. doi: 10.1002/cpe.993
[4] B. Ludäscher, I. Altintas, C. Berkley, et al., "Scientific workflow management and the Kepler system," Concurrency Computat.: Pract. Exper., 18: 1039-1065, 2006. doi: 10.1002/cpe.994
[5] P. Kacsuk, Z. Farkas, M. Kozlovszky, G. Hermann, A. Balasko, K. Karoczkai, I. Marton, "WS-PGRADE/gUSE generic DCI gateway framework for a large variety of user communities," J Grid Computing, 10:601-630, 2012. doi: 10.1007/s10723-012-9240-5
[6] J. Goecks, A. Nekrutenko, J. Taylor, The Galaxy Team, "Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences," Genome Biol, 11(8), R86, 2010.
[7] J.L. Brown, C.S. Ferner, T.C. Hudson, et al., "GridNexus: a grid services scientific workflow system," International Journal of Computer Information Science (IJCIS), vol. 6, no. 2, pp. 72-82, June 2005.
[8] C.A. Goble, D.C. De Roure, "myExperiment: social networking for workflow-using e-scientists," in Proceedings of the 2nd Workshop on Workflows in Support of Large-Scale Science (WORKS '07), ACM, New York, NY, USA, 2007. doi: 10.1145/1273360.1273361
[9] R. Pordes, et al., "The Open Science Grid," J. Phys.: Conf. Ser. 78, 012057, 2007.
[10] Purdue University, DiaGrid, http://diagrid.org, accessed March 14, 2013.
[11] Extreme Science and Engineering Discovery Environment (XSEDE), http://www.xsede.org, accessed March 14, 2013.
[12] G. Singh, K. Vahi, A. Ramakrishnan, G. Mehta, E. Deelman, H. Zhao, R. Sakellariou, K. Blackburn, D. Brown, S. Fairhurst, D. Meyers, G.B. Berriman, "Optimizing workflow data footprint," Scientific Programming Journal, special issue on Dynamic Computational Workflows: Discovery, Optimisation and Scheduling, 2007.
[13] S. Miles, E. Deelman, P. Groth, K. Vahi, G. Mehta, L. Moreau, "Connecting scientific data to scientific experiments with provenance," Third IEEE International Conference on e-Science and Grid Computing (e-Science 2007), Bangalore, India, December 10-13, 2007.
[14] S. Miles, P. Groth, E. Deelman, K. Vahi, G. Mehta, L. Moreau, "Provenance: the bridge between experiments and data," Computing in Science & Engineering, vol. 10, no. 3, pp. 38-46, May-June 2008.
[15] G. Klimeck, M. McLennan, S.P. Brophy, G.B. Adams III, M.S. Lundstrom, "nanoHUB.org: advancing education and research in nanotechnology," Computing in Science and Engineering, 10(5), pp. 17-23, September/October 2008.
[16] M. McLennan, R. Kennell, "HUBzero: a platform for dissemination and collaboration in computational science and engineering," Computing in Science and Engineering, 12(2), pp. 48-52, March/April 2010.
[17] Statistics reported online at http://nanohub.org/usage, accessed March 14, 2013.
[18] Apache Project, Subversion, http://subversion.apache.org, accessed March 14, 2013.
[19] OpenVZ Linux Containers, http://openvz.org, accessed March 14, 2013.
[20] T. Richardson, Q. Stafford-Fraser, K.R. Wood, A. Hopper, "Virtual network computing," IEEE Internet Computing, vol. 2, no. 1, pp. 33-38, Jan/Feb 1998. doi: 10.1109/4236.656066
[21] S. Maddineni, J. Kim, Y. El-Khamra, S. Jha, "Distributed Application Runtime Environment (DARE): a standards-based middleware framework for science gateways," J Grid Computing, 10:647-664, 2012. doi: 10.1007/s10723-012-9244-1
[22] M. Pierce, S. Marru, R. Singh, A. Kulshrestha, K. Muthuraman, "Open Grid computing environments: advanced gateway support activities," Proceedings of the 2010 TeraGrid Conference (TG '10), pp. 16:1-16:9, ACM, New York, 2010.
[23] F. McKenna, G.L. Fenves, M.H. Scott, "OpenSees: open system for earthquake engineering simulation," Pacific Earthquake Engineering Research Center, University of California, Berkeley, CA, http://opensees.berkeley.edu, 2006.
[24] F. McKenna, C. McGann, P. Arduino, J.A. Harmon, "OpenSees Laboratory," https://nees.org/resources/openseeslab, 2012, accessed March 14, 2013.
[25] C. Thompson, B. Raub, BLASTer, http://diagrid.org/resources/blastgui, accessed March 14, 2013.
[26] S. Altschul, W. Gish, W. Miller, E. Myers, D. Lipman, "Basic local alignment search tool," Journal of Molecular Biology, 215(3): 403-410, October 1990. doi:10.1016/S0022-2836(05)80360-2
[27] R. Ihaka, R. Gentleman, "R: a language for data analysis and graphics," Journal of Computational and Graphical Statistics, vol. 5, iss. 3, 1996.
[28] R. Campbell, SubmitR, https://diagrid.org/resources/submitr, accessed March 14, 2013.
[29] L.W. Nagel, D.O. Pederson, "SPICE: simulation program with integrated circuit emphasis," Electronics Research Laboratory, College of Engineering, University of California, Berkeley, 1973.
[30] J. Frey, T. Tannenbaum, I. Foster, M. Livny, S. Tuecke, "Condor-G: a computation management agent for multi-institutional grids," Cluster Computing, vol. 5, pp. 237-246, 2002.