Bringing Scientific Workflow to the Masses via Pegasus and HUBzero

Michael McLennan, Steven Clark
HUBzero Platform for Scientific Collaboration, Purdue University, West Lafayette, IN, USA
mmclennan@purdue.edu, clarks@purdue.edu

Derrick Kearney, Carol Song
Purdue University, West Lafayette, IN, USA
dsk@purdue.edu, carolxsong@purdue.edu

Frank McKenna
University of California, Berkeley, Berkeley, CA, USA
fmckenna@ce.berkeley.edu

Ewa Deelman, Mats Rynge, Karan Vahi
Information Sciences Institute, University of Southern California, Marina del Rey, CA, USA
deelman@isi.edu, rynge@isi.edu, vahi@isi.edu

Abstract—Scientific workflow managers are powerful tools for handling large computational tasks. Domain scientists find it difficult to create new workflows, so many tasks that could benefit from workflow automation are often avoided or done by hand. Two technologies have come together to bring the benefits of workflow to the masses. The Pegasus Workflow Management System can manage workflows comprising millions of tasks, all the while recording data about the execution and intermediate results so that the provenance of the final result is clear. The HUBzero platform for scientific collaboration provides a venue for building and delivering tools to researchers and educators. With the press of a button, these tools can launch Pegasus workflows on national computing infrastructures and bring results back for plotting and visualization. As a result, the combination of Pegasus and HUBzero is bringing high-throughput computing to a much wider audience.

Keywords—computation; workflow; collaboratories; user interfaces; automation

This work was supported by the US National Science Foundation through grants CBET-0941302, CMMI-0927178, OCI-1148515, and OCI-0943705.

I. INTRODUCTION

Computation has brought about a revolution in science. As software was created to analyze large amounts of data, researchers found they could gather even more data, which necessitated even more software and more computing power. Today, many analysis tasks require not just a single program or computer, but dozens of different programs running across thousands of computational nodes. Handling any one of these tasks requires a complex orchestration of moving data to appropriate nodes, finding or staging the executables, starting jobs, handling data flow dependencies, and overcoming job failures.

Over the past decade, several frameworks have been created to support the execution of such large computational tasks. The Pegasus Workflow Management System (Pegasus WMS) [1] and its workflow engine, the directed acyclic graph manager (DAGMan) within HTCondor [2], were built to manage thousands of jobs in a high-throughput computing environment. Taverna [3] graphically connects bioinformatics web services together into a coherent flow. Kepler [4] also provides graphical user interfaces for workflow composition and supports different workflow execution models. The SHIWA simulation platform uses the WS-PGRADE/gUSE technology to provide workflow interoperability across a number of systems [5]. More recently, Galaxy [6] has gained popularity in the bioinformatics domain, particularly because it tailors its environment to that community.

Once a particular workflow has been established, it is relatively easy for others to execute it. However, creating new workflows has proven to be an enormous challenge—especially for domain scientists with very little background in computer science. Creating a workflow is a lot like writing a program that calls standard library functions. The author not only must have a deep understanding of the science but also must understand the interface for each component of the workflow, including the data files required and the output files produced. The output of one component may need format conversions or additional processing before it can be used by another component. The author must know what programs are available to perform conversions, or in some cases, the author must write small conversion programs (often called "shims") to complete the workflow. Components of the workflow may have limitations or may fail when given certain combinations of input parameters. The author must understand that too, and work around such issues.

Several projects have tried to simplify the creation of workflows by building drag-and-drop editors. Taverna [3], Kepler [4], Galaxy [6], and GridNexus [7] all allow users to drag blocks onto a canvas and connect inputs and outputs together in a flow. But while these applications simplify the task of expressing a workflow, they do not solve the inherent problems of understanding components, building shims, or working around limitations. The Taverna project has taken one step further and created a web site at MyExperiment.org for publishing and sharing workflows [8].

Our approach takes workflow yet another step further by delivering live simulation tools embodying workflows to the masses via a web browser. These tools are powered by Pegasus WMS running on Open Science Grid [9], DiaGrid [10], and XSEDE resources [11]. The tools are created by the few skilled users within a community, but wrapped in graphical user interfaces with integrated plotting and visualization, and delivered to the masses via the HUBzero software platform.

There are now more than 40 hubs covering a wide range of scientific disciplines, including nanotechnology, earthquake mitigation, manufacturing, healthcare, pharmaceuticals, volcanoes, and climate change. There is also a hub for the HUBzero community at http://hubzero.org, which includes a list of known hubs. All together, these hubs have served more than 850,000 visitors from 172 countries worldwide—not over all time, but during the calendar year 2012 alone. Pegasus and HUBzero act as a conduit for these users, giving them fingertip access to national Grid computing infrastructures.
II. THE PEGASUS WORKFLOW MANAGEMENT SYSTEM

The Pegasus Workflow Management System (or Pegasus) manages the execution of scientific workflows on desktops, private clusters, campus clusters, grids, and academic and commercial clouds. It automatically locates the input data and computational resources needed for workflow execution; thus, it allows scientists to specify their workflows at a high level of abstraction (devoid of resource information), maps this description onto the available execution resources, and then reliably executes the resulting plan. As a result, Pegasus workflows are easy to compose and are portable across heterogeneous cyberinfrastructure.

Pegasus is capable of executing workflows on a single resource or across resources. Even a single workflow description can be distributed and executed across the cyberinfrastructure. Pegasus has been used to run workflows ranging from just a few computational tasks up to millions. When errors occur, Pegasus tries to recover when possible by retrying tasks, by retrying the entire workflow, by providing workflow-level checkpointing, by re-mapping portions of the workflow, by trying alternative data sources for staging data, and, when all else fails, by providing a rescue workflow containing a description of only the work that remains to be done [1]. Thus, it does what it can to shield the user from errors and intermittent issues with the computing resources. It cleans up storage as the workflow is executed so that data-intensive workflows have enough space to execute on storage-constrained resources [12]. Pegasus keeps track of what has been done (provenance), including the locations of data used and produced, and which software was used with which parameters [13],[14].

In order to support the abstract workflow specifications, which let scientists concentrate on their science rather than on the operational aspects of the cyberinfrastructure, mapping technologies are needed to automatically interpret and map user-defined workflows onto the available resources. The workflow mapping process involves finding the appropriate software, data, and computational resources required for workflow execution. The mapping process can also involve restructuring the workflow to optimize performance and adding transformations for data management and provenance information generation. DAGMan, Pegasus's workflow engine, relies on the resources (compute, storage, and network) defined in the executable workflow to perform the necessary actions. Individual workflow tasks are managed by a task scheduler (Condor), which supervises their execution on local and remote resources.
III. HUBS FOR SCIENTIFIC COMMUNITIES

A. Building a Community on nanoHUB.org

In 2002, the US National Science Foundation created the Network for Computational Nanotechnology (NCN), a collection of universities engaged in simulation and modeling for the nanotechnology community [15]. NCN established a web presence at nanoHUB.org and offered an array of simulation and modeling tools, along with seminars, tutorials, courses, and other supporting materials. Over the years, nanoHUB.org evolved into a software platform supporting collaboration in private group areas, discussion forums, ratings and reviews, and most importantly, deployment of new simulation tools via its unique middleware. In 2007, the underlying software for nanoHUB.org was spun off as a separate project, the HUBzero® Platform for Scientific Collaboration [16]. Since then, it has been used to create similar hubs for many other scientific areas—all supported by the same underlying middleware and content management system.

nanoHUB grew from 10 tools and 1,000 users in 2002 to more than 260 tools and 250,000 users today [17]. User contributions were key to that growth. Each new tool, seminar, or tutorial brought new users to the site, some of whom contributed their own new content. This virtuous cycle was fueled by an open content management system that enabled users to upload and deploy their own content according to their own schedule, with minimal intervention from the nanoHUB team.

Uploading and publishing a single document or PowerPoint presentation is one thing; uploading and deploying a simulation tool is quite another. Tools have hundreds of individual source code files; they must be compiled and tested within the hub environment; and all of this must be done by untrusted users with minimal intervention. The HUBzero infrastructure was designed to support this activity. Researchers upload their code into private project areas with a Subversion [18] repository for source code control; they compile, test, and modify their code within a secure execution container managed by OpenVZ [19] running on a cloud of execution hosts; and they approve tools for deployment via a Web-based content management system. All of this enables their code to be separated from the system and from other users, and yet deployed in a secure and scalable manner.

End users find a tool on a hub and press a button to launch it. The tool is launched in a Linux/X11 environment within a secure container on the "cloud of execution hosts," and the graphical output from the session is piped through the Web to the user's browser via VNC [20]. The result is a live, interactive simulation tool running in a secure environment, capable of running workflows on national computing infrastructure, and accessible to any user via a Web browser without any special software installation.

There are many other gateway frameworks, including the Distributed Application Runtime Environment [21], Open Grid Computing Environments [22], and WS-PGRADE/gUSE [5], but none provide direct access to interactive tools and scientific visualization. This is a distinct advantage of the HUBzero platform.

Many different users work together within each hub community, and their interactions are mediated by the HUBzero platform. A hub owner creates a new hub and engages others in the community to seed the site with initial content. Users browse the site and take advantage of interesting content, perhaps by watching a seminar or running a simulation tool. Users also contribute their own content to the site, including new workflows and simulation tools. Hub administrators help to compile and stage new simulation tools and ultimately approve all content before it is released.
B. NEES.org and OpenSees Laboratory

Other hubs leverage the same HUBzero infrastructure to support different toolsets for their own communities. In 2009, the NSF George E. Brown Network for Earthquake Engineering Simulation (NEES) moved operations to Purdue and created a hub for the civil engineering community. Today, NEES.org offers more than 65 simulation and data analysis tools used to understand the damage caused by earthquakes and improve building design. One of these tools leverages an open source code, the Open System for Earthquake Engineering Simulation (OpenSees) [23], to provide a collection of utilities for structural and geotechnical engineers.

One of the utilities within this OpenSees Laboratory tool [24] is the Moment Frame Earthquake Reliability Analysis. This tool is intended to demonstrate to engineers the importance of uncertainty in their models and the need to use probabilistic methods when determining important response measures. The tool prompts the user for the material properties (including probability distributions), the building layout, and a list of earthquake ground motion records. On submission, the tool automatically builds and executes a Pegasus workflow. The workflow manages several hundred to tens of thousands of separate OpenSees jobs on the Open Science Grid (OSG), which together perform a reliability analysis of the building frame under seismic loads, as shown in Fig. 1. The resulting plots show the variation in roof displacement, interstory drift, and base shear given the uncertainty in the materials for each of the earthquake motions specified. A single red dot for each earthquake shows the engineer what would happen if only the mean value had been used for the material properties for that earthquake. The advantage of using Pegasus and the OSG for such a workflow is that Pegasus recognizes that most of the computation can be performed in parallel, and the OSG permits concurrent job submission and execution. As a consequence, results can be obtained in minutes or hours instead of the hours, weeks, or even months that a series of sequential analyses would require.

Fig. 1. The OpenSees Laboratory on NEES.org, like all tools managed by HUBzero, uses a graphical user interface to prompt for input values and plot results. Tool sessions run within secure containers on an execution host, and can submit individual runs or Pegasus workflows out to remote sites.

To date, OpenSees Laboratory has served 950 simulation users from the US, China, Canada, Italy, Japan, and other countries all over the world. About a third of these users come from academic institutions. All of them are driving OpenSees through a graphical user interface, perhaps without even knowing if and when they are launching complex workflows.

C. DiaGrid.org and BLASTer, SubmitR

DiaGrid is a high-throughput computing resource utilizing the Condor system, with more than 50,000 cores from machines at Purdue University and 9 other campuses. Since its inception in 2005, DiaGrid has handled more than 90,000,000 simulation jobs.

In 2011, Purdue established a hub at DiaGrid.org to provide more widespread access to the underlying DiaGrid execution pool. DiaGrid.org offers a tool called BLASTer [25], which can be used to run BLAST (Basic Local Alignment Search Tool) [26], a bioinformatics tool for analyzing DNA sequence data. Users upload a query file containing one or more nucleotide sequences and search against standard databases from the National Center for Biotechnology Information (NCBI), or against their own custom databases. The run for a single sequence may take anywhere from 10 seconds to 30 minutes, but many researchers bundle thousands of searches into a single request. Performed serially, such a search might take several days or weeks of computation. BLASTer automatically divides the query file into chunks and creates a Pegasus workflow to distribute the search over hundreds of DiaGrid nodes, so the end-to-end execution time can be shortened from weeks to a matter of hours. Shared storage allows the NCBI standard databases, updated regularly, to be available to all users without requiring file transfer, as shown in Fig. 2, thereby further reducing the time to results.

Fig. 2. The BLASTer tool on DiaGrid.org manages the execution of BLAST programs on DiaGrid. NCBI sequence databases are kept up to date on shared storage, so they do not need to be transferred for each run.
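The chunking step at the heart of BLASTer is straightforward. The Python sketch below is an illustration of the technique, not BLASTer's actual code: it splits a multi-sequence FASTA query file into chunks of a fixed number of sequences, and each chunk file would then become the input of one BLAST job in the generated Pegasus workflow. The chunk size and file names are hypothetical.

# Illustrative sketch only -- not BLASTer's actual code.
# Split a FASTA query file into chunks of N sequences; each chunk
# becomes the input file for one BLAST job in the Pegasus workflow.

def write_chunk(lines, index):
    with open('query.%03d.fa' % index, 'w') as f:   # hypothetical naming
        f.writelines(lines)

def split_fasta(path, seqs_per_chunk=100):
    chunk, nseq, nchunk = [], 0, 0
    for line in open(path):
        if line.startswith('>'):                    # '>' starts a new sequence
            if nseq > 0 and nseq % seqs_per_chunk == 0:
                write_chunk(chunk, nchunk)          # flush a full chunk
                chunk, nchunk = [], nchunk + 1
            nseq += 1
        chunk.append(line)
    if chunk:
        write_chunk(chunk, nchunk)                  # flush the remainder
    return nchunk + 1                               # number of chunk files

if __name__ == '__main__':
    n = split_fasta('query.fa')
    print('%d chunks ready for one BLAST job each' % n)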
DiaGrid.org offers a similar tool for statistical analysis scripts written in the R programming language [27]. It is quite common for researchers to run a particular script over a wide range of input parameters to explore the parameter space or to perform a sensitivity analysis. The SubmitR tool [28] takes a script, data files, and a specification of parameter values, and automatically builds and executes a Pegasus workflow to run a large number of R jobs in parallel on DiaGrid. Again, this appeals to the domain scientist who is quite familiar with R but knows nothing about workflows, Condor, or DiaGrid.

IV. CREATING SCIENTIFIC WORKFLOWS

A. Job Submission via "submit"

The integration of Pegasus and HUBzero certainly makes it easier for end users to launch workflows, but it also makes it easier for tool developers to build and test new workflows. Each hub comes with a tool called a Workspace, which is a full-featured Linux desktop accessible via the Web. Tool developers use this to upload, compile, and test their code before deploying a tool. Experienced researchers also use this to get past the constraints of a graphical interface—to edit files, write simple scripts, and dispatch computational jobs.

HUBzero provides a command called "submit," which is used within the Workspace to dispatch jobs, as shown in Fig. 3. In this example, the command line "spice3 -b circuit" represents the research code that is being executed. Electrical engineers will recognize this as the SPICE3F4 program created back in the 1970s at UC Berkeley and still in use today [29]. But the approach we describe would apply to any such research code amenable to command-line execution. The command spice3 is the executable, the flag -b means to execute in batch mode, and the file circuit contains the netlist representing the electrical circuit being simulated.

(a) spice3 -b circuit
(b) submit -v DiaGrid spice3 -b circuit
(c) submit -p @@res=100,1k,10k spice3 -b @:circuit
(d) submit -p @@res=100,1k,10k -p @@cap=1u,10u,100u spice3 -b @:circuit
(e) submit -d indata.csv spice3 -b @:circuit
(f) submit -p @@num=1:100 spice3 -b circuit@@num
(g) submit -p @@file=glob:circuit* spice3 -b @@file
(h) submit pegasus-plan --dax myworkflow.dax

Fig. 3. HUBzero's submit command makes it easy to send jobs off to remote sites, and includes support for parameter sweeps and Pegasus workflows.

When invoked directly on the command line as in Fig. 3(a), the spice3 program runs locally within the execution host managing the tool session (see Fig. 1). However, prefixing the usual command line with submit will send the job off to remote execution venues, such as DiaGrid or the Open Science Grid. The "-v DiaGrid" arguments shown in Fig. 3(b) request that the job be sent to DiaGrid. The submit command will automatically transfer files such as circuit that are required for simulation. It will speak the appropriate protocol—currently Condor, Condor-G, LoadLeveler, LSF, SGE, PBS, or SLURM—to queue the job. And it will monitor progress and transfer results back to the execution host once the run is complete. It runs all jobs with a common credential owned by the hub, so individual users need not have an account or any credentials for the remote venue. The hub tracks all jobs, including those run on remote venues, so that if a security incident arises, a complete report can be generated for any affected venue. The same accounting is used to produce monthly reports of the overall simulation usage for each hub.
B. Simple Workflows: Parameter Sweeps

The submit command can also manage parameter sweeps with a very simple command line, as shown in Fig. 3(c). The -p flag indicates that the next argument is a parameter specification. In this case, the parameter @@res takes three values: 100, 1k, and 10k. The @: prefix before the circuit file indicates that this file should be processed with parameter substitutions. In other words, the submit command treats circuit as a template file and substitutes all occurrences of @@res with a particular parameter value. In this case, it produces three separate input files—one for each of the parameter values. Then, it automatically builds a Pegasus workflow to manage the jobs and dispatches the workflow off to an appropriate venue. Since the optional -v argument was not specified in this case, the submit command will automatically select a venue that supports Pegasus job submission and data transfer interfaces, such as DiaGrid or Open Science Grid.

A simulation run may have multiple parameters. In Fig. 3(d), the parameter @@res is swept through three values, and the parameter @@cap is swept through three values. The submit command creates 9 different input files representing all possible combinations of these two parameters, builds a Pegasus workflow to manage the jobs, and then dispatches the workflow to an appropriate venue.

Instead of specifying input parameters on the command line, the values for all combinations of parameters can be stored in a comma-separated value (CSV) file. In Fig. 3(e), the arguments "-d indata.csv" indicate that parameter values should be loaded from the file indata.csv. The first line should contain the parameter names for the columns, such as "@@res,@@cap" in this example. Each of the remaining lines represents the parameter values for a single job, such as "100,1u" or "10k,100u". Passing parameters in this way is useful when the desired values are not a simple combination of values, but perhaps a series of values generated by a genetic optimization algorithm or by a Latin hypercube sampler.
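For concreteness, consider a hypothetical circuit template containing the line

    R1 1 2 @@res

The command in Fig. 3(c) would generate three input files in which this line reads "R1 1 2 100", "R1 1 2 1k", and "R1 1 2 10k", one per job. Similarly, an indata.csv for Fig. 3(e) might contain (the values shown are illustrative):

    @@res,@@cap
    100,1u
    10k,100u

yielding one job per row, with both parameters substituted into the template at once.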
Instead of relying on template files and parameter substitutions, the user may generate a series of separate input files. For example, suppose a user has 100 different circuit files to evaluate. The syntax shown in Fig. 3(f) supports this scenario. The arguments "-p @@num=1:100" define a parameter @@num that takes on integer values from 1 to 100. In this case, the circuit input file is specified as circuit@@num, so the submit command builds a Pegasus workflow with 100 different command lines, substituting the value of @@num for each job to produce commands like "spice3 -b circuit1", "spice3 -b circuit2", and so forth, referencing the series of numbered files that the user created by hand for this run.

Input files can also be recognized using glob-style pattern matching. The arguments "-p @@file=glob:circuit*" in Fig. 3(g) define a parameter named @@file that takes on all values matching the glob-style pattern circuit*. With this syntax, the files could be distinguished by different numbers, different letters, or other naming conventions, but all files that submit can find matching circuit* will be simulated as separate cases, with @@file representing the whole name of each matching file, such as circuit1, circuit-2b, circuitXYZ, etc.

C. Customized Workflows in Pegasus

More experienced users can learn how to create their own customized Pegasus workflows. Each workflow is expressed as a high-level directed acyclic graph (DAG) called a DAX (for DAG XML description). Each node in the DAX represents a job in the workflow, and the edges between nodes represent data files that flow from one job to the next. The DAX is similar to the Condor DAGMan file [30] but at a higher level of abstraction, so that it can be targeted to different execution environments.

Each DAX is ultimately expressed as an XML file. Writing that file by hand, however, can be tedious and impractical. Many developers prefer to express their workflow as a small program written in Python, Java, or Perl. Pegasus provides Application Programming Interfaces (APIs) in these languages to help construct the needed XML. The workflow designer builds up the nodes and edges of a DAX object, and the API is used to write out the object in XML format. Running the program, therefore, produces a DAX file expressing the entire workflow, suitable for submission.
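As a minimal sketch of such a program, assuming the DAX3 Python API distributed with Pegasus (the job and file names here are hypothetical), a 100-case study with a final collection step could be expressed as follows:

#!/usr/bin/env python
# Minimal sketch using the Pegasus DAX3 Python API; job and file
# names are hypothetical. Writes the workflow XML to stdout.
import sys
from Pegasus.DAX3 import ADAG, Job, File, Link

dax = ADAG("example-sweep")              # top-level workflow object

post = Job(name="collect")               # one job to gather all results
dax.addJob(post)

for i in range(100):
    inp = File("circuit%d" % i)          # input deck for this case
    out = File("results%d.out" % i)      # raw output of this case
    job = Job(name="spice3")
    job.addArguments("-b", inp)
    job.uses(inp, link=Link.INPUT)       # declare data flow so Pegasus
    job.uses(out, link=Link.OUTPUT)      # can stage files in and out
    dax.addJob(job)
    post.uses(out, link=Link.INPUT)
    dax.depends(parent=job, child=post)  # collect runs after each case

dax.writeXML(sys.stdout)                 # emit the DAX for pegasus-plan

Redirecting this script's output to myworkflow.dax produces exactly the kind of file that the command in Fig. 3(h) submits.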
The submit command described earlier is also used to launch Pegasus workflows, as shown in Fig. 3(h). This time the programs being launched (such as spice3) do not appear on the command line; instead, they are encoded within the DAX description of the workflow, which in this case is a file named myworkflow.dax. The submit command invokes another program called pegasus-plan, which reads the DAX, "compiles" the workflow for a specific execution venue, then launches the workflow and manages execution.

For example, the Moment Frame Earthquake Reliability Analysis tool, discussed in Section III.B, uses both the Python API provided by Pegasus and the submit command. When the user launches a simulation, the tool invokes a shell script that first runs a Python script (opensees-dax.py) to create the XML file and then invokes the submit command with the pegasus-plan program and the XML file to run the workflow on the OSG, as follows:

#!/bin/sh
python opensees-dax.py --nMat=$1 --nMotion=$2 > dax.xml
submit pegasus-plan --dax dax.xml

The workflow compilation or "planning" step is the strength of the Pegasus approach. It optimizes the workflow by removing any branches where partial results are in place and where dependencies are already satisfied. It also augments the workflow, adding tasks to stage data files in and out of remote resources and to clean up unnecessary results after execution. And it can partition the workflow and cluster short-running jobs together, so that they run more efficiently on a single compute node. It does all of this at a high level, freeing the user from having to request or identify specific resources. During execution, Pegasus dispatches jobs, monitors progress, and automatically retries failing jobs so that sporadic failures do not derail the entire workflow. If a workflow does fail (for example, if the venue goes down, or if a particular job triggers a bug in the code), it can be corrected and restarted by the user. Pegasus will recognize partial results and pick up where it left off to complete the workflow.

V. BUILDING AND DEPLOYING APPLICATIONS

A. Graphical User Interfaces and the Rappture Toolkit

Researchers can build their own graphical user interfaces on top of the basic submit capability, so that other (perhaps less experienced) users can access a simplified interface and launch powerful workflows for targeted applications, as described earlier in Section III. Such tools can be created using MATLAB®, Java, Qt, GTK+, wxWidgets, or any other toolkit that runs under Linux/X11. For example, the interfaces for BLASTer and SubmitR were created using Java and Python, respectively.

HUBzero's Rappture toolkit provides an easy way to create such graphical interfaces. Rappture comes pre-installed within the Workspace tool and includes an interactive "builder" application for new tools. Researchers specify the inputs and outputs for each new tool by dragging objects from a palette of available controls and dropping them into an interface specification tree. Each object is given a label, a description, a default value, and other attributes, such as units of measure for numeric values. The builder provides a preview of the resulting interface and generates errors and warnings if any elements are missing or incomplete. The builder saves each interface in an XML description file, and it can also generate a skeleton for the main program of the tool in a variety of programming languages, including C/C++, Fortran, MATLAB, Octave, Java, Python, Perl, R, Ruby, and Tcl/Tk. The researcher modifies the body of the skeleton code to include the core of the tool, then runs the rappture command to produce the graphical interface.

However a graphical interface is created, it acts as an intermediary for the user, gathering input values, launching simulation runs, and plotting results. Remote jobs and workflow submissions are handled via something like the C language system() function, which forks and executes a separate shell process to handle a command line string with the submit command shown earlier in Fig. 3. Scripting languages are particularly well suited for this task, so it is quite common to build an interface tool in Python, for example, which calls submit to dispatch a complex workflow built from a series of C/C++ and Fortran programs.
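The core of such a Python tool might look like the following sketch. It assumes the Rappture Python bindings available within a hub tool session; the Rappture element paths, file names, and the generated netlist are hypothetical, and it assumes the remote job's output is staged back as a local file.

# Sketch of a Python interface tool; Rappture element paths, file
# names, and the generated netlist are hypothetical.
import subprocess
import sys
import Rappture

io = Rappture.library(sys.argv[1])            # driver file passed in by the GUI
res = io.get('input.number(res).current')     # resistance entered by the user

with open('circuit', 'w') as f:               # render a trivial input deck
    f.write('* generated deck\nV1 1 0 1\nR1 1 2 %s\nC1 2 0 1u\n.end\n' % res)

# Dispatch to a remote venue. submit transfers the input file, queues
# the job, and blocks until results are staged back to the session.
with open('spice.out', 'w') as out:
    status = subprocess.call(['submit', '-v', 'DiaGrid',
                              'spice3', '-b', 'circuit'], stdout=out)

if status == 0:                               # relay the output to the GUI
    io.put('output.string(log).current', open('spice.out').read())
Rappture.result(io)                           # signal completion to Rappture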
B. Tool Publication Process

HUBzero includes a powerful content management system for uploading and deploying many different types of content, including seminars, tutorials, teaching materials, and most importantly, computational tools. Any researcher within a particular hub community can click on the Upload link, fill out a form describing their new tool, and get immediate access to a private project area complete with a Subversion source code repository and a wiki area for project notes. The researcher can launch the Workspace tool, check out the latest source code, and compile and test within the workspace environment. The researcher can build Pegasus workflows and test them right at the command line by typing the command shown earlier in Fig. 3(h). Once that part is working, the researcher can create a graphical interface using Rappture or any other toolkit, and embed the same submit command within the code of their new tool.

Once the tool is working properly, the researcher can visit the page representing their tool contribution and click a link saying, "My code is committed, working, and ready to be installed." This signals the hub administrators to check out the latest code, compile it, and install it in the official /apps directory under a subdirectory with the tool name and its revision number from the source code control system.

Once a tool is installed, the researcher receives a message to begin testing. Clicking a Launch Tool button brings up a preview of the final tool, exactly as other users of the hub would see it. Once the tool has been verified, the researcher can click a link saying, "My tool is working properly. I approve it."

That prompts the hub administrators to take one last look, verify that the tool is indeed functional and has an adequate description page, and then move the tool to the "published" state, where it becomes available to other users according to the restrictions set by the tool authors. A tool can be open to the world, protected by export control, or accessible only by a certain group of users. If a tool is published as Open Source, an archive of the source code is presented for download alongside the Launch Tool button, and the terms of the Open Source license are clearly displayed on the tool page. All tools have a unique Digital Object Identifier (DOI), so they can be cited within academic publications.

C. Ongoing Tool Support

The publication of a tool is not the end of the story. Other users may encounter a bug in the tool and file a support ticket on the hub. Any ticket filed on a live tool session page is automatically routed to the tool authors and becomes their responsibility to fix. The authors receive email about the ticket, and they can see it listed on the My Tickets module within their My Account dashboard page. They can update the ticket to communicate with the user, check out the code within a Workspace, commit a fix, and close the ticket. When the code is ready for another release, the authors click a link saying, "I've made changes. Please install the latest code for testing and approval." This takes the tool back to the "updated" state, causing hub administrators to stage the latest version and letting the tool authors test and approve it, leading to an updated tool publication.

Each tool also has its own question/answer forum for community discussion. Another user might not understand the physics within the tool and post a question. One of the tool authors, or perhaps another user, might follow up with an answer. Researchers can have threaded discussions within the question/answer area, and they earn points for participation. The points can be redeemed for merchandise on some hubs, traded for other favors, or used as bragging rights.

End users may like a tool but have requests for improvement, such as support for extra model parameters or material types. They can post such ideas on the wish list for each tool. Other users can vote the ideas up or down. The tool authors can evaluate each wish on a scale of importance and effort, so that wishes can be sorted in order of priority. Ideas that are deemed important and low effort bubble up to the top of the list, while those that are unimportant and high effort sink to the bottom. When any wish is granted, the original user is notified, and the tool author may earn points for fulfilling the request.
VI. INTEGRATION CHALLENGES

During the integration work, we encountered several challenges, including data management, feedback on workflow progress in the user interface, and providing workflow metrics from Pegasus to the HUBzero framework.

When the integration effort started, the most recent Pegasus release was version 3. The data management in that version was still based on a model of high-performance computing (HPC) systems, with assumptions such as a shared parallel filesystem being mounted on the compute nodes used for the workflow. Under this data management model, when a workflow was submitted, a particular compute resource had to be chosen, the data had to be transferred in to the shared filesystem, and after the jobs were run, the output data had to be staged out. These steps were handled by Pegasus, but the fact that a resource had to be chosen up front meant that either the hub or the end user would have to select a resource to run on. In the case of high-throughput workloads, for which it does not matter where the jobs run, selecting a resource was not a very user-friendly solution, since at any given moment a particular resource might be heavily loaded or broken, while other venues might be working fine. This particular integration problem went away with the release of Pegasus 4.0, which added two new data management approaches: workloads can either use Condor IO for file transfers, or a shared staging storage element (such as a GridFTP or S3 server) for storing intermediate data products. Workflows could then span multiple compute resources. Today, most of the hub workflows use a simple site catalog (a catalog describing the various available resources and the services they provide) listing either DiaGrid or Open Science Grid as the target compute resource, and no jobs are bound to particular resources within those grids. The result of this simpler site catalog and the improved data management in Pegasus 4.0 is that the researcher or workflow developer can more easily hand off workflows to the underlying infrastructure and worry less about where the jobs will be running.
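For reference, such a site catalog is a short XML file along the following lines. This is an abridged sketch from the Pegasus 4.x format; the exact attributes vary with the Pegasus version, and the site handles and paths shown here are hypothetical.

<sitecatalog xmlns="http://pegasus.isi.edu/schema/sitecatalog" version="4.0">
  <site handle="local" arch="x86_64" os="LINUX">
    <directory type="shared-scratch" path="/data/scratch">
      <file-server operation="all" url="file:///data/scratch"/>
    </directory>
  </site>
  <site handle="condorpool" arch="x86_64" os="LINUX">
    <!-- jobs go to the Condor pool; no particular node is named -->
    <profile namespace="pegasus" key="style">condor</profile>
    <profile namespace="condor" key="universe">vanilla</profile>
  </site>
</sitecatalog>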
Another integration challenge was how to provide better feedback to the user about the execution of the workflow. The interface to Pegasus is mostly command-line tools, and even though there are tools like pegasus-status (which provides a progress report), the question was how to provide that information back through the tool to the end user. In particular, we ran into a problem where planning a workflow was taking a long time—on the order of 30 minutes! During this time, there was no feedback to the user that the system was actually doing anything. We are still working on a solution to provide better output from the pegasus-plan operation, but once the workflow is planned and running, the hub regularly runs pegasus-status and provides feedback to the user. Once the workflow has finished, the pegasus-statistics and pegasus-analyzer commands are run automatically to provide the user with summary statistics and, if something failed, a report on what failed. Improving these integration points is an ongoing effort for the HUBzero and Pegasus developers.

Some integration issues required just simple code changes. One example is how a report generated by the pegasus-statistics tool was modified to better match the data needed inside HUBzero. The changes included adding a few columns to the job/host breakdown and providing the data in comma-separated value (CSV) format, so that the HUBzero framework could easily import the data at the end of a workflow run.
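A hub-side monitor for this purpose can be quite small. The sketch below is illustrative rather than the hub's actual code; the submit directory path and polling interval are hypothetical, and the textual output of these Pegasus commands varies across versions.

# Illustrative monitor, not the hub's actual code. Assumes the Pegasus
# command-line tools are on PATH; the paths shown are hypothetical.
import subprocess
import time

submit_dir = '/data/session/work/submit'       # where pegasus-plan put the run

while True:
    out = subprocess.check_output(['pegasus-status', submit_dir]).decode()
    print(out)                                 # relayed to the tool's GUI
    if 'Running' not in out:                   # state text varies by version
        break
    time.sleep(60)                             # poll once a minute

# After completion, produce summary statistics and a failure report.
subprocess.call(['pegasus-statistics', '-s', 'all', submit_dir])
subprocess.call(['pegasus-analyzer', submit_dir])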
VII. CONCLUSION

The integration of Pegasus into the HUBzero framework has brought the power of automated workflows to many more users. Expert users can launch a Workspace within a hub and gain immediate access to Pegasus and computational resources—without having to install or set up the Pegasus software, without having to apply for Grid credentials, and without having to learn complex Grid protocols. A simple submit command dispatches any single job or complex workflow off to remote facilities for execution. Expert users can experiment with such commands, use them in research activities, and wrap them into graphical user interfaces that are deployed as new tools for the community.

Without knowing a thing about workflows or high-throughput computing, others in the community can access such tools, enter their own input parameters, and launch a complex workflow with the press of a button. If users have questions, they can post them for the tool authors and the rest of the community. They can post ideas about improving the tool and submit bug reports when a tool fails or produces incorrect results. All of these capabilities are provided not just for one community or one site, but for the 40+ sites built on the HUBzero platform. This is bringing workflow automation and high-throughput computing to thousands of users around the world who benefit from it without ever realizing the complexity.

REFERENCES

[1] E. Deelman, G. Singh, M.-H. Su, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, K. Vahi, G.B. Berriman, J. Good, A. Laity, J.C. Jacob, D.S. Katz, "Pegasus: a framework for mapping complex scientific workflows onto distributed systems," Scientific Programming Journal, vol. 13, no. 3, pp. 219-237, 2005.
[2] D. Thain, T. Tannenbaum, M. Livny, "Distributed computing in practice: the Condor experience," Concurrency and Computation: Practice and Experience, vol. 17, no. 2-4, pp. 323-356, February-April 2005.
[3] T. Oinn, M. Greenwood, M. Addis, et al., "Taverna: lessons in creating a workflow environment for the life sciences," Concurrency Computat.: Pract. Exper., 18: 1067-1100, 2006. doi: 10.1002/cpe.993
[4] B. Ludäscher, I. Altintas, C. Berkley, et al., "Scientific workflow management and the Kepler system," Concurrency Computat.: Pract. Exper., 18: 1039-1065, 2006. doi: 10.1002/cpe.994
[5] P. Kacsuk, Z. Farkas, M. Kozlovszky, G. Hermann, A. Balasko, K. Karoczkai, I. Marton, "WS-PGRADE/gUSE generic DCI gateway framework for a large variety of user communities," J Grid Computing, 10:601-630, 2012. doi: 10.1007/s10723-012-9240-5
[6] J. Goecks, A. Nekrutenko, J. Taylor, The Galaxy Team, "Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences," Genome Biol, 11(8), R86, 2010.
[7] J.L. Brown, C.S. Ferner, T.C. Hudson, et al., "GridNexus: a grid services scientific workflow system," International Journal of Computer Information Science (IJCIS), vol. 6, no. 2, pp. 72-82, June 2005.
[8] C.A. Goble, D.C. De Roure, "myExperiment: social networking for workflow-using e-scientists," in Proceedings of the 2nd Workshop on Workflows in Support of Large-Scale Science (WORKS '07), ACM, New York, NY, USA, 2007. doi: 10.1145/1273360.1273361
[9] R. Pordes, et al., "The Open Science Grid," J. Phys.: Conf. Ser. 78, 012057, 2007.
[10] Purdue University, DiaGrid, http://diagrid.org, accessed March 14, 2013.
[11] Extreme Science and Engineering Discovery Environment (XSEDE), http://www.xsede.org, accessed March 14, 2013.
[12] G. Singh, K. Vahi, A. Ramakrishnan, G. Mehta, E. Deelman, H. Zhao, R. Sakellariou, K. Blackburn, D. Brown, S. Fairhurst, D. Meyers, G.B. Berriman, "Optimizing workflow data footprint," Scientific Programming Journal, special issue on Dynamic Computational Workflows: Discovery, Optimisation and Scheduling, 2007.
[13] S. Miles, E. Deelman, P. Groth, K. Vahi, G. Mehta, L. Moreau, "Connecting scientific data to scientific experiments with provenance," Third IEEE International Conference on e-Science and Grid Computing (e-Science 2007), Bangalore, India, December 10-13, 2007.
[14] S. Miles, P. Groth, E. Deelman, K. Vahi, G. Mehta, L. Moreau, "Provenance: the bridge between experiments and data," Computing in Science & Engineering, vol. 10, no. 3, pp. 38-46, May-June 2008.
[15] G. Klimeck, M. McLennan, S.P. Brophy, G.B. Adams III, M.S. Lundstrom, "nanoHUB.org: advancing education and research in nanotechnology," Computing in Science and Engineering, 10(5), pp. 17-23, September/October 2008.
[16] M. McLennan, R. Kennell, "HUBzero: a platform for dissemination and collaboration in computational science and engineering," Computing in Science and Engineering, 12(2), pp. 48-52, March/April 2010.
[17] Statistics reported online at http://nanohub.org/usage, accessed March 14, 2013.
[18] Apache Project, Subversion, http://subversion.apache.org, accessed March 14, 2013.
[19] OpenVZ Linux Containers, http://openvz.org, accessed March 14, 2013.
[20] T. Richardson, Q. Stafford-Fraser, K.R. Wood, A. Hopper, "Virtual network computing," IEEE Internet Computing, vol. 2, no. 1, pp. 33-38, Jan/Feb 1998. doi: 10.1109/4236.656066
[21] S. Maddineni, J. Kim, Y. El-Khamra, S. Jha, "Distributed Application Runtime Environment (DARE): a standards-based middleware framework for science gateways," J Grid Computing, 10:647-664, 2012. doi: 10.1007/s10723-012-9244-1
[22] M. Pierce, S. Marru, R. Singh, A. Kulshrestha, K. Muthuraman, "Open Grid computing environments: advanced gateway support activities," Proceedings of the 2010 TeraGrid Conference (TG '10), pp. 16:1-16:9, ACM, New York, 2010.
[23] F. McKenna, G.L. Fenves, M.H. Scott, "OpenSees: open system for earthquake engineering simulation," Pacific Earthquake Engineering Research Center, University of California, Berkeley, CA, http://opensees.berkeley.edu, 2006.
[24] F. McKenna, C. McGann, P. Arduino, J.A. Harmon, "OpenSees Laboratory," https://nees.org/resources/openseeslab, 2012, accessed March 14, 2013.
[25] C. Thompson, B. Raub, BLASTer, http://diagrid.org/resources/blastgui, accessed March 14, 2013.
[26] S. Altschul, W. Gish, W. Miller, E. Myers, D. Lipman, "Basic local alignment search tool," Journal of Molecular Biology, 215(3): 403-410, October 1990. doi:10.1016/S0022-2836(05)80360-2
[27] R. Ihaka, R. Gentleman, "R: a language for data analysis and graphics," Journal of Computational and Graphical Statistics, vol. 5, iss. 3, 1996.
[28] R. Campbell, SubmitR, https://diagrid.org/resources/submitr, accessed March 14, 2013.
[29] L.W. Nagel, D.O. Pederson, "SPICE: simulation program with integrated circuit emphasis," Electronics Research Laboratory, College of Engineering, University of California, Berkeley, 1973.
[30] J. Frey, T. Tannenbaum, I. Foster, M. Livny, S. Tuecke, "Condor-G: a computation management agent for multi-institutional grids," Cluster Computing, vol. 5, pp. 237-246, 2002.