ICAT Job Portal: a generic job submission system built on a scientific data catalog

Stephen M Fisher
Scientific Computing Department, Rutherford Appleton Laboratory, Didcot, OX11 0QX, UK
Email: dr.s.m.fisher@gmail.com

Kevin Phipps
Scientific Computing Department, Rutherford Appleton Laboratory, Didcot, OX11 0QX, UK
Email: kevin.phipps@stfc.ac.uk

Daniel J Rolfe
Central Laser Facility, Research Complex at Harwell, Rutherford Appleton Laboratory, Didcot, OX11 0QX, UK
Email: daniel.rolfe@stfc.ac.uk

Abstract—The value of metadata to the scientist is well known: with the right choice of metadata, data files can be selected very quickly without having to scan through huge volumes of data. The ICAT metadata catalog[1] (which is part of the ICAT project[2]) allows the scientist to store and query information about individual data files and sets of data files, as well as storing provenance information. This paper explains how a generic job management system, exposed as a web portal, has been built on top of ICAT. This gives the scientist easy access to a high performance computing infrastructure without allowing the complexities of that infrastructure to impede progress.

The aim was to build a job and data management portal capable of dealing with batch and interactive work that would be simple to use and that was based on tried and tested, scalable, and preferably open source technologies. For the team operating the portal, it needed to be generic and configurable enough that they can, without too much effort, modify their software to run within the portal, add new software, and create new dataset types and parameters. Modifications to existing software should be limited to saving and loading their datasets in a slightly different way so that, instead of just being saved to disk, they are registered within the system along with recording any provenance information.

I. INTRODUCTION

The ICAT Job Portal (IJP)[3] builds upon the tried and tested ICAT data catalog, an existing component written specifically to catalog datasets produced by scientific facilities. It uses ICAT as the central database component, which also provides authorization via a flexible rules based system. This means that users will only be shown datasets readable by them, and any datasets produced whilst using the Job Portal will also be protected by relevant permissions.

While developing a prototype portal to meet the needs of one group, it became apparent that it could be made generic and configurable enough to be used by a wide range of teams within the scientific community.

A. ICAT the metadata catalog

ICAT is a data catalog specifically aimed at scientific facilities, into which data are stored based on the following hierarchy of entities: Facility, Investigation, Dataset and Datafile. The “Facility” produces the data for a group of users associated with an “Investigation”. Within the investigation, “Datafiles” are grouped into “Datasets”.

Each entity has a small agreed set of attributes. To make the system extensible, parameter types can be defined and associated with one or more of the entity types. Actual parameters of those types can then be associated with the corresponding entities. For example, a parameter type could be defined for current measured in milliamps or elapsed time measured in seconds.

Further entities Application, Job, InputDataset and OutputDataset allow the provenance of datasets to be stored within the catalog, such that it is possible to trace a derived dataset back through a chain of applications and intermediate datasets to the original raw dataset.

ICAT is implemented as a SOAP based web service using the mechanisms provided by the Java Persistence Architecture (JPA) to connect to a relational database. ICAT has rule based authorization and a powerful query language which is translated into the JPA query language (JPQL).
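To give a flavour of this, the sketch below shows the kind of JPQL-style query that a search for datasets by parameter value might translate into. It is illustrative only: the endpoint, the session handling and the entity and field names are assumptions based on the hierarchy described above, not the exact ICAT schema.

    from suds.client import Client  # generic SOAP client; ICAT publishes a WSDL

    # Hypothetical endpoint; each deployment publishes its own WSDL.
    icat = Client("https://icat.example.org/ICATService/ICAT?wsdl").service

    session_id = "..."  # obtained from a site-specific login call, elided here

    # Find datasets carrying a 'current' parameter above 100 mA.
    datasets = icat.search(
        session_id,
        "SELECT ds FROM Dataset ds JOIN ds.parameters p "
        "WHERE p.type.name = 'current' AND p.type.units = 'mA' "
        "AND p.numericValue > 100.0",
    )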
The data files are not stored within ICAT itself, but are stored within an ICAT Data Service (IDS)[4], as explained below.

B. IDS the ICAT Data Service

This is a component, defined by its interface, which is able to store files and register their metadata in ICAT. It makes use of ICAT for authorization: if ICAT allows the file metadata to be written, then the IDS will allow the file to be written. Control of who can read follows the same pattern.
II. BACKGROUND

A. Use Case

This work was motivated by a request from the Lasers for Science Facility (LSF) of the UK Science and Technology Facilities Council (STFC) to help them with their data. The LSF operates the OCTOPUS imaging cluster[5], a central core of lasers coupled to a set of advanced interconnected microscopy stations that can be used to image samples from single molecules to whole cells and tissues. They had accumulated a large number of data files stored in a directory structure. They had both a range of applications to process and visualise that data[6] and an interactive program with an easy to use GUI that would scan through a selected part of the file system to collect information in memory about their data, then offer lists of raw datasets and lists of processed datasets
and offer the ability to process those datasets with a fixed set of interactive jobs. The main problem with this solution was that it was not scalable: the user had to restrict himself¹ to a small part of the available data each time the GUI was launched, as the program had to scan the data afresh each time it was started, which took time proportional to the volume of data. In addition, the user needed a machine allocated to him, with a personal account on that machine, to allow him to run his work. This machine was hidden from off site users by a firewall, requiring his presence on site or the use of a VPN. Relieving the bottlenecks of data, job and user management would enable a significant improvement to the user experience and more effective exploitation of the OCTOPUS facility.

¹ Gender specific terminology should be interpreted as non-gender specific throughout this paper.

After development of a prototype solution it was realised that there was a need for a generic solution, so that LSF could quickly and easily add new dataset types and job types without needing to go back to the developers to make coding changes. Our funders also favoured a generic solution that could be deployed for other facilities, which led to formulating a set of requirements, some of which are listed in the next section.
B. Requirements

Following analysis of the prototype the requirements were refined. Some of the key requirements are listed below.

1) System accessible via both GUI and command line from on and off site.
2) All the systems should have automated installation of OS and software updates.
3) Centralised user/group management.
4) A file server must be able to store raw data from microscopes, analysed data and other user data. All data must be backed up and “old” data migrated, with an easy mechanism to restore it when needed.
5) All data should be managed, with a single point to consult the metadata to find out what is where.
6) Ability to upload and download data.
7) The ability to submit batch jobs to a set of Linux nodes, some with CUDA GPU capability. Listing, cancelling and retrieving output from jobs must also be supported.
8) The ability to run interactive GUI based analysis/visualisation jobs able to access data.
9) Select and submit multiple datasets for processing through applications. This must cover both multiple jobs with one dataset per job and a single job which will process all selected datasets.
10) Any menus must be configurable, as must the types of datasets that can be stored, the jobs that can be run and the job parameters associated with a job type.
C. Possible solutions

Consideration was given to OMERO[7]; however, this is more suited to viewing and performing simple analysis of images, rather than the specialised analysis codes developed by LSF, and it does not meet requirement 10.

IBM’s Platform Application Center[8] provides a means to describe jobs in XML and submit them; however, though it does meet requirement 10, it fails requirements 4 and 5.

The Galaxy portal[9] is quite close to meeting our requirements and also provides workflow support. Its main drawback is that it describes itself as a genomics workbench and as such is too focused on one discipline. The Galaxy paper also contains interesting comparisons with other genomics workbenches.

As we have a good metadata catalog, ICAT, and a matched data service, the IDS, we decided to build directly on those components.

III. SYSTEM ARCHITECTURE

Fig. 1. Architecture overview: the user’s PC connects via https (web browser) or RDP (remote desktop client) to the ICAT Job Portal webapp on the head node, which submits batch jobs to the Torque batch server and assigns interactive jobs; the Torque worker nodes (Worker Node 1 to n) run the facility software.

Fig. 2. The head node: a JEE application server hosting ICAT, the IDS and the Job Portal, backed by the metadata database, file storage, the jobs database, and the XML job descriptions and job dataset parameters, alongside the Torque batch server.

The architecture shown in Fig. 1 is based around a single head node acting as a central point for all communications, and an extensible number of worker nodes which can be added to in the future in order to increase the job handling capacity of the system.

The head node, which is shown in more detail in Fig. 2, hosts an application server (currently Glassfish) running the Job Portal, ICAT and IDS software, and acts as the head node for a batch system (currently Torque[10]).

Worker nodes have this role within the batch system but may also be assigned temporarily to a user for interactive work.
They should be capable of running all the facility software that users require, and they are able to communicate with ICAT for metadata and with the IDS for data, both of which run on the head node.

A. Batch jobs

It is essential that a batch job belonging to one user cannot access the account of any other user. To achieve this a batch job is submitted to run on an account chosen randomly from a pool. Each worker node is configured to run a very small number of concurrent jobs. The job has a prologue which is run before the user’s job. This tries to get a lock by creating a directory, to ensure that two jobs cannot run simultaneously under the same account. If it fails it will issue a return code that causes the job to be requeued. The epilogue, which is run after the job, frees the lock if it is run by the same job that created it. The batch pool should be sufficiently large that requeuing is rare. There is a mechanism to tidy up if things go wrong.
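A minimal sketch of this locking scheme is shown below, assuming it is installed as the Torque prologue and epilogue for the pool accounts. The requeue exit value is an assumption: Torque maps particular prologue exit codes to actions such as requeue, and the exact values are version dependent.

    import os
    import sys

    REQUEUE = 2  # assumed prologue exit code that makes Torque requeue the job

    def lock_dir():
        # One lock per pool account; mkdir is atomic, so two concurrent jobs
        # under the same account cannot both succeed.
        return "/tmp/ijp-lock-%s" % os.environ.get("USER", "unknown")

    def prologue(job_id):
        try:
            os.mkdir(lock_dir())
        except OSError:
            sys.exit(REQUEUE)  # account already in use: ask for a requeue
        with open(os.path.join(lock_dir(), "owner"), "w") as f:
            f.write(job_id)    # record which job holds the lock

    def epilogue(job_id):
        owner = os.path.join(lock_dir(), "owner")
        # Only the job that created the lock may free it.
        if os.path.exists(owner) and open(owner).read() == job_id:
            os.remove(owner)
            os.rmdir(lock_dir())

    if __name__ == "__main__":
        # Torque passes the job id as the first argument to the separately
        # installed prologue and epilogue scripts; a mode argument is used
        # here purely so that both can be shown in one file.
        mode, job_id = sys.argv[1], sys.argv[2]
        prologue(job_id) if mode == "prologue" else epilogue(job_id)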
B. Interactive Jobs

Although most batch systems do have some kind of interactive job capability, we found it convenient to provide the desired functionality outside the batch system. For these jobs, the most lightly loaded worker node is found, any running batch jobs are suspended, the node is made temporarily unavailable for new batch jobs, and the user is given exclusive use of that node to run the interactive job. To achieve this, the user is supplied with a username taken from a pool and a temporary password, allowing a remote desktop connection to be established via the RDP protocol to the worker node. This will typically be either via the Remote Desktop Connection application in Windows or using the rdesktop command on Linux systems. The account will have been configured such that the interactive job that the user has requested will start automatically. The user is only given a short time to connect to the worker node machine before the password is removed. Once the user has logged out, the system will remove the account, along with any local files that may be left, release any suspended jobs that were on the machine and make the machine available to the batch system again.
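The sequence might look like the sketch below. The qsig and pbsnodes commands are standard Torque tools; the account pool, the ssh/chpasswd password handling and all names are illustrative assumptions rather than the IJP’s actual implementation.

    import random
    import string
    import subprocess

    ACCOUNT_POOL = ["pool01", "pool02", "pool03"]  # assumed pool of accounts

    def temporary_password(length=12):
        alphabet = string.ascii_letters + string.digits
        return "".join(random.choice(alphabet) for _ in range(length))

    def allocate_interactive(node, running_job_ids):
        # Suspend any batch jobs already on the node and take it out of
        # service so that the scheduler sends no new work there.
        for job_id in running_job_ids:
            subprocess.check_call(["qsig", "-s", "suspend", job_id])
        subprocess.check_call(["pbsnodes", "-o", node])

        # Hand the user a pool account with a short-lived password.
        account = random.choice(ACCOUNT_POOL)
        password = temporary_password()
        subprocess.run(["ssh", "root@" + node, "chpasswd"],
                       input="%s:%s\n" % (account, password),
                       universal_newlines=True, check=True)
        return account, password

    # After the session: remove the account and its files, release the
    # suspended jobs (qsig -s resume) and put the node online (pbsnodes -c).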
C. Ganglia monitoring

All nodes within the system are configured to make use of the Ganglia Monitoring System. Currently this is being used to select the most lightly loaded machine in the cluster when an interactive job is requested. It allows a single XML stream from the Ganglia host on the head node to be parsed, giving an instant overview of the loading of each machine. Nagios monitoring is also installed but it is not an essential part of the system.
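A sketch of that selection is given below. It assumes the Ganglia daemon on the head node serves its XML dump on the default port and reports the standard load_one metric; the host name and port are illustrative.

    import socket
    import xml.etree.ElementTree as ET

    def least_loaded_node(host="head-node", port=8651):
        # Ganglia emits one XML document per connection and then closes it.
        with socket.create_connection((host, port)) as sock:
            chunks = []
            while True:
                data = sock.recv(8192)
                if not data:
                    break
                chunks.append(data)
        root = ET.fromstring(b"".join(chunks))
        # Collect the one-minute load average reported for each host.
        loads = {}
        for host_elem in root.iter("HOST"):
            for metric in host_elem.iter("METRIC"):
                if metric.get("NAME") == "load_one":
                    loads[host_elem.get("NAME")] = float(metric.get("VAL"))
        return min(loads, key=loads.get)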
D. Job Status Information

The batch system is not well suited for holding job status information for an extended period. In addition, the portal needs to hold information about jobs that are not known to the batch system. Therefore the portal maintains its own records and periodically harvests information from the batch system.
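The harvesting step could look like the following sketch. It assumes Torque’s qstat can emit XML via -x (true of recent versions, though the schema varies) and a jobs table of the portal’s own; the table and its columns are illustrative.

    import sqlite3
    import subprocess
    import xml.etree.ElementTree as ET

    def harvest_job_states(db_path="jobs.db"):
        xml_out = subprocess.check_output(["qstat", "-x"])
        if not xml_out.strip():
            return  # no jobs currently known to the batch system
        db = sqlite3.connect(db_path)
        for job in ET.fromstring(xml_out).iter("Job"):
            db.execute("UPDATE jobs SET state = ? WHERE batch_id = ?",
                       (job.findtext("job_state"), job.findtext("Job_Id")))
        db.commit()
        db.close()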
E. Command Line Interface

With the addition of a RESTful web service on the server, a Python client has been provided to allow interaction with the Job Portal via a command line interface. Both of these are very thin layers totalling only a few hundred lines of code. This provides an alternative to the GUI which may prove to be the preferred way for more proficient users to interact with the portal, and would be the interface of choice for anyone looking to write a script to handle their data processing.
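Scripted use might then look like the sketch below. The endpoint path, parameter names and response layout are illustrative assumptions, not the published interface of the IJP’s RESTful service.

    import json
    import urllib.parse
    import urllib.request

    BASE = "https://head-node:8181/jobportal"  # hypothetical deployment URL

    def submit_job(session_id, job_type, dataset_ids):
        body = urllib.parse.urlencode({
            "sessionId": session_id,
            "jobType": job_type,
            "datasetIds": ",".join(dataset_ids),
        }).encode()
        with urllib.request.urlopen(BASE + "/submit", body) as resp:
            return json.load(resp)["jobId"]  # id assigned by the batch system

    # e.g. submit_job(session, "track-features", ["ds-1041", "ds-1042"])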
F. Automated Configuration

The installation, configuration and upgrading of the software has been set up using the Puppet Open Source[11] framework. This means that, starting with computers with an operating system installed and configured to use the network, it is possible to install the head node within an hour, and each worker node can be added in a few minutes. The result is a working system including the Java Development Kit, a Glassfish Application Server (running ICAT, IDS and the Job Portal software), database servers and required databases, the batch system, monitoring (Ganglia and Nagios) and the scientific software provided by the team operating the portal.
IV. CREATION AND USE OF METADATA

The use of metadata is essential to the operation of the IJP. Because it is a generic tool, the portal itself is not able to look inside domain specific datasets. It is entirely reliant on the metadata inserted into the ICAT database, and uses only this metadata for searching and displaying information.

When an instrument produces data, these are typically written to a local file store from which they can be ingested into the IJP system. The best people to define the metadata to associate with this raw data are the team conducting the experiment. An IJP job can be submitted each time that data need to be ingested. This job must be able to derive the metadata from the available information and upload the data files themselves to the IDS, as well as creating entries in the ICAT database for the metadata.

When data are processed by an IJP job this results in new data and metadata being stored. It is the responsibility of the job to identify useful pieces of metadata to allow datasets to be subsequently selected. As it is difficult to identify all the metadata that might eventually be useful, jobs can be written to look at the data and add metadata to ICAT to hold more information about existing datasets.

The three categories of job described here (ingestion, derivation of processed data and augmentation of metadata) are all just jobs for the IJP and must be installed by the facility for its users.
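An ingestion job of the kind described above might be as simple as the sketch below, which derives token metadata from the directory layout. The icat and ids helpers stand for site-specific client libraries and are assumptions; real jobs would derive much richer metadata.

    import os

    def ingest(directory, investigation, icat, ids, session_id):
        # The dataset takes its name from the directory; its type and
        # parameters follow the team's own conventions for raw data.
        files = sorted(os.listdir(directory))
        dataset = icat.create_dataset(session_id, investigation,
                                      name=os.path.basename(directory),
                                      type="raw")
        icat.add_parameter(session_id, dataset, "numFiles", len(files))

        # Upload each file to the IDS, which registers a Datafile in ICAT.
        for fname in files:
            with open(os.path.join(directory, fname), "rb") as f:
                ids.put(session_id, dataset, fname, f)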
V. A USER’S VIEW OF THE PORTAL

Fig. 3. Screenshot of the IJP.

Users access the job portal via a web browser as shown in Fig. 3. The browser interface was developed in Java using the Google Web Toolkit[12] and communicates with a number of servlets running on the application server on the head node. Once logged in, the user is presented with a number of search options tailored to the user base of the portal, and a generic search widget listing all of the dataset parameters that are searchable. The widget provides relevant search options for each parameter: =, !=, >, >=, <, <=, LIKE and BETWEEN, depending on its type: string, numeric or date/time. The list of parameters and their types is read from the underlying ICAT database so that the portal software remains generic.

Within ICAT all datasets have to be of a type which has been pre-defined before the dataset is registered. This allows for easier searching of datasets. Once a user has selected the type of dataset in which they are interested, they can narrow down their search if they wish using the search options, then click search. A list of matching datasets then appears in the central panel. When one of these datasets is selected, all of its dataset parameters are displayed in the lower panel. Having selected a dataset, the central Options select box lists all of the jobs that it is possible to run on that dataset type. After selecting the desired job, a Job Options Form is displayed allowing the user to pass particular parameters to the job, if required. This form is automatically generated from an XML file defining the job within the system, as shown in Fig. 4. The options displayed can also be tailored so that only options relevant to the chosen dataset are offered.

Submitting the form results in the job being submitted to the server and the user receiving a response containing the ID assigned to the job in the batch system. The user can then use the Job Status tab to follow the progress of the job through the batch system, checking the output and error logs if required and monitoring the status until the job is complete.

As well as handling interactive and batch jobs, the portal is able to handle jobs that take either a single dataset or multiple datasets as input. Users are able to select multiple datasets, and the portal uses the job definition to work out whether to submit multiple jobs each with a single dataset as input, or a single job with multiple datasets as the input. Where it is ambiguous, the user is asked to confirm what was intended.

Datasets remain registered within ICAT and available via the IDS. They are suitably protected via a rule based permission system which should have been configured to ensure that users can at least read the data they have created. These data will remain within the system. Should the user wish to download a copy of their data, this is possible via “Download” in the Options select box. There is also an option to display a URL to obtain the dataset from the IDS.

VI. AN ADMINISTRATOR’S VIEW OF THE PORTAL
Fig. 4.   Configuration of job options


Configuration of the portal is defined by XML files. Each team using the portal to run their software needs to have at least one person who is familiar enough with the team’s software and the datasets it uses to be able to set up each piece of software so that it can be run as a “job” by the portal. Firstly, there are two fairly straightforward tasks which need carrying out:

• picking out the characteristics of each dataset type which lead to different options being made available in the Job Options Form.

• creating an XML file describing each piece of software: whether it runs as an interactive or batch job, which type of datasets it needs, whether it accepts multiple input datasets, along with all of the various command line options that it accepts.
These two tasks are linked by the concept of Job Dataset Parameters. For each type of dataset, an XML file is set up allowing the administrator to define a named quantity and how it may be derived from an ICAT query. While the query can span all information the logged in user is allowed to see, a query might reasonably take into account information from the metadata associated with the dataset or any of its files, and might make use of the JPQL aggregate functions SUM, AVG, MIN, MAX and COUNT.

The administrator has thus defined named quantities specific to a dataset type and derivable by an ICAT query; examples include the number of files of a particular type or the size of the largest file in a dataset. When a dataset is selected within the browser, the server runs the database queries specified within the relevant XML file, generates a map of name-value pairs and sends it back to the browser to control what appears in the Job Options Form.

Within the XML specifying each of the command line options for a job, as shown in Fig. 4, a condition can be specified in terms of the named quantities defined in the XML which, if met, causes this option to appear on the Job Options Form. This takes the form of a logical expression such as numChannels == 3 && numHdfFiles > 500. If multiple datasets are selected, only the options that are common to all of those datasets are offered to the user.
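The effect of such a condition can be pictured as evaluating the expression against that map of name-value pairs. The sketch below is a Python rendering purely for illustration (in the IJP the evaluation happens in the browser-side GWT code); translating && and || and using a restricted eval are choices made only for this sketch.

    def option_visible(condition, params):
        # Translate the Java-style operators used in the XML into Python.
        expr = condition.replace("&&", " and ").replace("||", " or ")
        # Only the dataset's named quantities are in scope; suitable for
        # trusted, administrator-written conditions only.
        return bool(eval(expr, {"__builtins__": {}}, dict(params)))

    params = {"numChannels": 3, "numHdfFiles": 812}
    print(option_visible("numChannels == 3 && numHdfFiles > 500", params))  # True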
In addition to setting up these XML descriptor files, there is a certain amount of work that needs to be done in order to make the team’s existing software compatible with the job portal. This can be done either by modifying the existing applications or by providing job wrappers to perform tasks such as obtaining data from the IDS and laying it out as the program expects, storing resulting datasets back in the IDS and recording provenance information. Python libraries are being established to simplify these operations: there is a generic library, and we recommend using a facility specific library that knows the facility conventions for layout of data.
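A job wrapper along these lines is sketched below. The legacy program name and the icat/ids helper calls are assumptions standing in for the generic and facility specific Python libraries mentioned above.

    import os
    import subprocess
    import tempfile

    def run_wrapped(session_id, icat, ids, input_dataset, application):
        workdir = tempfile.mkdtemp()

        # Fetch the input files and lay them out as the program expects.
        ids.get_dataset(session_id, input_dataset, into=workdir)

        # Run the existing application unchanged.
        subprocess.check_call(["legacy-analysis", workdir])

        # Store the results as a new dataset and record its provenance
        # using the Application, Job, InputDataset and OutputDataset entities.
        output = icat.create_dataset(session_id,
                                     name=input_dataset.name + "-processed",
                                     type="processed")
        ids.put_directory(session_id, output, os.path.join(workdir, "results"))
        icat.create_job(session_id, application,
                        inputs=[input_dataset], outputs=[output])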
VII. CURRENT STATUS

Having developed a prototype to prove the concept and help the users to define the features that they need from the IJP, we are currently completing the work and plan to have a first deployment for production use in a few months’ time.

VIII. FUTURE DEVELOPMENTS

We anticipate that requirements will be clarified further once we receive feedback from users of the deployed
production system. Based on existing feedback we are already planning the following enhancements.

A. Visualisation of Provenance

Provenance information is stored within ICAT when a new dataset is stored, but there is currently no way to visualise this information within the portal. A new panel will be added to represent the provenance information in a graphical format. This will allow the user to select the dataset they are interested in and expand it to see the input and output datasets and files associated with it. Those datasets and files can, in turn, be selected and expanded to follow the chain of provenance.

A further development would be the addition of a provenance based search facility which would allow searches such as all datasets derived from a given dataset, or all datasets produced directly or indirectly by a specific version of an application.
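Such a traversal would follow the Job, InputDataset and OutputDataset entities introduced in section I-A; a recursive sketch is given below, with JPQL-flavoured query strings whose entity and field names are assumptions rather than the exact schema.

    def upstream(icat, session_id, dataset_id):
        """Yield (job, input_datasets) pairs back towards the raw data."""
        jobs = icat.search(session_id,
            "SELECT j FROM Job j JOIN j.outputDatasets o "
            "WHERE o.dataset.id = %d" % dataset_id)
        for job in jobs:
            inputs = icat.search(session_id,
                "SELECT i.dataset FROM InputDataset i "
                "WHERE i.job.id = %d" % job.id)
            yield job, inputs
            for ds in inputs:                      # recurse towards raw data
                for step in upstream(icat, session_id, ds.id):
                    yield step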
B. Workflow Support

It would be a particularly useful feature to have the Job Portal integrated with a Workflow Management System. This would make it possible to set off a chain of data processing jobs with the output of the first job becoming the input to later jobs, and so on. As the job relating to each stage of the process completes, the next job in the workflow is automatically submitted on behalf of the user.

One workflow management system which would be of particular interest is Taverna[13]. It is open source, domain independent and written in Java, and therefore should integrate well with the server side of the portal software, which is also written in Java. Taverna has already been used behind a portal[14] by a number of projects, which demonstrates its suitability to being used in this way.

C. Software as Data

Initially our preferred solution for deploying facility software was to use the native packaging system of the operating system, typically RPMs or DEBs. While convenient for the IJP developers, this may not meet the needs of some of our users who would like more freedom to run the software of their own choosing without arranging to have it officially installed, and who want to have multiple versions of the software available. Following a suggestion[15], we are considering the implications of storing a job as a dataset known to ICAT. A job wrapper would then first download the application software before setting up the data and running the downloaded application. This would probably require some kind of caching mechanism and would require a means of specifying software dependencies to ensure that the correct packages are available for the desired software. This solution, which will require some operations to be run as root to install dependencies, needs careful evaluation.

D. Administration console

The system is currently rather opaque to the administrator and requires the use of the native batch system commands to find out what is going on. We plan to provide a browser based web application allowing administrators to monitor and control the system and the jobs it is running. This will support common tasks such as monitoring job distribution and loading on the worker nodes, pausing and terminating jobs, taking worker nodes offline and bringing them back online, user and group administration, modification of authorization rules and removal of unwanted datasets.

E. Alternative remote desktop mechanism

A possible alternative to using either Remote Desktop Connection in Windows or rdesktop on a Linux system is to have the remote desktop session also run within the browser. Currently, the RDP server port needs to be accessible on each of the worker nodes, which is not a problem within the local site network. The system is, however, intended to be used remotely from other institutions, where opening that port may contravene security policies. Having the possibility of running the remote desktop session via https within a browser may be the solution.

One solution of interest to solve this problem is Guacamole[16], an HTML5 clientless remote desktop. It supports remote desktop protocols such as VNC and RDP, and is able to deliver a remote desktop within a web browser without the need for any browser plugins or client software installation.

F. Alternative batch system

We currently only support Torque as a batch system. We plan to include Maui as a scheduler, because the inbuilt Torque scheduler (pbs_sched) is very basic. Maui would enable scheduling policies to be defined to allow more control of which job is selected to be run when a slot becomes free.

We also plan to make the choice of batch system configurable. The batch system might even act as a front-end to a grid or cloud solution. We already have a request to support IBM Platform LSF[17].

G. Portability

The Puppet configuration is only available for Ubuntu[18] and has only been tested on version 12.04 (64 bit). This is a concern for existing infrastructures which are not able to easily accommodate these decisions. We plan to make the system easy to install on other platforms and to support alternative subcomponents where practical.

We have a request to support Red Hat Enterprise Linux[19] version 6 (64 bit) and will probably include CentOS[20] version 6.4 (64 bit) at the same time.

IX. CONCLUSION

We have successfully built a job portal for ICAT users on top of the basic metadata catalog and the IDS. The initial prototype was very valuable as it allowed us to get something out quickly, to ensure that we were on the right track and to understand what needed generalising.

Though the generalisation was not a trivial task, the result is a tool that we believe is now very easy to configure for many scientific disciplines.

The IJP allows rapidly changing, mature and wrapped “legacy” software to be made available, side by side, with
a uniform and modern style of interface to a scientific community.

We already have a number of groups from the existing ICAT community interested in the project and we anticipate a good uptake of the software.

ACKNOWLEDGMENT

The authors would like to thank Dave Clarke from STFC’s Lasers for Science Facility for supporting this work and attracting funding.

We would like to acknowledge the assistance and funding from STFC’s Harwell Imaging Partnership, which has supported this development from inception (http://www.stfc.ac.uk/hip).

The diagrams in this paper were produced by Noris Nyamekye.

Finally we thank our colleagues Brian Mathews, Alistair Mills and Erica Yang, who all provided helpful comments on drafts of this paper.

REFERENCES

[1] The ICAT Metadata Catalog website. [Online]. Available: http://code.google.com/p/icatproject/
[2] The ICAT project website. [Online]. Available: http://www.icatproject.org/
[3] The ICAT Job Portal website. [Online]. Available: http://code.google.com/p/icat-job-portal/
[4] The ICAT Data Service website. [Online]. Available: http://code.google.com/p/icat-data-service/
[5] D. T. Clarke, S. W. Botchway, B. C. Coles, S. R. Needham, S. K. Roberts, D. J. Rolfe, C. J. Tynan, A. D. Ward, S. E. D. Webb, R. Yadav, L. Zanetti-Domingues, and M. L. Martin-Fernandez, “Optics clustered to output unique solutions: A multi-laser facility for combined single molecule and ensemble microscopy,” Review of Scientific Instruments, vol. 82, no. 9, p. 093705, 2011. [Online]. Available: http://link.aip.org/link/?RSI/82/093705/1
[6] D. Rolfe, C. McLachlan, M. Hirsch, S. Needham, C. Tynan, S. Webb, M. Martin-Fernandez, and M. Hobson, “Automated multidimensional single molecule fluorescence microscopy feature detection and tracking,” European Biophysics Journal, vol. 40, no. 10, pp. 1167–1186, 2011. [Online]. Available: http://dx.doi.org/10.1007/s00249-011-0747-7
[7] The OMERO website. [Online]. Available: http://www.openmicroscopy.org/site/products/omero
[8] The IBM Platform Application Center. [Online]. Available: http://www.ibm.com/support/entry/portal/documentation_expanded_list/software/platform_computing/platform_application_center
[9] J. Goecks, A. Nekrutenko, J. Taylor, and The Galaxy Team, “Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences,” Genome Biology, vol. 11, no. 8, p. R86, 2010. [Online]. Available: http://genomebiology.com/2010/11/8/R86
[10] Adaptive Computing’s Torque website. [Online]. Available: http://www.adaptivecomputing.com/products/open-source/torque/
[11] The Puppet Open Source website. [Online]. Available: https://puppetlabs.com/puppet/puppet-open-source/
[12] The Google Web Toolkit website. [Online]. Available: https://developers.google.com/web-toolkit/
[13] P. Missier, S. Soiland-Reyes, S. Owen, W. Tan, A. Nenadic, I. Dunlop, A. Williams, T. Oinn, and C. Goble, “Taverna, reloaded,” in SSDBM 2010, M. Gertz, T. Hey, and B. Ludaescher, Eds., Heidelberg, Germany, June 2010. [Online]. Available: http://www.taverna.org.uk/pages/wp-content/uploads/2010/04/T2Architecture.pdf
[14] Taverna: Behind a portal. [Online]. Available: http://prototype.taverna.org.uk/introduction/taverna-in-use/portal/
[15] Rich Wareham, Cambridge, private communication, 2012.
[16] The Guacamole website. [Online]. Available: http://guac-dev.org/
[17] IBM Platform LSF. [Online]. Available: http://www.ibm.com/systems/technicalcomputing/platformcomputing/products/lsf/index.html
[18] The Ubuntu website. [Online]. Available: http://www.ubuntu.com/
[19] Red Hat Enterprise Linux. [Online]. Available: http://www.redhat.com/products/enterprise-linux/
[20] CentOS: The Community ENTerprise Operating System. [Online]. Available: http://www.centos.org/