=Paper= {{Paper |id=None |storemode=property |title=Improving the Swiss Grid Proteomics Portal: Requirements and new Features based on Experience and Usability Considerations |pdfUrl=https://ceur-ws.org/Vol-993/paper18.pdf |volume=Vol-993 |dblpUrl=https://dblp.org/rec/conf/iwsg/KunsztBHSSWRERQM13 }} ==Improving the Swiss Grid Proteomics Portal: Requirements and new Features based on Experience and Usability Considerations== https://ceur-ws.org/Vol-993/paper18.pdf
Improving the Swiss Grid Proteomics Portal
Requirements and New Features Based on Experience and Usability Considerations


Peter Kunszt, Lorenz Blum, Béla Hullár, Emanuel Schmid, Adam Srebniak, Witold Wolski
SystemsX.ch SyBIT, ETH Zurich and Swiss Institute of Bioinformatics
Zurich, Switzerland

Bernd Rinn, Franz-Josef Elmer, Chandrasekhar Ramakrishnan
Center of Information Sciences and Databases, ETH Zurich, Department of Biosystems Science and Engineering, and Swiss Institute of Bioinformatics
Basel, Switzerland

Andreas Quandt, Lars Malmström
Institute for Molecular Systems Biology, ETH Zurich, Department of Biology
Zurich, Switzerland

    Abstract—We have received feedback from our users and supporters on the functionality and usability of the Swiss Grid Proteomics Portal during its first year of operation. We have also realized which aspects of the portal could be improved upon through frequent monitoring and interaction with the production system under heavy use. In a second, highly upgraded version of the Swiss Proteomics Portal, called iPortal, we have introduced several new concepts based on this feedback and on both user and supporter experience. In this paper we detail the requirements and the improvements we have made, and also give an outlook on possible future improvements.

    Keywords—gateway; ease of use; portal; proteomics

I. INTRODUCTION
    Recent advances in observational technologies have turned the Life Sciences into a data-intensive science. Microscopy imaging, mass spectrometry, gene sequencing and other technologies are available at relatively low cost to research labs, turning many labs into large data producers. The precision with which biological processes can be observed today provides researchers with a very large amount of complex information, which has to be analyzed, processed and understood. The relatively new field of Systems Biology aims to integrate and model several scales of observational data of a given biological system, which can be a cellular organism like yeast or an organ of a larger organism like the wing of the fruit fly. The system is analyzed as a whole and models are built to understand its behavior. Due to the many layers of complexity already involved, researchers need specialized assistance to deal with the complexity of the digital infrastructure involved. The SyBIT project of SystemsX.ch, the Swiss National Initiative in Systems Biology, has been set up to provide this support to all research projects in the SystemsX.ch initiative, which involves over 200 research labs in Switzerland. With the additional SystemsX.ch funding, many new instruments were provisioned at the participating institutions. Several projects are producing raw observational data on a large scale, on the order of terabytes per instrument per month, per week or in some cases per day. SyBIT collects the requirements on data processing and works with the local and central resource providers to make sure that the necessary infrastructure is available for data storage and data processing for all projects. SyBIT also provides and supports middleware to manage and catalog the large amounts of collected data. Finally, SyBIT maintains and supports a whole toolbox of software to make sure that all projects' needs are met. Most of the middleware and tools are already well-established community-standard tools and libraries, to which SyBIT contributes wherever needed, improving their functionality through software engineering to the benefit of these communities. SyBIT also provides training to the research groups in the usage of these tools and the integrated research infrastructure.

    In a previous publication we have already described the Swiss Grid Proteomics Portal [1], which aims to provide an easy-to-use but powerful portal for standardized proteomics data analysis. In this paper we elaborate on the experience of operating the portal, which led to new requirements and the implementation of new features. Many of the considerations that led to better usability may be relevant to similar efforts, and we summarize our best practices for sustained operation of the improved proteomics portal, which we now call iPortal. The name was inspired by the ease of use of Apple’s products: we also want to give our users a fun experience and a self-explanatory portal interface.

    This work was supported by SystemsX.ch, the Swiss National Initiative for Systems Biology, and by the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 283481 (SCI-BUS project).
II. ADDRESSING REQUIREMENTS IN PROTEOMICS

    In several SystemsX.ch projects, proteomics data needs to be collected as part of the overall systems biology analysis. Mass spectrometry is used to identify and quantify the protein content of a given biological sample. The analysis of the raw data collected by the mass spectrometers is a research domain of its own, and there is a very large number of community tools available to reconstruct the protein content from the mass spectra generated by the instruments. The complexity of proteomics data analysis is high, as the analysis itself depends strongly on the sample being observed and the biological question at hand. Until recently, all analysis of mass spectrometry data was done through a series of manual steps, making use of individual command-line tools with different parameters for the heterogeneous data analysis. First, the raw data formats produced by the various instruments need to be converted to open standard formats. Then they are analyzed using one or several of the available community tools. Often these tools were produced as part of a research project, with poor adherence to standard data formats, so a lot of data transformation is involved to ensure that the output of one tool can serve as the input of another. Scripts and specialized workflows have been built by the researchers and bioinformaticians in proteomics to automate some of these steps, but they were often not kept track of, or had hardcoded elements for specific environments, people or projects, and were not really intended or suitable for reuse by others.

    In SyBIT we needed to address several issues to enable reusable, traceable proteomics analysis workflows for large amounts of proteomics data. First, we had to make sure that all raw data is well tracked and annotated for future reference. Data needs to be searchable based on criteria like project, observer, timestamp, biological context and other user-defined parameters. We also had to make sure that the raw data is stored such that it can be retrieved easily for analysis and for future re-analysis. For traceable, large-scale data management we are using the open biology information system openBIS [3], which we are continuously improving to support our communities. Second, we needed to find a way to process the data using the various analysis tools available in the community. We need to be able to adapt and change the data processing pipelines while keeping track of the steps and parameters involved, to ensure that the results are traceable and reproducible. For this purpose we have built the Swiss Grid Proteomics Portal [1], based on the P-GRADE grid portal system [2]. This first portal was put into production in 2010.
III. EXPERIENCE WITH THE FIRST PROTEOMICS PORTAL AND NEW REQUIREMENTS

    We have made several observations and collected feedback from the users and the resource operators of the Swiss Proteomics Portal by interacting with the users, either in direct personal discussions or through email. We have presented the portal at several internal seminars and we have provided training in its usage. We have also collected all the requirements in our bug tracking system and have evaluated their relevance regularly in user meetings. The observations are as follows:

1) Resource providers were not happy with the portal being operated under a single user name, as it did not let them distinguish who makes use of their resources; in our case this concerned the local HPC cluster managers. New users are required to sign the cluster usage policies and rules, and this did not happen through the portal.

2) The existing workflows did not offer enough customizability for the end user. Also, setting parameters was not straightforward.

3) Error tracing through the individual workflow steps was extremely difficult.

4) Running and rerunning the workflows was not intuitive for the user, and it was not straightforward for an administrator to see what went wrong in order to help the user out in a short timeframe. Debugging took too much effort.

5) We also tried to use Grid Certificates, but too many users were not able to make use of them on their own.

6) There is a very large heterogeneity in the input datasets needed for the protein identification workflows. It was very cumbersome and error-prone to select different input datasets.

7) Developers of completely new algorithms also need access to a portal-like infrastructure. This kind of easy deployment of high-turnaround custom workflows was not possible with the first portal.

    In the context of the EU FP7 SCI-BUS project we have upgraded the portal to the next-generation technology using the Liferay portal technology and the gUSE/WS-PGRADE workflow managers [10,11]. By changing to a more modern, modular technology we could start to address the issues observed by the users of the first portal, extending and improving it. We could formulate the following high-level requirements based on the feedback above:

1) Authentication and authorization: each user needs to use their own cluster account, to be requested and signed for separately, adhering to the local usage policy.

2) The individual workflows need to be configurable to a much higher degree, giving many more possibilities to customize the workflow.

3) Input and output management from and to the individual workflow nodes needs to be handled at a lower level. Error reporting and logging need to be standardized for the workflow system to be able to cope with the various failure modes.

4) The usability and intuitiveness of the portal need to be heavily improved. Researchers not intimately involved in data processing algorithms need to be guided through the process of selecting and configuring a workflow, associating their data with it, and retrieving and registering their results. Workflows need to be categorized by research topic. Monitoring of running workflows also needs to be simplified for the end user, but an administrator should be able to dig down into the relevant logs in case of failures. Administrators need to see the logs of all running workflows to be able to invoke procedures to rescue failed jobs, also through the portal.

5) If Grid Certificates are used, they should be invisible to the user.

6) The UniProt/SwissProt reference datasets used in protein identification workflows often need to be extended with specific proteins being searched for in a given experiment. These reference datasets need to be easily managed and selected for identification workflows by the user.

7) Workflow algorithm developers need either a pluggable architecture to modify existing workflow nodes easily, or a mechanism to submit workflows outside of the interactive portal.

IV. IMPLEMENTATION OF THE REQUIREMENTS

    For all of the requirements above we have implemented modules or extensions to the Proteomics Portal. We call the new Liferay/gUSE/WS-PGRADE-based portal, including all these extensions, iPortal, to distinguish it from the previous P-GRADE-based Swiss Grid Proteomics Portal versions.
A. Integrating with Cluster Authorization

    The first requirement is to make use of individual user accounts on the local cluster. Having individual user accounts was a requirement of the cluster operators at ETH Zurich. Each user has to request an account on the local Brutus cluster through the usual means, by filling out a web form and agreeing to the terms of use. We have built a portlet into iPortal that comes into play whenever a new user is registered. This portlet is activated at the first log-in of a new user.

    For each user, we create a new OpenSSL public-private keypair. We store the private key in the secure portal database. At the first login, the user is asked to log into the cluster (a popup window asks him or her to enter their cluster username and password), and a session is established to their cluster user account. The portal copies the new public key into their ssh directory as a new authorized key. Also, a new configuration file is added that will be sourced whenever the portal submits jobs on the user’s behalf by making use of their account.

    From then on, the interaction with the cluster always goes through the individual user’s account. Of course, iPortal users need to apply for and receive a cluster account before being able to register for the iPortal. The popup window requesting their cluster account credentials informs them of this fact and provides a link to the cluster registration page. Through this mechanism we can completely fulfill the requirement for cluster registration and running cluster jobs under individual user accounts. We are in close collaboration with the cluster administrators to make sure that our security mechanisms are trusted and adhere to their policies.
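The key installation step above can be illustrated with a small sketch. It shows only the idempotent append to an `authorized_keys` file with the restrictive permissions sshd expects; the keypair generation and the credential-based session to the cluster are not shown, and the function name is ours, not the portal's.

```python
import os
import stat

def install_public_key(authorized_keys_path, public_key, comment="iportal"):
    """Append public_key to authorized_keys once (idempotent), keeping mode 0600."""
    entry = f"{public_key} {comment}\n"
    existing = ""
    if os.path.exists(authorized_keys_path):
        with open(authorized_keys_path) as fh:
            existing = fh.read()
    if public_key in existing:
        return False  # already authorized, nothing to do
    with open(authorized_keys_path, "a") as fh:
        fh.write(entry)
    # sshd refuses keys kept in files readable by other users
    os.chmod(authorized_keys_path, stat.S_IRUSR | stat.S_IWUSR)
    return True
```

Making the installation idempotent matters here because the portlet may run again for the same user, for example after a failed first-time login.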
B. Using Grid Certificates

    We have implemented a mechanism to make use of Grid Certificates based on SAML assertions in a previous project called GridCertLib [4]. It can create a proper X.509 certificate based on the user’s AAI login, which is available to all researchers in Switzerland through the SWITCH-AAI countrywide service [5]. This would fulfill requirement 5), that users should not need to deal with Grid Certificates when submitting to distributed infrastructures.

    However, the GridCertLib implementation relies on the delegation feature of Shibboleth, which is a new feature and not yet available on the current infrastructure. All SWITCH-AAI enabled institutions would need to upgrade their identity provider service and configure it accordingly. This has unfortunately proven to be an insurmountable administrative hurdle for the past two years. As elegant as this solution is, it is not usable in practice. In the production Swiss Proteomics Portal, users are therefore still expected to upload their proxy certificates to a MyProxy server outside of the portal if they want to make use of Grid resources.

    We are now exploring other technologies that could also be used in the context of cloud infrastructures, but they are not mature enough yet. So this requirement is, unfortunately, still not met.

C. The Workflow Wizard

    The main change in the way the Swiss Grid Portal is now perceived by the end user was the introduction of the Workflow Wizard in the iPortal. It has been implemented as another Liferay portlet. Users can select the Workflow Wizard as one of the top-level tabs on the main page of the portal. The Workflow Wizard guides the user through a series of steps:

1) Workflow type selection. The first step is to select the type of workflow that the user wants to run. Currently there are two workflow types, search and quantification. The search workflows implement several flavors of the Trans-Proteomic Pipeline [6,7], which is used to identify the peptides and proteins in the raw data as received from the mass spectrometers in a proteomics experiment. The quantification workflows make use of the result of a search workflow and additional information based on historical reference data to quantify the abundance of the proteins with respect to one another in a given sample. Here, too, there are several workflows that can be selected based on the type of experiment and the type of sample being analyzed. The picture below shows this first step as it appears in the iPortal.

2) The second step selects the actual workflow of the given type. Each workflow comes with a name and a one-paragraph description, with a link to further information and a detailed workflow description in the project wiki pages. All of these workflows have been created and tested by workflow developers using the gUSE workflow editor. Regular users no longer need to interact with this editor; instead they are given the choice among many predefined workflows. The user can choose one of the workflows by clicking on the corresponding ‘choose’ button.
3) The third step is the selection of the data that is to be analyzed with the workflow. Depending on which workflow was chosen, the user is presented with the right type of data that he or she has access to, in a table format. This list is generated on the fly by submitting a query to the openBIS information system [3], where all the data are indexed and annotated with the relevant metadata. Users only see their own datasets or data they have been given access to by others.

    We had already decided to use openBIS as the data management hub for all of the proteomics data for the first version of the Swiss Proteomics Portal. Data that is generated at the mass spectrometers is automatically ‘uploaded’ into openBIS: a monitoring process checks the contents of the directory into which new datasets are written, scans these for automatically available metadata, registers the new data in the openBIS database and moves the data files to the central data store, from where they can be made available to the data consumers. These consumers can be processing steps as part of a workflow, making use of the rich openBIS API, or the data can be accessed interactively by the user over the web interface or directly through the APIs, for example in Matlab.

    The data registration is highly configurable: a Python script can be customized to extract the available metadata from the raw data and to register it in openBIS. For the researchers it is very convenient to see their newest mass spectrometry run already cataloged and available soon after their sample has been processed by the instrument. Users interact with openBIS through its web interface, where they can browse and access the data. We have consciously decided to keep the data management interface separate from the processing interface of iPortal, in order not to overload either one. Depending on what kind of workflow was chosen, the data selection can be a two-step process. In the case of quantification workflows, for example, first an experiment context needs to be chosen; then the datasets from within that experiment which should be analyzed with the given workflow can be specified. The next picture shows this data selection step.
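The registration process above can be pictured as a small monitoring script. The filename convention, the metadata fields and the `register` callback below are hypothetical placeholders for illustration; the production system uses customized Python scripts driven by openBIS, whose actual API is not reproduced here.

```python
import os
import re

# Hypothetical naming scheme: <instrument>_<yyyymmdd>_<sample>.raw
RAW_NAME = re.compile(r"(?P<instrument>\w+)_(?P<date>\d{8})_(?P<sample>\w+)\.raw$",
                      re.IGNORECASE)

def extract_metadata(path):
    """Pull instrument, acquisition date and sample name out of the file name."""
    match = RAW_NAME.match(os.path.basename(path))
    return match.groupdict() if match else None

def scan_dropbox(dropbox_dir, register):
    """Register every recognizable raw file in the incoming directory.

    The register callback stands in for creating the database entry and
    moving the file to the central data store.
    """
    registered = []
    for name in sorted(os.listdir(dropbox_dir)):
        path = os.path.join(dropbox_dir, name)
        metadata = extract_metadata(path)
        if metadata is not None:
            register(path, metadata)
            registered.append(path)
    return registered
```

In practice the metadata extraction would also read instrument headers from the raw file itself, not only the file name.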
4) The fourth step is the parameterization of the workflow. Each workflow comes with a number of predefined parameterizations, provided by the workflow developers, that are suitable for various workflow usage scenarios. The users can create their own parameterization by un-hiding the detailed parameter settings. They can then save their own parameter sets under a new name, which can be used for future workflow parameterizations. These parameterizations may also be shared among users. For search workflows, the parameterization step also includes the selection of the input database (see the BioDB section below).

5) Finally, the workflow is ready for submission. The user receives an overview of the workflow to quickly check that everything is in order or whether changes need to be made, in which case the ‘back to previous step’ link can be used to go back to the corresponding step and change the settings. If the workflow and its configuration are found to be correct, the workflow can be submitted by clicking on the ‘submit’ button. The wizard makes use of the gUSE Application Specific Module (ASM) interface to select and execute the predefined workflows through gUSE.

    Once a workflow has been successfully submitted, the workflow wizard asks the user whether another workflow should be created with the same settings as the current workflow. This was one of the requests we received from the users: often they want to submit the same workflow several times but with different datasets, and this helps them do so more quickly, as they only need to select the dataset; all the other settings are remembered by the wizard and provided as default settings for the next session. The Workflow Wizard was a very large improvement in terms of usability for our users and has improved the acceptance of the iPortal. We are continuously extending the wizard with new workflows and are also planning new types of workflows.

    The Workflow Wizard addresses requirement 2), i.e. the request for more customization possibilities in the workflow. In step 4, especially with the ability to store custom parameter sets, users can adjust every parameter of the workflow. Together with the new monitoring portlet described below, the very important usability requirement 4) has also been addressed to a large degree.
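Saving named parameter sets, as described in step 4, can be sketched as a small JSON-backed store. The class and file layout here are illustrative assumptions, not the portal's actual persistence code.

```python
import json
import os

class ParameterStore:
    """Persist named workflow parameter sets to a JSON file."""

    def __init__(self, path):
        self.path = path

    def _load(self):
        if os.path.exists(self.path):
            with open(self.path) as fh:
                return json.load(fh)
        return {}

    def save_set(self, name, params):
        """Store params under name, overwriting an existing set of the same name."""
        sets = self._load()
        sets[name] = params
        with open(self.path, "w") as fh:
            json.dump(sets, fh, indent=2)

    def get_set(self, name):
        return self._load()[name]

    def list_sets(self):
        return sorted(self._load())
```

Keeping the sets in a shared location is what would allow parameterizations to be shared among users, as the wizard permits.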
D. Improved Monitoring

    The workflow monitoring page as provided by WS-PGRADE has been perceived as very overloaded by many of the proteomics portal users, and we have also experienced first-hand that many users simply did not find the information they were looking for. We have therefore decided to make use of the gUSE ASM interface again to provide a more intuitive view of the current state of the user’s workflows by implementing a monitoring portlet. It can be accessed through another main tab on the portal at any time. In the initial view, the users see a list of their workflows in a simple table, color-coded according to whether they are running, have completed successfully or have aborted with an error. Clicking on one of the workflow lines opens a second table below the first one, where the users get a detailed list of all job types (workflow nodes) in the given workflow, with an indication of success or failure at this level. Again, the user can click on one of these items to get access to the detailed logs of that particular node of the workflow. Usually one is most interested in the node that shows warnings or errors. The detailed log view opens three new panes on the page, displaying the standard output, the standard error and the gUSE logs of the job. There are only very rare cases in which this view is not sufficient to understand why a certain failure has occurred.

    However, as mentioned in the requirements, we realized early on that many users cannot extract the necessary information from the logs to understand why a particular workflow has failed, simply because they do not know what to look for. Very often a user needs a supporter to help them browse the logs and understand the root cause of the problem. The reason for a failed workflow is often just a random cluster node failure or data access issue, or a job that ran out of time or memory for some reason unrelated to the job itself. More rarely we see wrong parameterizations, erroneous datasets or wrong input data selections. In the case of cluster failures, which are very common in such environments, we have started to build in automatic resubmissions and retries. Another addition we made is a ‘monitoring administrator’ role, which can be assigned to supporters. With this role, a supporter monitors not only his own workflows, but the workflows of all users in the portal. If the workflows of a user fail, a monitoring administrator can check the logs of that user within his own monitoring view and take action to rescue the failed workflows directly. This way the users often do not even realize that something went wrong, and the supporters do not need to spend a lot of time trying to understand the issues indirectly over email. For failures that involve user error (like selecting wrong parameters) the users are of course contacted directly. This mode of operation is much appreciated by the user community and allows for a tight interaction between the supporters and the users.
E. Workflow Node Wrapping                                             This seems like additional work at first, but it is worth the
                                                                      effort as it provides us with traceability as well as with unified
    The proteomics workflows are making use of many open
                                                                      validation and error handling and a better possibility for
source tools. All of them have different ways of managing
                                                                      testing.
input and output files and they also are not standardized on
how they report successful processing or errors. Some return          F. BioDB
with a nonzero error code, others write messages into standard            For the proteomics search workflows, reference data has to
output, standard error, or into specific log files. We have to        be available for the identification algorithms. The reference
parse these outputs, validate them for correctness and scan for       datasets are usually a subset of the publicly available
errors. Additionally, at each step there are output files in a        UniProt/SwissProt database, but there are also other custom
workflow node that usually serve as input to the next node.           reference datasets. These data need to be made available to the
Often these need some conversion and validation into the              search workflow jobs on the execution machines, but they are
suitable format for the next step.                                    quite large and therefore it makes sense to pre-stage them to a
    We can simplify the error parsing and error processing as         well-defined location for read-only access.
well as the complexity of the workflows by wrapping each                  The reference data is usually ‘enriched’ with special
executable serving as a workflow node with a python script            proteins of interest to the given experiment, with contaminants
that standardizes their behavior. Our wrapper, called applicake,      (like keratin, often found in human skin or hair that is often
implements the following new behavior for the workflow                found as contamination in the observed sample) and with so-
nodes:                                                                called decoy proteins, ie. protein sequences that do not exist in
   •       Each node has only one input file and only one             nature, which are later used to estimate false discovery rates
           output file. The output of the previous node serves        [21]. There are several algorithms available to build decoy
           as the single input file of the next node.                 proteins, one of the popular ones is to simply reverse the order
                                                                      of amino acids the sequence of an existing protein, as this will
   •       These files contain only metadata, ie. a set of key-       provide decoys with identical overall mass to real proteins.
           value pairs, describing the properties of the previous
           and current workflow nodes, their actual input                 In the case of UniProt, there is also a new release every
           parameters and data files. Currently, this is              month that needs to be downloaded from the UniProt
           implemented as a MS-Windows-like ‘ini’ properties          Knowledgebase server. We need to keep track of the different
           file, containing the relevant key value pairs grouped      versions of the UniProt datasets to be able to reproduce
           into sections.                                             previous results, or for larger experimental projects that would
                                                                      be too expensive to re-search every time a new UniProt version
   •       The nodes only extend the previous ini file, they do       is available.
           not remove data, leaving a trace of the whole
           workflow process.                                              Since every search workflow is run on different samples,
                                                                      very often specific reference datasets need to be constructed for
   •       The nodes all validate their input and output and are      the corresponding organisms, contaminants and the most
           exiting with standardized error codes or warnings.         suitable decoy algorithm. So we end up with a large number of
                                                                      flavors for the UniProt knowledgebase. We have realized early
                                                                      on that we need to provide an easy-to-use mechanism for users
to select their reference dataset in their workflow. Currently this is
implemented as a drop-down list in step 4 of the Workflow Wizard.
    We provide an automated tool we call BioDB to regularly download the
UniProt Knowledgebase, to version it based on the date of the download, and
to enrich it with a default set of contaminants and decoys, splitting it by
the organisms most commonly used in the iPortal.
    BioDB has four components: a download agent that fetches the original
data from the public data providers, a publication agent that makes the
enriched dataset available, a subscriber agent that downloads and installs
the data on the local resource, and finally a central registry that keeps
track of all publishers and subscribers. We run a BioDB subscriber agent on
the ETH Brutus cluster to ensure that all datasets are available in a
well-known location on the cluster scratch filesystem, so that all search
jobs can access the right reference dataset based on the user’s selection
(which is kept in the ini file provided through applicake). Users can also
run a provider on their own custom dataset and register it in the system.
This design ensures that BioDB is an independent module that simply makes
the necessary datasets available at the right resource without any
intervention from the user. With BioDB we address requirement 6).

G. Workflow Development using Ruffus
    The final requirement was to allow workflow developers to run workflows
in a pluggable manner, also outside of the portal. By making use of
applicake and the Python Ruffus package [9], workflow developers can quickly
test their workflows locally or directly on the cluster. Ruffus is a
lightweight workflow library that can deal with dependencies and parallelism
and also provides some error handling. Once a workflow has been sufficiently
tested with Ruffus, it is very straightforward to build a gUSE workflow from
it. In fact, the developer of the original Ruffus workflow can usually hand
the code over to a gUSE expert, who has no difficulty turning it into a
proper gUSE workflow. Ruffus is also very useful for automated regression
testing of the applicake nodes.
    With Ruffus we address the final requirement 7) to a sufficient degree.
Ruffus cannot be used for more complex workflows, but it is very adequate
for testing new ideas and quick prototyping, as well as for automated
testing.
    We operate three portals: the production iPortal, a development portal
and a testing portal. New workflows, functionality and capabilities can be
easily installed and operated on the development and testing portals.
End-users can also log in and make use of new functionality when the
developers work together with them to build workflows for new projects. Once
development has finished and been sufficiently tested, the new items can be
deployed on the production server.

               V.    SUMMARY AND FUTURE WORK
    The new iPortal addresses several requirements that we have collected by
interacting with the users, the workflow developers and also the experts
operating the initial Swiss Proteomics Portal. By moving to a modern
technology (Liferay, gUSE and WS-PGRADE, from GridSphere and P-GRADE), we
could make use of a modular architecture to improve existing components and
interfaces and to implement several new parts at all layers of the portal.
    For the end-user we have created a workflow wizard where the user can
select from several predefined workflows, with the capability to customize
all parameters of the workflow in detail, and to store and share parameter
sets. For the supporters and portal administrators it was essential to see
the monitoring information of all users, so that they can quickly understand
and fix problems even before the users themselves realize them. For this, we
have also introduced a node wrapper framework to homogenize node
input-output management, to validate results and to unify error messages.
Now the users receive better support and more meaningful error messages.
    Finally, for the developers of new algorithms and workflows, we have
provided new ways for quick prototyping and made the porting of workflows to
the production portal straightforward.
    For future work, we can improve further on all aspects mentioned above.
In terms of security, we need to find new ways to enable federated identity
management frameworks, as we currently still use certificates to access
distributed grid resources. We also want to be able to access public cloud
infrastructures through the gUSE DCI-Bridge interfaces provided through the
SCI-BUS project. In the Wizard, we are continuously improving intuitiveness
based on user feedback, changing the look and feel of the wizard in the
process.
    In BioDB, we need to improve the management of the personal datasets
that vary from user to user, as the list of BioDB databases has now grown
too long. In the applicake framework we are looking to replace the ini files
with the common tool description (CTD) format developed by the OpenMS team
[16]. This would also enable us to use KNIME [17] instead of Ruffus for
workflow development and testing.

                        VI.   RELATED WORK
    The Swiss Grid Proteomics Portal was itself based on the more
experimental Swiss Protein Identification Toolbox swissPIT [12,13]. For
proteomics analysis, there are several commercial (like Mascot [14]) and
open source (like Corra [15]) resources available where users can upload and
process data. Our portal differs from these in that it allows for more
automation and parallelism, with the ability to process much more data
simultaneously, since much less user interaction is involved. We also keep
track of how the analysis was conducted, keeping all parameters and settings
of the workflows for further reference.
    In the SCI-BUS project, many portals are being built or extended using
the same technology we use, like the MoSGrid molecular life science portal
[18]. Another very popular and easy-to-use gateway is Galaxy [19], which can
easily be extended with bioinformatics tools but is also not geared towards
large-scale analysis. Galaxy, however, already has a cloud binding called
CloudMan [20] that allows the use of Amazon for processing certain
workloads.
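To make the node-wrapping scheme of Section E concrete, the following sketch shows how a wrapper of this kind could read, validate and extend a section-based ini file using Python’s standard configparser. This is a simplified illustration under our own naming, not the actual applicake API: the function, the exit codes, and the rule that every node must record its data file under an ‘output’ key are all hypothetical.

```python
import configparser

# Hypothetical standardized return codes; the real applicake codes may differ.
EXIT_OK = 0
EXIT_VALIDATION_FAILED = 1


def run_node(node_name, ini_in_path, ini_out_path, parameters):
    """Wrap one workflow node: read the single input ini file, validate
    the sections written by previous nodes, append this node's own
    section and write the single output ini file.

    Earlier sections are only extended, never removed, so the output
    file carries a trace of the whole workflow up to this node."""
    config = configparser.ConfigParser()
    config.read(ini_in_path)

    # Illustrative validation rule: every previous node must have
    # recorded the data file it produced under an 'output' key.
    for section in config.sections():
        if "output" not in config[section]:
            return EXIT_VALIDATION_FAILED

    # Extend the ini file with this node's parameters (all values are
    # plain strings, matching the key-value-pairs-in-sections format).
    config[node_name] = dict(parameters)

    with open(ini_out_path, "w") as handle:
        config.write(handle)
    return EXIT_OK
```

Because each node only appends its own section, the ini file consumed by the last node carries the parameters of every preceding step, which is what makes the workflow trace reproducible.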
                              REFERENCES
[1]  P. Kunszt, L. Espona Pernas, A. Quandt, E. Schmid, E. Hunt and L.
     Malmström, “The Swiss Grid Proteomics Portal”, Proceedings of the
     Second International Conference on Parallel, Distributed, Grid and
     Cloud Computing for Engineering, P. Iványi and B.H.V. Topping
     (Editors), Civil-Comp Press, Stirlingshire, Scotland, 2011.
[2]  P. Kacsuk, G. Sipos, “Multi-Grid, Multi-User Workflows in the P-GRADE
     Grid Portal”, Journal of Grid Computing, 3(3-4): 221–238, 2005.
[3]  A. Bauch, I. Adamczyk, P. Buczek, F.-J. Elmer, K. Enimanev, P.
     Glyzewski, M. Kohler, T. Pylak, A. Quandt, C. Ramakrishnan, C. Beisel,
     L. Malmström, R. Aebersold, B. Rinn, “openBIS: a flexible framework
     for managing and analyzing complex data in biology research”, BMC
     Bioinformatics, 12(1): 468, 2011.
[4]  R. Murri, P. Kunszt, S. Maffioletti, V. Tschopp, “GridCertLib: A
     Single Sign-on Solution for Grid Web Applications and Portals”,
     Journal of Grid Computing, 9(4): 441–453, December 2011.
[5]  M.A. Steinemann, C. Graf, T. Braun, M. Sutter, “Realization of a
     Vision: Authentication and Authorization Infrastructure for the Swiss
     Higher Education Community”, Educause, 2003.
[6]  A. Keller, A.I. Nesvizhskii, E. Kolker, R. Aebersold, “Empirical
     Statistical Model To Estimate the Accuracy of Peptide Identifications
     Made by MS/MS and Database Search”, Analytical Chemistry, 74(20):
     5383–5392, 2002.
[7]  A.I. Nesvizhskii, A. Keller, E. Kolker, R. Aebersold, “A Statistical
     Model for Identifying Proteins by Tandem Mass Spectrometry”,
     Analytical Chemistry, 75(17): 4646–4658, 2003.
[8]  The applicake software, http://sybit.net/software/1344307-applicake
[9]  L. Goodstadt, “Ruffus: a lightweight Python library for computational
     pipelines”, Bioinformatics, 26(21): 2778–2779, 2010.
[10] P. Kacsuk, K. Karoczkai, G. Hermann, G. Sipos, J. Kovacs, “WS-PGRADE:
     Supporting parameter sweep applications in workflows”, Workflows in
     Support of Large-Scale Science, WORKS 2008, pp. 1–10, 17 Nov. 2008.
[11] P. Kacsuk, “P-GRADE portal family for grid infrastructures”,
     Concurrency and Computation: Practice and Experience, Special Issue:
     IWPLS 2009, 23(3): 235–245, 10 March 2011.
[12] A. Quandt, P. Hernandez, P. Kunszt, C. Pautasso, M. Tuloup, C.
     Hernandez, R.D. Appel, “Grid-based analysis of tandem mass
     spectrometry data in clinical proteomics”, Stud Health Technol Inform,
     126: 13–22, 2007.
[13] A. Quandt, A. Masselot, P. Hernandez, C. Hernandez, S. Maffioletti,
     R.D. Appel, F. Lisacek, “swissPIT: A workflow-based platform for
     analyzing tandem-MS spectra using the Grid”, Proteomics, 9(10):
     2648–2655, May 2009.
[14] D.N. Perkins, D.J.C. Pappin, D.M. Creasy, J.S. Cottrell, “Probability-
     based protein identification by searching sequence databases using
     mass spectrometry data”, Electrophoresis, 20(18): 3551–3567, 1999.
[15] M.Y. Brusniak, B. Bodenmiller, D. Campbell, K. Cooke, J. Eddes, A.
     Garbutt, H. Lau, S. Letarte, L. Mueller, V. Sharma, O. Vitek, N.
     Zhang, R. Aebersold, J. Watts, “Corra: Computational framework and
     tools for LC-MS discovery and targeted mass spectrometry-based
     proteomics”, BMC Bioinformatics, 9(1): 542, 2008.
[16] O. Kohlbacher, K. Reinert, “OpenMS and TOPP: Open Source Software for
     LC-MS Data Analysis”, in Proteome Bioinformatics, Volume 604 of
     Methods in Molecular Biology, Chapter 14, pp. 201–211, 2010.
[17] M.R. Berthold, N. Cebron, F. Dill, T.R. Gabriel, T. Kötter, T. Meinl,
     P. Ohl, K. Thiel, B. Wiswedel, “KNIME - the Konstanz information
     miner: version 2.0 and beyond”, SIGKDD Explorations, 11(1): 26–31,
     2009.
[18] M. Wewior, L. Packschies, D. Blunk, D. Wickeroth, K.D. Warzecha, S.
     Herres-Pawlis, U. Lang, et al., “The MoSGrid Gaussian Portlet -
     Technologies for the Implementation of Portlets for Molecular
     Simulations”, in Proceedings of the International Workshop on Science
     Gateways (IWSG10), pp. 39–43.
[19] B. Giardine, C. Riemer, R.C. Hardison, R. Burhans, L. Elnitski, P.
     Shah, A. Nekrutenko, “Galaxy: a platform for interactive large-scale
     genome analysis”, Genome Research, 15(10): 1451–1455, 2005.
[20] E. Afgan, D. Baker, N. Coraor, B. Chapman, A. Nekrutenko, J. Taylor,
     “Galaxy CloudMan: delivering cloud compute clusters”, BMC
     Bioinformatics, 11(Suppl 12): S4, 2010.
[21] A. Keller, A.I. Nesvizhskii, E. Kolker, R. Aebersold, “Empirical
     statistical model to estimate the accuracy of peptide identifications
     made by MS/MS and database search”, Analytical Chemistry, 74(20):
     5383–5392, 2002.