=Paper=
{{Paper
|id=None
|storemode=property
|title=Improving the Swiss Grid Proteomics Portal: Requirements and new Features based on Experience and Usability Considerations
|pdfUrl=https://ceur-ws.org/Vol-993/paper18.pdf
|volume=Vol-993
|dblpUrl=https://dblp.org/rec/conf/iwsg/KunsztBHSSWRERQM13
}}
==Improving the Swiss Grid Proteomics Portal: Requirements and new Features based on Experience and Usability Considerations==
Improving the Swiss Grid Proteomics Portal: Requirements and new Features based on Experience and Usability Considerations

Peter Kunszt, Lorenz Blum, Béla Hullár, Emanuel Schmid, Adam Srebniak, Witold Wolski (SystemsX.ch SyBIT, ETH Zurich, Zurich, Switzerland); Bernd Rinn, Franz-Josef Elmer, Chandrasekhar Ramakrishnan (Center of Information Sciences and Databases, ETH Zurich, Department of Biosystems Science and Engineering, and Swiss Institute of Bioinformatics, Basel, Switzerland); Andreas Quandt, Lars Malmström (Institute for Molecular Systems Biology, ETH Zurich, Department of Biology, and Swiss Institute of Bioinformatics, Zurich, Switzerland)

Abstract—We have received feedback from our users and supporters on the functionality and usability of the Swiss Grid Proteomics Portal during its first year of operation. We have also realized which aspects of the portal could be improved upon through frequent monitoring and interaction with the production system under heavy use. In a second, highly upgraded version of the Swiss Proteomics Portal, called iPortal, we have introduced several new concepts based on this feedback and on both user and supporter experience. In this paper we detail the requirements and the improvements we have made, and also give an outlook on possible future improvements.

Keywords—gateway; ease of use; portal; proteomics

(This work was supported by SystemsX.ch, the Swiss National Initiative for Systems Biology, and by the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 283481, the SCI-BUS project.)

I. INTRODUCTION

Recent advances in observational technologies have turned the Life Sciences into a data-intensive science. Microscopy imaging, mass spectrometry, gene sequencing and other technologies are available at a relatively low cost to research labs, turning many labs into large data producers. The precision with which biological processes can be observed today provides researchers with a very large amount of complex information, which has to be analyzed, processed and understood. The relatively new field of Systems Biology aims to integrate and model several scales of observational data of a given biological system, which can be a cellular organism like yeast or an organ of a larger organism like the wing of the fruit fly. The system is analyzed as a whole and models are built to understand its behavior. Due to the many layers of complexity involved, researchers are in need of specialized assistance to deal with the complexity of the digital infrastructure. The SyBIT project of SystemsX.ch, the Swiss National Initiative in Systems Biology, has been set up to provide this support to all research projects in the SystemsX.ch initiative, which involves over 200 research labs in Switzerland. With the additional SystemsX.ch funding, many new instruments were provisioned at the participating institutions. Several projects are producing raw observational data on a large scale, on the order of terabytes per instrument per month, per week or in some cases daily. SyBIT collects the requirements on data processing and works with the local and central resource providers to make sure that the necessary infrastructure is available for data storage and data processing for all projects. SyBIT also provides and supports middleware to manage and catalog the large amounts of collected data. Finally, SyBIT maintains and supports a whole toolbox of software to make sure that all projects' needs are met. Most of the middleware and tools are already well-established community standard tools and libraries, to which SyBIT contributes wherever needed, improving the functionality through software engineering to the benefit of these communities. SyBIT also provides training to the research groups in the usage of these tools and the integrated research infrastructure.

In a previous publication we have already described the Swiss Grid Proteomics Portal [1], which aims to provide an easy-to-use but powerful portal for standardized proteomics data analysis. In this paper we elaborate on the experiences of operating the portal, leading to new requirements and the implementation of new features. Many of the considerations that led to better usability may be relevant to similar efforts, and we summarize our best practices for sustained operation of the improved proteomics portal, which we now call iPortal. The name was inspired by the ease of use of Apple's products: we also want to give our users a fun experience and a self-explanatory portal interface.
II. ADDRESSING REQUIREMENTS IN PROTEOMICS

In several SystemsX.ch projects, proteomics data needs to be collected as part of the overall systems biology analysis. Mass spectrometry is used to identify and quantify the protein content of a given biological sample. The analysis of the raw data collected by the mass spectrometers is a research domain of its own, and there is a very large number of community tools available to reconstruct the protein content from the mass spectra generated by the instruments. The complexity of proteomics data analysis is large, as the analysis itself depends strongly on the sample being observed and on the biological question at hand. Until recently, all analysis of mass spectrometry data was done through a series of manual steps, making use of individual command-line tools with different parameters for the heterogeneous data analysis. First, the raw data formats produced by the various instruments need to be converted to open standard formats. Then they are analyzed using one or several of the available community tools. Often these tools were produced as parts of a research project, with poor adherence to standard data formats, so a lot of data transformation is involved to assure that the output of one tool can serve as the input of another. Scripts and specialized workflows have been built by the researchers and bioinformaticians in proteomics to automate some of their steps, but these were often not kept track of, or had hardcoded elements for specific environments, people or projects, and were not really intended or suitable to be reused by others.

In SyBIT we needed to address several issues to enable reusable, traceable proteomics analysis workflows for large amounts of proteomics data. First, we had to make sure that all raw data is well tracked and annotated for future reference. Data needs to be searchable based on criteria like project, observer, timestamp, biological context and other user-defined parameters. We also had to make sure that the raw data is stored such that it can be retrieved easily for analysis and also for future re-analysis. For traceable, large-scale data management we are using the open biology information system openBIS [3], which we are continuously improving to support our communities. Second, we needed to find a way to process the data using the various analysis tools available in the community. We need to be able to adapt and change the data processing pipelines while keeping track of the steps and parameters involved, to assure that the results are traceable and reproducible. For this purpose we have built the Swiss Grid Proteomics Portal [1], based on the P-GRADE grid portal system [2]. This first portal was put into production in 2010.
III. EXPERIENCE WITH THE FIRST PROTEOMICS PORTAL AND NEW REQUIREMENTS

We have made several observations and collected feedback from the users and the resource operators of the Swiss Proteomics Portal by interacting with the users, either in direct personal discussions or through email. We have presented the portal at several internal seminars and we have provided training in its usage. We have also collected all the requirements in our bug tracking system and have evaluated their relevance regularly in user meetings. The observations are as follows:

1) Resource providers, in our case the local HPC cluster managers, were not happy with the portal being operated under a single user name, as they were not able to distinguish who makes use of their resources. New users are required to sign the cluster usage policies and rules, and this did not occur through the portal.

2) The existing workflows did not offer enough customizability for the end user. Also, setting parameters was not straightforward.

3) Error tracing through the individual workflow steps was extremely difficult.

4) Running and rerunning the workflows was not intuitive for the user, and it was not straightforward for an administrator to see what went wrong in order to help the user out in a short timeframe. Debugging took too much effort.

5) We also tried to use Grid Certificates, but too many users were not able to make use of them on their own.

6) There is a very large heterogeneity in the input datasets needed for the protein identification workflows. It was very cumbersome and error-prone to select different input datasets.

7) Developers of completely new algorithms also need access to a portal-like infrastructure. This kind of easy deployment of high-turnaround custom workflows was not enabled by the first portal.

In the context of the EU FP7 SCI-BUS project we have upgraded the portal to next-generation technology using the Liferay portal technology and the gUSE/WS-PGRADE workflow managers [10,11]. By changing to a more modern, modular technology we could now start to address the issues observed by the users of the first portal, extending and improving the portal. We could formulate the following high-level requirements based on the feedback above:

1) Authentication and authorization: each user needs to use their own cluster account, to be requested and signed for separately, adhering to the local usage policy.

2) The individual workflows need to be configurable to a much higher degree, giving many more possibilities to customize the workflow.

3) Input and output management from and to the individual workflow nodes needs to be managed at a lower level. Error reporting and logging need to be standardized for the workflow system to be able to cope with the various failure modes.

4) The usability and intuitiveness of the portal need to be heavily improved. Researchers not intimately involved in data processing algorithms need to be guided through the process of selecting and configuring a workflow, associating their data with it and retrieving and registering their results. Workflows need to be categorized by research topic. Also, monitoring of running workflows needs to be simplified for the end-user, but an administrator should be able to dig down into the relevant logs in case of failures. Administrators need to see the logs of all running workflows to be able to invoke procedures to rescue failed jobs, also through the portal.

5) If Grid Certificates are used, they should be invisible to the user.

6) UniProt/SwissProt reference datasets used in protein identification workflows often need to be extended with specific proteins being searched for in a given experiment. These reference datasets need to be easily managed and selected for identification workflows by the user.

7) Workflow algorithm developers need either a pluggable architecture to modify existing workflow nodes easily or a mechanism to submit workflows outside of the interactive portal.
IV. IMPLEMENTATION OF THE REQUIREMENTS

For all of the requirements above we have implemented modules or extensions to the Proteomics Portal. We call the new Liferay/gUSE/WS-PGRADE-based portal, including all these extensions, iPortal, to distinguish it from the previous P-GRADE based Swiss Grid Proteomics Portal versions.

A. Integrating with Cluster Authorization

The first requirement is to make use of individual user accounts on the local cluster. Having individual user accounts was a requirement of the cluster operators at ETH Zurich. Each user has to request an account on the local Brutus cluster through the usual means, by filling out a web form and agreeing to the terms of use. We have built a portlet into iPortal that comes into play whenever a new user is registered. This portlet is activated at the first-time log-in of a new user.

For each user, we create a new openssl secure public-private keypair. We store the private key in the secure portal database. At the first-time login, the user is asked to log into the cluster (a popup window asks him or her to enter their cluster username and password), and a session is established to their cluster user account. The portal copies the new public key into their ssh directory as a new authorized key. Also, a new configuration file is added that will be sourced whenever the portal is submitting jobs on the user's behalf by making use of their account. From then on, the interaction with the cluster always goes through the individual user's account.

Of course, iPortal users need to apply for and receive a cluster account before being able to register for the iPortal. The popup window requesting their cluster account credentials informs them of this fact and provides the link to the cluster registration page. Through this mechanism we can completely fulfill the requirement for cluster registration and for running cluster jobs under the individual user accounts. We are in close collaboration with the cluster administrators to make sure that our security mechanisms are trusted and adhere to their policies.
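As an illustration of this first-login key provisioning, the following minimal sketch uses ssh-keygen and the paramiko SSH library. It is our own reading of the mechanism described above, not the portal's actual code; the host name, key parameters and function names are assumptions.

<pre>
# Sketch of the first-login key provisioning described above.
# Assumptions (not the portal's actual code): paramiko for SSH,
# ssh-keygen for key creation; host name and paths are illustrative.
import os
import subprocess
import tempfile

import paramiko

CLUSTER_HOST = "brutus.ethz.ch"  # illustrative host name

def provision_cluster_key(username: str, password: str) -> bytes:
    """Create a keypair, install the public key on the cluster, and
    return the private key for storage in the secure portal database."""
    tmpdir = tempfile.mkdtemp()
    keyfile = os.path.join(tmpdir, "id_rsa")
    # Generate an RSA keypair without a passphrase.
    subprocess.run(
        ["ssh-keygen", "-q", "-t", "rsa", "-b", "2048",
         "-N", "", "-f", keyfile],
        check=True,
    )
    with open(keyfile + ".pub") as f:
        pubkey = f.read().strip()

    # Log in once with the user's own cluster credentials...
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(CLUSTER_HOST, username=username, password=password)
    # ...and register the portal's public key as an authorized key.
    client.exec_command(
        "mkdir -p ~/.ssh && chmod 700 ~/.ssh && "
        f"echo '{pubkey}' >> ~/.ssh/authorized_keys && "
        "chmod 600 ~/.ssh/authorized_keys"
    )
    client.close()

    with open(keyfile, "rb") as f:
        return f.read()  # to be stored in the secure portal database
</pre>

The private key returned here would then be used for all subsequent job submissions on the user's behalf, so the user's cluster password never needs to be stored.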
B. Using Grid Certificates

We have implemented a mechanism to make use of Grid Certificates based on SAML assertions in a previous project called GridCertLib [4]. It can create a proper X.509 certificate based on the user's AAI login, which is available to all researchers in Switzerland through the SWITCH-AAI countrywide service [5]. This would fulfill requirement 5), that the users do not need to deal with Grid Certificates when submitting to distributed infrastructures.

However, the GridCertLib implementation relies on the delegation feature of Shibboleth, which is a new feature and not yet available on the current infrastructure. All SWITCH-AAI enabled institutions would need to upgrade their identity provider service and would need to configure this service accordingly. This has proven to be an insurmountable administrative hurdle for the past two years, unfortunately. As elegant as this solution is, it is not usable in practice. In the production Swiss Proteomics Portal, the users are therefore still expected to upload their proxy certificates to a myproxy server outside of the portal if they want to make use of Grid resources.

We are now exploring other technologies that could be used also in the context of cloud infrastructures, but they are not mature enough yet. So this requirement is, unfortunately, still not met.
C. The Workflow Wizard

The main change to the way the Swiss Grid Portal is now perceived by the end-user was the introduction of the Workflow Wizard in the iPortal. It has been implemented as another Liferay portlet. The users can select the Workflow Wizard as one of the top-level tabs on the main page of the portal. The Workflow Wizard guides the user through a series of steps:

1) Workflow type selection. The first step is to select the type of workflow that the user wants to run. Currently there are two workflow types, search and quantification. The search workflows implement several flavors of the Trans-Proteomic Pipeline [6,7], which is used to identify the peptides and proteins in the raw data as received from the mass spectrometers in a proteomics experiment. The quantification workflows make use of the result of a search workflow and of additional information based on historical reference data, to quantify the abundance of the proteins with respect to one another in a given sample. Here, too, there are several workflows that can be selected based on the type of experiment and the type of sample being analyzed. (Figure: the first step of the Workflow Wizard as shown in the iPortal.)

2) The second step selects the actual workflow of the given type. Each workflow comes with a name and a one-paragraph description, with a link to further information and a detailed workflow description in the project wiki pages. All of these workflows have been created and tested by workflow developers using the gUSE workflow editor. The regular users do not need to interact with this editor anymore but are given the choice among many predefined workflows. The user can choose one of the workflows by clicking on the corresponding 'choose' button.

3) The third step is the selection of the data that needs to be analyzed with the workflow. Depending on which workflow was chosen, the user is presented, in a table format, with the right type of data that he or she has access to. This list is generated on the fly by submitting a query to the openBIS information system [3], where all the data are indexed and annotated with the relevant metadata. Users only see their own datasets or data they have been given access to by others.

We decided to use openBIS as the data management hub for all of the proteomics data already for the first version of the Swiss Proteomics Portal. Data that is generated at the mass spectrometers is automatically 'uploaded' into openBIS: a monitoring process checks the contents of the directory into which new datasets are written, scans these for automatically available metadata, registers the new data in the openBIS database and moves the data files to the central data store, from where they can be made available to the data consumers. These can be processing steps that are part of a workflow making use of the rich openBIS API, or the data can be accessed interactively by the user over the web interface or directly through the APIs, for example from Matlab. The data registration is highly configurable: a python script can be customized to extract the available metadata from the raw data and to register it in openBIS (a sketch of such a registration script is given after this list). For the researchers it is very convenient to see their newest mass spectrometry run already cataloged and available soon after their sample has been processed by the instrument. Users interact with openBIS through its web interface, where they can browse and access the data. We have consciously decided to keep the data management interface separate from the processing interface of iPortal, in order not to overload either one. Depending on what kind of workflow was chosen, the data selection can be a two-step process. In the case of quantification workflows, for example, first an experiment context needs to be chosen, then the datasets from within that experiment can be specified which should be analyzed with the given workflow. (Figure: the data selection step.)

4) The fourth step is the parameterization of the workflow. Each workflow comes with a number of predefined parameterizations, provided by the workflow developers, that are suitable for various workflow usage scenarios. The users can create their own parameterization by un-hiding the detailed parameter settings. They can then save their own parameter sets under a new name, which can be used for future workflow parameterizations. These parameterizations may also be shared among users. For search workflows, the parameterization step also includes the selection of the input database (see the BioDB section below).

5) Finally, the workflow is ready for submission. The user receives an overview of the workflow to quickly check that everything is in order or whether changes need to be made, in which case the 'back to previous step' link can be used to go back to the corresponding step to change the settings. If the workflow and its configuration are found to be correct, the workflow can be submitted by clicking on the 'submit' button. The wizard makes use of the gUSE Application Specific Module (ASM) interface to select and execute the predefined workflows through gUSE.

Once a workflow has been successfully submitted, the wizard asks the user whether another workflow should be created with settings identical to those of the current workflow. This was one of the requests we received from the users: often they want to submit the same workflow several times but with different datasets, and this helps them to do so more quickly, as they only need to select the dataset; all the other settings are remembered by the wizard and are provided as default settings for the next session.

The Workflow Wizard was a very large improvement in terms of usability for our users, and has improved the acceptance of the iPortal. We are continuously extending the wizard with new workflows and are also planning new types of workflows. The Workflow Wizard addresses requirement 2), i.e. the request for more customization possibilities in the workflow. In step 4, especially with the ability to store custom parameter sets, users can adjust every parameter of the workflow. Together with the new monitoring portlet described below, the very important usability requirement 4) has been addressed as well to a large degree.
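To make the automatic data registration of step 3 concrete, the following is a minimal sketch of a customized registration script in the style of the openBIS dropbox API (which is Python/Jython based). The call names follow our recollection of that API and the metadata extraction is invented for this example; treat both as indicative rather than as the actual iPortal code.

<pre>
# Simplified sketch of a customized dropbox registration script, in the
# style of the openBIS (Jython) dropbox API; exact call names depend on
# the openBIS version, so treat them as indicative rather than exact.
import os

def extract_metadata(path):
    """Illustrative metadata extraction: derive instrument and date
    from a file name such as 'QEXACTIVE_20130115_sampleA.RAW'."""
    name = os.path.basename(path)
    instrument, date = name.split("_")[:2]
    return {"INSTRUMENT": instrument, "ACQUISITION_DATE": date}

def process(transaction):
    # Called by the openBIS monitoring process for every new item that
    # appears in the incoming directory.
    incoming = transaction.getIncoming().getAbsolutePath()
    metadata = extract_metadata(incoming)

    dataset = transaction.createNewDataSet()
    dataset.setDataSetType("RAW_DATA")  # illustrative dataset type
    for key, value in metadata.items():
        dataset.setPropertyValue(key, value)

    # Move the raw file into the central data store managed by openBIS.
    transaction.moveFile(incoming, dataset)
</pre>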
D. Improved Monitoring

The workflow monitoring page as provided by WS-PGRADE has been perceived as very overloaded by many of the proteomics portal users, and we have also experienced first-hand that many users simply did not find the information they were looking for. We have therefore decided to make use of the gUSE ASM interface again to provide a more intuitive view of the current state of the user's workflows by implementing a monitoring portlet. It can be accessed through another main tab on the portal at any time. In the initial view, the users see a list of their workflows in a simple table, color-coded according to whether they are running, completed successfully or aborted with an error. Clicking on one of the workflow lines will open a second table below the first one, where the users get a detailed list of all job types (workflow nodes) in the given workflow, with an indication of success or failure at this level. Again, the user can click on one of these items to get access to the detailed logs of that particular node of the workflow. Usually one is most interested in the node that shows warnings or errors. The detailed log view opens three new panes on the page, displaying the standard output, the standard error and the gUSE logs of the job. There are only very rare cases in which this view is not sufficient to understand why a certain failure has occurred.

However, as mentioned in the requirements, we realized early on that many users cannot extract the necessary information from the logs to understand why a particular workflow has failed, simply because they do not know what to look for. Very often a user needs a supporter to help them browse the logs and understand the root cause of the problem. The reason for a failed workflow is often just a random cluster node failure or data access issue, or a job that ran out of time or memory for some reason unrelated to the job itself. More rarely we see wrong parametrizations, erroneous datasets or wrong input data selections. In the case of cluster failures, we have started to build in automatic resubmissions and retries, which are very common in such environments. Another addition we made is a 'monitoring administrator' role, which can be assigned to supporters. With such a role, a supporter not only monitors his own workflows, but the workflows of all users in the portal. If workflows of a user fail, a monitoring administrator can check the logs of that user within his own monitoring view and take action to rescue the failed workflows directly. This way the users often do not even realize that something went wrong, and the supporters do not need to spend a lot of time trying to understand the issues indirectly over email. For failures that involve user error (like selecting wrong parameters) the users are of course contacted directly. This mode of operation is much appreciated by the user community, and allows for a tight interaction between the supporters and the users.
E. Workflow Node Wrapping

The proteomics workflows make use of many open source tools. All of them have different ways of managing input and output files, and they are also not standardized in how they report successful processing or errors. Some return with a nonzero error code, others write messages to standard output, standard error, or specific log files. We have to parse these outputs, validate them for correctness and scan for errors. Additionally, at each step there are output files of a workflow node that usually serve as input to the next node. Often these need some conversion and validation into the format suitable for the next step.

We can simplify the error parsing and error processing, as well as the complexity of the workflows, by wrapping each executable serving as a workflow node with a python script that standardizes its behavior. Our wrapper, called applicake, implements the following new behavior for the workflow nodes:

• Each node has only one input file and only one output file. The output of the previous node serves as the single input file of the next node.

• These files contain only metadata, i.e. a set of key-value pairs describing the properties of the previous and current workflow nodes, their actual input parameters and data files. Currently, this is implemented as an MS-Windows-like 'ini' properties file, containing the relevant key-value pairs grouped into sections.

• The nodes only extend the previous ini file, they do not remove data, leaving a trace of the whole workflow process.

• The nodes all validate their input and output and exit with standardized error codes or warnings.

• All messages and errors are written to configurable log files or to standard output and standard error.

• The error messages themselves are standardized and allow for automated error management in the future.

• The ini files can be stored with the result of the workflow, allowing for the complete tracking of the workflow for future reference.

Using the applicake wrappers [8], a gUSE workflow now only needs to be configured with a single input and a single output port, specifying the configuration file of the wrapper. This also simplifies the construction of the workflows in the gUSE workflow editor, and allows for the collection of all relevant messages in the standard output, standard error and gUSE log files as displayed in the monitoring portlet.

With applicake, we address requirement 3) to a large extent. Applicake needs to be extended continuously, as new workflows are being implemented with new node types, i.e. with executables that have not yet been wrapped with a validator and with the ini file managing input-output data. This seems like additional work at first, but it is worth the effort, as it provides us with traceability as well as with unified validation and error handling and a better possibility for testing.
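The following minimal sketch illustrates the node-wrapping pattern described above using only the Python standard library: read the single input ini, run the wrapped tool, validate its outcome, and extend (never truncate) the ini for the next node. Applicake's real class structure is richer; the function and section names here are ours.

<pre>
# Minimal sketch of the node-wrapping pattern: one ini in, one ini out,
# standardized exit codes, and an ever-growing metadata trace.
import configparser
import subprocess
import sys

def run_node(input_ini: str, output_ini: str, section: str, command: list):
    config = configparser.ConfigParser()
    config.read(input_ini)

    # Run the wrapped community tool, capturing stdout/stderr so the
    # monitoring portlet can display them.
    result = subprocess.run(command, capture_output=True, text=True)
    sys.stdout.write(result.stdout)
    sys.stderr.write(result.stderr)

    # Standardized validation: a nonzero return code or an error marker
    # in the output maps to a standardized failure code.
    if result.returncode != 0 or "ERROR" in result.stderr:
        sys.exit(1)  # standardized failure code

    # Extend the ini with this node's keys; previous sections are kept,
    # leaving a trace of the whole workflow process.
    config[section] = {
        "command": " ".join(command),
        "returncode": str(result.returncode),
    }
    with open(output_ini, "w") as f:
        config.write(f)
    sys.exit(0)  # standardized success code
</pre>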
F. BioDB

For the proteomics search workflows, reference data has to be available for the identification algorithms. The reference datasets are usually a subset of the publicly available UniProt/SwissProt database, but there are also other custom reference datasets. These data need to be made available to the search workflow jobs on the execution machines, but they are quite large, and therefore it makes sense to pre-stage them to a well-defined location for read-only access.

The reference data is usually 'enriched' with special proteins of interest to the given experiment, with contaminants (like keratin from human skin or hair, which is often found as contamination in the observed sample) and with so-called decoy proteins, i.e. protein sequences that do not exist in nature, which are later used to estimate false discovery rates [21]. There are several algorithms available to build decoy proteins; one of the popular ones is to simply reverse the order of amino acids in the sequence of an existing protein, as this provides decoys with an overall mass identical to that of real proteins.
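As an illustration of the reversed-sequence decoy strategy, the following sketch appends a reversed decoy entry for each protein in a FASTA reference set. The decoy prefix and the minimal FASTA handling are our assumptions, not the actual BioDB implementation.

<pre>
# Sketch of the reversed-sequence decoy strategy mentioned above: each
# protein in a FASTA reference set gets a companion entry whose
# amino-acid sequence is reversed, keeping the overall mass identical.
def add_reversed_decoys(fasta_in: str, fasta_out: str,
                        prefix: str = "DECOY_"):
    with open(fasta_in) as src, open(fasta_out, "w") as dst:
        header, seq = None, []

        def flush():
            if header is None:
                return
            sequence = "".join(seq)
            dst.write(f"{header}\n{sequence}\n")                     # original
            dst.write(f">{prefix}{header[1:]}\n{sequence[::-1]}\n")  # decoy

        for line in src:
            line = line.rstrip()
            if line.startswith(">"):
                flush()            # write the previous record first
                header, seq = line, []
            else:
                seq.append(line)
        flush()                    # write the final record
</pre>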
In the case of UniProt, there is also a new release every month that needs to be downloaded from the UniProt Knowledgebase server. We need to keep track of the different versions of the UniProt datasets to be able to reproduce previous results, or for larger experimental projects that would be too expensive to re-search every time a new UniProt version becomes available.

Since every search workflow is run on different samples, very often specific reference datasets need to be constructed for the corresponding organisms, contaminants and the most suitable decoy algorithm. So we end up with a large number of flavors of the UniProt Knowledgebase. We realized early on that we need to provide an easy-to-use mechanism for users to select the reference dataset for their workflow. Currently this is implemented as a drop-down list in step 4 of the Workflow Wizard.

We provide an automated tool we call BioDB to regularly download the UniProt Knowledgebase, to provide a versioning based on the date of the download, and to enrich it with a default set of contaminants and decoys, splitting it by the organisms most commonly used in the iPortal.

BioDB has four components: a download agent that fetches the original data from the public data providers, a publication agent that makes the enriched dataset available, a subscriber agent that downloads and installs the data on the local resource, and finally a central registry that keeps track of all publishers and subscribers. We are running a BioDB subscriber agent on the ETH Brutus cluster to assure that all datasets are available on the cluster scratch filesystem in a well-known location, so that all search jobs can just access the right reference dataset based on the user's selection (which is kept in the ini file provided through applicake). The users can run providers on their own custom datasets and register them in the system. Such a design assures that BioDB is an independent module that simply makes the necessary datasets available at the right resource without any intervention from the user. With BioDB we address requirement 6).
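To illustrate how a search job could pick up the pre-staged data, the following sketch resolves the user's reference dataset selection, as recorded in the applicake ini file, to a file below the well-known scratch location. The base path, section and key names are assumptions for this example, not BioDB's actual layout.

<pre>
# Illustrative resolution of a BioDB selection (from the applicake ini)
# to a pre-staged file on the cluster scratch filesystem. The base path
# and key names are assumed for this sketch.
import configparser
import os

BIODB_ROOT = "/cluster/scratch/biodb"  # assumed well-known location

def resolve_reference_dataset(ini_path: str) -> str:
    config = configparser.ConfigParser()
    config.read(ini_path)
    db = config["BIODB"]  # hypothetical section written by the wizard
    path = os.path.join(BIODB_ROOT, db["name"], db["version"],
                        db["organism"] + ".fasta")
    if not os.path.exists(path):
        raise FileNotFoundError(f"reference dataset not staged: {path}")
    return path
</pre>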
G. Workflow Development using Ruffus

The final requirement was to allow workflow developers to run workflows in a pluggable manner, also outside of the portal. By making use of applicake and the Python Ruffus package [9], workflow developers can quickly test their workflows locally or directly on the cluster. Ruffus is a lightweight workflow library that can deal with dependencies and parallelism, and it also provides some error handling. Once a workflow has been sufficiently tested with Ruffus, it is very straightforward to build a gUSE workflow from it. In fact, the developer of the original Ruffus workflow can usually hand over the code to a gUSE expert, who has no difficulty turning it into a proper gUSE workflow. Ruffus is also very useful for automated regression testing of the applicake nodes.

With Ruffus we address the final requirement 7) to a sufficient degree. Ruffus cannot be used for more complex workflows, but it is very adequate for the testing of new ideas and quick prototyping, as well as for automated testing.
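A minimal Ruffus pipeline in the spirit described above could look as follows: two chained steps with file-based dependencies that can be run and tested locally. The conversion and search bodies are placeholders for calls to the real (applicake-wrapped) tools.

<pre>
# Minimal Ruffus pipeline sketch: Ruffus derives the dependency graph
# from the decorators and can run independent tasks in parallel.
from ruffus import transform, suffix, pipeline_run

raw_files = ["sample1.raw", "sample2.raw"]  # assumed local test inputs

@transform(raw_files, suffix(".raw"), ".mzXML")
def convert_raw(input_file, output_file):
    # Placeholder for the raw-to-open-format conversion node.
    with open(output_file, "w") as out:
        out.write(f"converted from {input_file}\n")

@transform(convert_raw, suffix(".mzXML"), ".pep.xml")
def search(input_file, output_file):
    # Placeholder for a search-engine node wrapped with applicake.
    with open(output_file, "w") as out:
        out.write(f"search results for {input_file}\n")

if __name__ == "__main__":
    pipeline_run([search], multiprocess=2)
</pre>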
We are operating three portals: the production iPortal, a development portal and a testing portal. New workflows, new functionality and new capabilities can be easily installed and operated on the development and testing portals. End-users can also log in and make use of new functionality when the developers work together with the end-users to build workflows for new projects. Once development has finished and the new items have been sufficiently tested, they can be deployed on the production server.

V. SUMMARY AND FUTURE WORK

The new iPortal addresses several requirements that we have collected by interacting with the users, the workflow developers and also the experts operating the initial Swiss Proteomics Portal. By moving to a modern technology (Liferay, gUSE and WS-PGRADE, from GridSphere and P-GRADE), we could make use of a modular architecture to improve existing components and interfaces and to implement several new parts at all layers of the portal.

For the end-user we have created a workflow wizard where the user can select from several predefined workflows, but with the capability to customize all parameters of the workflow in detail, and to store and share parameter sets. For the supporters and portal administrators it was essential to see the monitoring information of all users, so that they can quickly understand and fix problems, even before the users themselves realize them. For this, we have also introduced a node wrapper framework to homogenize node input-output management, for result validation and to unify error messages. Now the users receive better support and more meaningful error messages.

Finally, for the developers of new algorithms and workflows, we have also provided new ways for quick prototyping and have made the porting of workflows to the production portal straightforward.

For future work, we can improve further on all aspects mentioned above. In terms of security, we need to find new ways to enable federated identity management frameworks, as currently we are still using certificates to access distributed grid resources. We also want to be able to access public cloud infrastructures through the gUSE DCI-Bridge interfaces as provided through the SCI-BUS project. In the Wizard, we are continuously improving the intuitiveness based on user feedback, changing the look and feel of the wizard in the process.

In BioDB, we need to improve the management of the personal datasets that vary from user to user, as the list of BioDB databases has now grown too long. In the applicake framework we are looking to replace the ini files with the common tool description (CTD) format developed by the OpenMS team [16]. This would also enable us to make use of KNIME [17] instead of Ruffus for workflow development and testing.

VI. RELATED WORK

The Swiss Grid Proteomics Portal was itself based on the more experimental Swiss Protein Identification Toolbox swissPIT [12,13]. For proteomics analysis, there are several commercial (like Mascot [14]) and open source (like Corra [15]) resources available, where users can upload and process data. Our portal differs from these as it allows for more automation and parallelism, with the ability to process much more data simultaneously, since there is much less user interaction involved. We also keep track of how the analysis was conducted, keeping all parameters and settings of the workflows for further reference.

In the SCI-BUS project there are many portals being built or extended using the same technology we use, like the MoSGrid molecular life science portal [18]. Another very popular and easy-to-use gateway is Galaxy [19], which can easily be extended with bioinformatics tools, but is also not geared towards large-scale analysis. Galaxy however already has a cloud binding called CloudMan [20] that allows the usage of Amazon for the processing of certain workloads.

REFERENCES

[1] P. Kunszt, L. Espona Pernas, A. Quandt, E. Schmid, E. Hunt and L. Malmström, "The Swiss Grid Proteomics Portal", Proceedings of the Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, P. Iványi and B.H.V. Topping (Editors), Civil-Comp Press, Stirlingshire, Scotland, 2011.

[2] P. Kacsuk, G. Sipos, "Multi-Grid, Multi-User Workflows in the P-GRADE Grid Portal", Journal of Grid Computing, 3(3-4): 221–238, 2005.

[3] A. Bauch, I. Adamczyk, P. Buczek, F.-J. Elmer, K. Enimanev, P. Glyzewski, M. Kohler, T. Pylak, A. Quandt, C. Ramakrishnan, C. Beisel, L. Malmström, R. Aebersold, B. Rinn, "openBIS: a flexible framework for managing and analyzing complex data in biology research", BMC Bioinformatics, 12(1): 468, 2011.

[4] R. Murri, P. Kunszt, S. Maffioletti, V. Tschopp, "GridCertLib: A Single Sign-on Solution for Grid Web Applications and Portals", Journal of Grid Computing, 9(4): 441–453, December 2011.

[5] M.A. Steinemann, C. Graf, T. Braun, M. Sutter, "Realization of a Vision: Authentication and Authorization Infrastructure for the Swiss Higher Education Community", Educause 2003.

[6] A. Keller, A.I. Nesvizhskii, E. Kolker, R. Aebersold, "Empirical Statistical Model To Estimate the Accuracy of Peptide Identifications Made by MS/MS and Database Search", Analytical Chemistry, 74(20): 5383–5392, 2002.

[7] A.I. Nesvizhskii, A. Keller, E. Kolker, R. Aebersold, "A Statistical Model for Identifying Proteins by Tandem Mass Spectrometry", Analytical Chemistry, 75(17): 4646–4658, 2003.

[8] The applicake software, http://sybit.net/software/1344307-applicake

[9] L. Goodstadt, "Ruffus: a lightweight Python library for computational pipelines", Bioinformatics, 26(21): 2778–2779, 2010.

[10] P. Kacsuk, K. Karoczkai, G. Hermann, G. Sipos, J. Kovacs, "WS-PGRADE: Supporting parameter sweep applications in workflows", Workflows in Support of Large-Scale Science, WORKS 2008, pp. 1–10, 17 Nov. 2008.

[11] P. Kacsuk, "P-GRADE portal family for grid infrastructures", Concurrency and Computation: Practice and Experience, Special Issue: IWPLS 2009, 23(3): 235–245, 10 March 2011.

[12] A. Quandt, P. Hernandez, P. Kunszt, C. Pautasso, M. Tuloup, C. Hernandez, R.D. Appel, "Grid-based analysis of tandem mass spectrometry data in clinical proteomics", Stud Health Technol Inform, 126: 13–22, 2007.

[13] A. Quandt, A. Masselot, P. Hernandez, C. Hernandez, S. Maffioletti, R.D. Appel, F. Lisacek, "SwissPIT: A workflow-based platform for analyzing tandem-MS spectra using the Grid", Proteomics, 9(10): 2648–2655, May 2009.

[14] D.N. Perkins, D.J.C. Pappin, D.M. Creasy, J.S. Cottrell, "Probability-based protein identification by searching sequence databases using mass spectrometry data", Electrophoresis, 20(18): 3551–3567, 1999.

[15] M.Y. Brusniak, B. Bodenmiller, D. Campbell, K. Cooke, J. Eddes, A. Garbutt, H. Lau, S. Letarte, L. Mueller, V. Sharma, O. Vitek, N. Zhang, R. Aebersold, J. Watts, "Corra: Computational framework and tools for LC-MS discovery and targeted mass spectrometry-based proteomics", BMC Bioinformatics, 9(1): 542, 2008.

[16] O. Kohlbacher, K. Reinert, "OpenMS and TOPP: Open Source Software for LC-MS Data Analysis", in Proteome Bioinformatics, Methods in Molecular Biology, Volume 604, Chapter 14, pages 201–211, 2010.

[17] M.R. Berthold, N. Cebron, F. Dill, T.R. Gabriel, T. Kötter, T. Meinl, P. Ohl, K. Thiel, B. Wiswedel, "KNIME - the Konstanz information miner: version 2.0 and beyond", SIGKDD Explorations, 11(1): 26–31, 2009.

[18] M. Wewior, L. Packschies, D. Blunk, D. Wickeroth, K.D. Warzecha, S. Herres-Pawlis, U. Lang, et al., "The MoSGrid Gaussian Portlet - Technologies for the Implementation of Portlets for Molecular Simulations", in Proceedings of the International Workshop on Science Gateways (IWSG10), pp. 39–43.

[19] B. Giardine, C. Riemer, R.C. Hardison, R. Burhans, L. Elnitski, P. Shah, A. Nekrutenko, et al., "Galaxy: a platform for interactive large-scale genome analysis", Genome Research, 15(10): 1451–1455, 2005.

[20] E. Afgan, D. Baker, N. Coraor, B. Chapman, A. Nekrutenko, J. Taylor, "Galaxy CloudMan: delivering cloud compute clusters", BMC Bioinformatics, 11(Suppl 12): S4, 2010.

[21] A. Keller, A.I. Nesvizhskii, E. Kolker, R. Aebersold, "Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search", Anal Chem, 74: 5383–5392, 2002.