=Paper=
{{Paper
|id=None
|storemode=property
|title=WeNMR: Structural Biology on the Grid
|pdfUrl=https://ceur-ws.org/Vol-819/paper4.pdf
|volume=Vol-819
|dblpUrl=https://dblp.org/rec/conf/iwsg/WassenaarDLSVSZBGFRBHJBJGSVDVVFKSFILSVBPMFB11
}}
==WeNMR: Structural Biology on the Grid==
WeNMR: Structural Biology on the Grid Tsjerk A. Wassenaar1,14, Marc van Dijk1, Nuno Loureiro-Ferreira1,15, Gijs van der Schot1, Sjoerd J. de Vries1, Christophe Schmitz1, Johan van der Zwan1, Rolf Boelens1, Andrea Giachetti2, Lucio Ferella2, Antonio Rosato2, Ivano Bertini2, Torsten Herrmann3, Hendrik R. A. Jonker4, Anurag Bagaria5, Victor Jaravine5, Peter Güntert5, Harald Schwalbe4, Wim F. Vranken6,16, Jurgen F. Doreleijers7,8, Gert Vriend8, Geerten W. Vuister9,7, Daniel Franke10, Alexey Kikhney10, Dmitri I. Svergun10, Rasmus Fogh11, John Ionides11, Ernest D. Laue11, Chris Spronk12 , Marco Verlato13, Simone Badoer13, Stefano Dal Pra13,17, Mirco Mazzucato13, Eric Frizziero13, Alexandre M.J.J. Bonvin1,* 1 Bijvoet Center for Biomolecular Research, Faculty of Science, Utrecht University, Padualaan 8, 3584 CH, Utrecht, The Netherlands. 2 Magnetic Resonance Center, University of Florence, 50019 Sesto Fiorentino, Italy. 3 Centre de RMN à très Hauts Champs, Institut des Sciences Analytiques, Université de Lyon, UMR-5280 CNRS, ENS Lyon, UCB Lyon 1, 5 rue de la Doua, 69100 Villeurbanne, France. 4 Institute of Organic Chemistry and Chemical Biology and Biomolecular Magnetic Resonance Center, Goethe University Frankfurt, 60438 Frankfurt am Main, Germany. 5 Institute of Biophysical Chemistry and Biomolecular Magnetic Resonance Center, Goethe University Frankfurt, 60438 Frankfurt am Main, Germany. 6 European Bioinformatics Institute, Hinxton, Cambridge, CB10 1SD, UK. 7 Protein Biophysics/IMM, Radboud University Nijmegen, Geert Grooteplein 26-28, Nijmegen, The Netherlands. 8 CMBI, Radboud University Nijmegen Medical Centre, Geert Grooteplein 26-28, Nijmegen, The Netherlands. 9 Department of Biochemistry, Henry Wellcome Building, University of Leicester, Lancaster Road, Leicester LE1 9HN, U.K. 10 European Molecular Biology Laboratory, Hamburg Outstation, Notkestrasse 85, D22603 Hamburg, Germany. 11 Department of Biochemistry, University of Cambridge, 80 Tennis Court Road, Cambridge CB2 1GA, UK. 12 UAB "Spronk NMR Consultancy" Palangos gatvė 4 LT-01402, Vilnius, Lithuania. 13 Istituto Nazionale di Fisica Nucleare, Sez. di Padova, 35131 Padova, Italy. 14 Current address: Groningen Biomolecular Sciences and Biotechnology Institute, Rijksuniversiteit Groningen, Nijenborgh 7, 9747AG, The Netherlands. 15 Current address: Stichting European Grid Initiative (EGI), 140 Science Park, 1098 XG Amsterdam, The Netherlands. 16 Current address: Department of Structural Biology, VIB, and Structural Biology Brussels, Vrije Universiteit Brussel, Pleinlaan 2, 1050 Brussels, Belgium. 17 Current address: Istituto Nazionale di Fisica Nucleare, CNAF, 40127 Bologna, Italy. ABSTRACT number of programs often used in Structural Biology have been The WeNMR (http://www.wenmr.eu) project is an EU-funded made available through portals, including HADDOCK, XPLOR-NIH, international effort to streamline and automate structure CYANA and CS-ROSETTA, MARS, MDDNMR. The implementation determination from Nuclear Magnetic Resonance (NMR) data. of these services, in particular the distribution of calculations to the Conventionally calculation of structure requires the use of various Grid, involves a novel mechanism for submission and handling of softwares, considerable user expertise and ample computational jobs that is independent of the type of job being run. With over 280 resources. To facilitate the use of NMR spectroscopy in life sciences registered users (April 2011), eNMR/WeNMR is currently one of the the eNMR/WeNMR consortium has set out to provide protocolized largest Virtual Organization (VO) in life sciences. With its large and services through easy-to-use web interfaces, while still retaining worldwide user community, WeNMR has become the first Virtual sufficient flexibility to handle more specific requests. Thus far, a Research Community officially recognized by the European Grid Infrastructure (EGI). * To whom correspondence should be addressed. Copyright © 2011 for the individual papers by the papers' authors. Copying permitted only for private and academic purposes. This volume is published and copyrighted by its editors. 3rd International Workshop on Science Gateways for Life Sciences (IWSG 2011), 8-10 JUNE 2011 1 INTRODUCTION sufficient resonances have been assigned, restraints can be inferred from the data, pertaining to distances between atoms, dihedral angles, domain orientations, etc. NMR Spectroscopy is one of two techniques that allow When an adequate number of restraints is available, these can be used to calculate a determining three dimensional (3D) structures of set of three-dimensional structures optimally satisfying these restraints. The resulting biomacromolecules, such as proteins, RNA, DNA, and their structures represent the structure of the protein in solution, which is validated against the available experimental data. Although the process is here depicted linearly, complexes, at atomic resolution. Knowledge of their 3D structures intermediate stages may involve iterative cycles of refinement. is vital for understanding functions and mechanisms of action of macromolecules, and for rationalizing the effect of mutations. 3D For each of the steps involved, specialized computer programs structures are also important as guides for the design of new are available, each with its own characteristics and often with its experimental studies and as starting point for rational drug design. own data format. Processing of NMR data has thus become a task An advantage of NMR over X-ray crystallography is that it also for specialists, who can understand the data and their formats, as allows investigation of time-dependent chemical and well as the programs, with installation requirements and usage conformational phenomena, including reaction and folding kinetics details. Furthermore, NMR data processing requires considerable and intramolecular dynamics. For these reasons, NMR plays an data storage and computational resources. These factors together important role within the life sciences. currently represent a barrier for groups in life sciences to employ The principles underlying NMR are modulation of the natural the full power of NMR. Against this background, the eNMR magnetic moment of atomic nuclei, and measurements of how the project was ran as a European initiative funded under the system relaxes back to the initial state (Bloch, 1946; Purcell, et al., Framework 7 e-Infrastructure programme to considerably facilitate 1946). The signal thus obtained is a fading wave consisting of this process. It is now carried on by the WeNMR project since many individual frequency contributions: the Free Induction November 2010. It aims at allowing groups lacking the resources Decay, FID. Typically, up to 27000 different frequencies can be to add NMR to their toolbox, as well as to allow dedicated NMR resolved at the highest magnetic fields that are nowadays available. groups to improve their standard from basic practice towards To investigate the frequency contributions and their decays, such cutting-edge research. measurements have to be repeated many times, due to the low signal-to-noise ratio. To obtain structural information from NMR The main objectives of the WeNMR project are: data, many more, but also more complex measurements have to be • to provide integrated protocols for NMR data processing run, yielding substantial amounts of data that need processing. • to provide access to end users through user-friendly web Processing data from NMR to obtain a 3D structure typically interfaces involves the following steps, summarized graphically in Figure 1. • to exploit Grid technology for computationally First the raw data have to be processed, more specifically Fourier- demanding tasks in structural biology transformed, to obtain spectra revealing the different frequency • to lower the barriers for access to Grid resources in life contributions and their relations. These frequencies are the sciences, notably in structural biology resonances of the atoms measured, but to infer structural • to build a virtual research community around a web information from them, these resonances subsequently have to be portal assigned to individual contributors (atoms/residues). If the • to initiate SAXS (Small-angle X-ray scattering) assignment is sufficiently complete, structural restraints can be integration into the WeNMR project determined from the spectra, including inter-atomic distance restraints, dihedral angle restraints, and orientation restraints. Considering the background sketched, these objectives set the These structural restraints are then used to calculate a number of challenges to be met within the project. The first of these has been structures using a variety of molecular modeling approaches, after the implementation of a new NMR Grid infrastructure. which structure validation checks are performed to assert the Historically, due to the requirements for processing of large quality of the results. amounts of data, NMR spectroscopy has always been intimately linked with high performance computing. Therefore, sites with high-end facilities for performing NMR measurements commonly also have considerable computational resources. For the WeNMR partners it thus came as a natural first step to integrate the existing resources into a Grid, offering a single standard for deployment and use of applications across the contributing sites, as well as a natural mechanism to share resources. Currently, the WeNMR project involves an operational Grid, running gLite 3.1 and 3.2 middleware, and the individual sites are being part of the EGI provided by National Grid Initiatives (NGIs) and their infrastructures from Europe and elsewhere. Having an operational Grid, the programs involved in the different steps, which often require direct user interaction, have to Fig. 1 NMR data processing from signal to 3D structure After acquisition of the be interfaced in such a way that they can be run automatically. primary NMR data, these are Fourier transformed to obtain spectra in which the Focus has been initially placed on the CPU intensive programs, individual frequency contributions or resonances of spin systems, and their relations, which have to be operated remotely as Grid enabled applications. are revealed. The resonances subsequently have to be assigned to individual atoms. If This has to be done in such a way that they can be combined in Copyright © 2011 for the individual papers by the papers' authors. Copying permitted only for private and academic purposes. This volume is published and copyrighted by its editors. 3rd International Workshop on Science Gateways for Life Sciences (IWSG 2011), 8-10 JUNE 2011 automated workflows for protocolized processing of data, raising are first subjected to a further cycle of simulated annealing, introducing the issue of interoperability. In addition, web interfaces should be flexibility to allow optimization of contacts. After this, a final cycle of set up to be easy to use, yet sufficiently flexible for expert users. refinement follows, in which the complex is solvated. The results are then At the same time a mechanism is required to handle job traffic to scored, analyzed and returned to the user. The structure calculations are the CPU intensive part of the process and involve a combination of energy and from the Grid. In the following paragraphs, these different minimization and MD (in torsion angle or Cartesian space) simulations. aspects are discussed in more detail, providing an account of the state of the project thus far. But before discussing the more technical details regarding the implementation, the portals that are available are discussed in more detail. 2 METHODS 2.1 The WeNMR web Portals for Structural Biology The web portals developed within WeNMR are among the most important elements of the project, as these form the points of entry for the end users. The ultimate goal is to offer to users registered with the eNMR/WeNMR Virtual Organization (VO) complete online protocols for processing NMR data, including all the steps depicted in Figure 1. In addition, each of these steps, and every program involved has value by itself as web based service. For this reason, a piece-wise implementation has been adopted and programs that are ported to the Grid are simultaneously being made available as a web portal. Currently, 12 portals are operational and can be accessed through http://www.wenmr.eu/wenmr/nmr-services. These provide access, among other services, to HADDOCK (De Vries, et al., 2007; Dominguez, et al., 2003) for the prediction of biomolecular complexes, XPLOR-NIH (Schwieters, et al., 2003), CYANA (Guntert, et al., 1997; Herrmann, et al., 2002) and CS-ROSETTA (Shen, et al., 2008; Shen, et al., 2009) for calculating structures from NMR data, AMBER (Case, et al., 2005) for structure refinement and molecular dynamics (MD) simulations CcpNmr (Vranken, et al., 2005) for data conversion, MARS (Jung and Zweckstetter, 2004) for backbone assignment, TALOS+ (Shen, et al., 2009) for torsion angle prediction, and MDDNMR (Jaravine, et al., 2008) for NUS (Non-Uniform Sampling) spectral processing. Next to these available portals, several new ones are in development for various NMR applications, including the UNIO program (Fiorito, et al., 2008; Volk, et al., 2008) that provides computational routines for each individual step depicted in Figure 1. The main WeNMR portal is shown in Figure 2. 2.2 HADDOCK HADDOCK (De Vries, et al., 2007; Dominguez, et al., 2003) is an acronym for High Ambiguity Driven DOCKing and is a program to predict structures of biomolecular complexes from individual components. As the full name indicates, this approach in docking of biomolecules distinguishes itself from other methods by using external information to guide the docking process. Such information can be empirical, theoretical or both, pertaining to the residues or atoms involved in the binding interface. From this information ambiguous restraints are derived that are used to drive the docking. HADDOCK is particularly useful in predicting complexes from known experimental structures of the partners using NMR data, such as chemical shift perturbations and residual dipolar couplings (RDCs). Chemical shift perturbations and RDCs can be obtained relatively easily and also for macromolecules of increasing size, making the large Fig. 2 The WeNMR web portal (http://www.wenmr.eu/wenmr/nmr-services) applicability of HADDOCK as a tool for cutting edge Structural Biology apparent. HADDOCK has proven its value within the CAPRI (Critical HADDOCK offers almost full control of the many parameters involved in Assessment of PRediction of Interactions) experiment, a blind evaluation of the docking process. To offer the full functionality of HADDOCK through the performance of current docking methods (De Vries, et al., 2007; a web portal thus requires putting forth a complicated form, contrasting Lensink, et al., 2007; Mendez, et al., 2005; van Dijk, et al., 2005). with the objective of having a simple interface. To avoid compromises The docking process starts with random placement of the individual regarding user friendliness and functionality, two innovations were components with a given separation and random orientations. introduced in the design of the portal. First of all, the portal is divided in Subsequently, a large number of complex structures, typically in the order four interfaces, corresponding to different levels of control and user of thousands, is generated by rigid-body docking, driven by the ambiguous experience: restraints. From these a number of structures, typically several hundred, are selected for further refinement, using a scoring function. These structures Copyright © 2011 for the individual papers by the papers' authors. Copying permitted only for private and academic purposes. This volume is published and copyrighted by its editors. 3rd International Workshop on Science Gateways for Life Sciences (IWSG 2011), 8-10 JUNE 2011 • The Easy Interface requires no more than providing the two The portal uses a design which is different from the HADDOCK portal, components of a complex and the residues of each that are and is aimed at more direct user interaction during the process. Users log in involved in the interaction. with their Grid certificate loaded in the web browser, gaining access to an • The Expert Interface allows the user to provide his own environment where projects can be started, stored and managed. Structure customized restraints to be included in the docking process and calculation projects are initiated by filling in a form and providing files for to specify certain aspects of the sampling and analysis. In the structures and topological descriptions of the molecules, as well as for addition, using this interface the user can set protonation states the different restraints to be included in the calculations. of histidine residues, and define regions of the interacting When the structure calculations have finished, the user can view and molecules to be kept flexible during the docking. This allows a download the results. In addition, it is possible to select a number of certain degree of conformational change to take place during structures for further refinement and characterization using the AMBER docking. package (Case, et al., 2005) for MD simulations. • The Guru Interface offers almost full control of parameters, allowing e.g. specification of symmetry and relaxation 2.4 CYANA anisotropy restraints and RDCs as well as of parameters Another widely used program for calculating structures from pertaining to the energy, the scoring and the analysis of results. conformational restraints is CYANA (Combined Assignment and • Finally, for complete control a File Upload Interface is dYnamics Algorithm for NMR Applications) (Guntert, et al., 1997; available, where a HADDOCK run parameter file can be Herrmann, et al., 2002). Its main characteristics are the ability for iterative provided. This is particularly useful for those who have their assignment of NOE peaks, and structure calculations through simulated own standard protocol or who want to replicate a previous run annealing in torsion angle space. Like with XPLOR-NIH, the structure with minor modifications. This option also offers a simple way calculations involve many simulated annealing runs, divided over several to build pipelines from other applications. iterative cycles. The design of the web portal for CYANA is similar to that of HADDOCK. The Expert and Guru interface offer control of the docking process at the Foldable menus are used to hide optional sets of parameters, by default expense of making the forms to be filled in more complex. Thus, to presenting an intuitive menu offering a standard structure calculation facilitate the user’s task and keep the forms manageable, foldable menus protocol. The portal allows three modes of invocation of the service. Users were introduced that group related parameters under a single header. In this can request structure calculation using a set of upper distance bound way, users only need to unfold groups of options that should be changed restraints, providing a list of assigned peaks, or providing a list of from their default values. unassigned peaks, in which case the automated peak assignment will be Except for the File Upload Interface, the HADDOCK portals share the data performed. structure, albeit that part of the variables is fixed to predefined values for Use of the service requires having a license for CYANA and registering for the Easy and Expert interfaces. This has the advantage that they can all use of the portal, presenting a valid Grid certificate. This will give a couple to a single back end CGI (Common Gateway Interface) script to username and password that can be used to sign service requests. handle the request, as will be discussed in more detail in the implementation details. 2.5 CS-ROSETTA After issuing a request, the user is presented a link to a site where the Chemical-Shift ROSETTA or CS-ROSETTA (Shen, et al., 2008; Shen, et progress can be followed. After the run is finished, the results can be al., 2009) is the third program for structure calculations that has been viewed online and selected complexes or the complete output data of the ported to the Grid and made available through the WeNMR web portal. run can be downloaded to a local machine. CS-ROSETTA, unlike XPLOR-NIH and CYANA, allows structure The use of the HADDOCK portal requires registration with a valid Grid determination of proteins, based on chemical shift information alone. It certificate, giving a username and password. These are thereafter used to thus bypasses the need for NOE based distance restraints, which usually sign service requests. The requests themselves are handled using an require considerable time to obtain. Further advantages of using chemical eToken-based robot certificate, as is explained in more detail in the shifts are that these are among the most reliable parameters that can be implementation details. obtained from NMR spectroscopy and that they can potentially be obtained for larger macromolecules for which NOEs become impractical. On the 2.3 XPLOR-NIH other hand, direct chemical shift based structure determination is XPLOR-NIH (Schwieters, et al., 2006; Schwieters, et al., 2003) is one of computationally much more expensive than structure calculations using the programs for structure calculations that have been ported to the Grid distance restraints. However, the most time consuming part of a CS- and are available through the WeNMR web portal. It is a versatile program ROSETTA run consists of a large number of independent calculations that that can be operated through a command line interface or with scripts in the can be easily distributed over the Grid. specific XPLOR language. Structure determination using CS-ROSETTA requires as only input the Performing structure calculations using NMR data commonly starts with amino acid sequence and a list of chemical shifts and a number of the generation of an extended conformation from a topological description parameters to control the process that can be changed from the default of the macromolecule. For standard components, such as protein, DNA and values. Backbone chemical shifts for 13Ca, 13Cb, 13C’, 1Ha, 1HN, and 15N that RNA, this topological description can be easily inferred from the sequence are provided by the user, are validated and stored as the target shifts. These of the building blocks. Distance, orientation and other restraints derived chemical shifts are first used to select a set of protein fragments from a from NMR data can then be added to the topological description and used structure database, e.g. the Protein Data Bank (PDB) (Berman, et al., 2000), to drive the system to a folded state using simulated annealing. This based on the list of chemical shifts as predicted with SPARTA. Then the annealing step is repeated many times to obtain sufficient statistics regular ROSETTA protocol (Rohl, et al., 2004) for Monte Carlo assembly regarding the goodness of fit of the structures determined against the and relaxation is used to reassemble the protein from the fragments. For the experimental data. Since the different annealing runs are independent of resulting models the chemical shifts are again predicted using SPARTA each other, they can be easily distributed over multiple CPUs. After the (Shen and Bax, 2007) and the deviations between the predicted and target annealing runs have finished, the best structures are usually selected for values are used as a pseudo-energy term in the scoring of the models, further refinement, including solvent in the calculations. yielding a ranking based on both overall structural quality as well as on the match with the experimental data. Copyright © 2011 for the individual papers by the papers' authors. Copying permitted only for private and academic purposes. This volume is published and copyrighted by its editors. 3rd International Workshop on Science Gateways for Life Sciences (IWSG 2011), 8-10 JUNE 2011 The computationally most expensive step in the process is the construction The TALOS+ portal offers a simple interface where an input file in either of a model using Monte Carlo assembly and relaxation. To obtain a reliable TALOS or BMRB format can be uploaded. In addition a number of PDB prediction, a set of 10 000 to 50 000 models has to be built, each starting ID codes can be given, indicating structures that should be excluded from from the same fragment library. Using different seeds for generation of the calculations. Unlike most of the other portals, the TALOS+ portal can random numbers ensures independence of the results from different runs. be used without a Grid certificate, as the calculations are run on a local For the WeNMR implementation of CS-ROSETTA only the Monte Carlo server. search is performed on the Grid. The computational cost involved in chemical shift based structure 2.8 MDDNMR determination makes CS-ROSETTA a typical example of a program that is The MDDNMR (Jaravine, et al., 2008) portal can process individual NUS beyond the capacity of most local sites. Here, the access to Grid resources multidimensional NMR spectra. The interface supports the NUS data through a web-portal, combining computational power and ease of use, recorded by two major spectrometer brands: Varian and Bruker. clearly demonstrates its added value. The main advantage of usage of NUS data is the substantially higher To use the service, users have to register with a valid Grid certificate to resolution in the indirect spectral dimensions. The NUS acquisition mode obtain a user name and password that can subsequently be used to sign of both Vnmr and TopSpin makes use of standard NMR experiments, service requests. The web interface for CS-ROSETTA itself is except that only a fraction of a full data set is recorded. This means that it straightforward and only requires uploading a file containing chemical can be used with virtually any pulse sequence available. After acquisition, shifts in TALOS (Cornilescu, et al., 1999) format, and modifying the MDDnmr replenishes the missing data points in the full matrix. The parameters to control the calculations. When the job has finished, the user resulting regular spectra are then processed conventionally with FFT (fast receives an e-mail containing a link to the result page that gives an Fourier transform), LP (Linear Programming), window functions etc. The overview of the run, including some statistics and images to assess the current portal allows this to be repeated for each experiment in the dataset; overall quality of the results. The user can then select a number of several such high-resolution experiments are processed sequentially as structures to view in more detail or to download, or choose to download the single matrices, and the resulting high-resolution FT-domain spectra, after whole set of results as an archive. peak-picking, are amenable for automatic backbone assignment using MARS. Most of the experimental data types supported by MDDNMR can 2.6 MARS be processed via the portal, including constant-time acquisition (CT), J- The fifth portal, MARS performs automatic backbone assignment of coupling splitting etc. The five use-case examples of 2D, 3Ds, 4D spectra 13 C/15N labeled proteins and is applicable to a wide variety of NMR data, are available for download on the WIKI pages including RDCs (Jung and Zweckstetter, 2004). (http://www.wenmr.org/wenmr/mddnmr-use-case-examples); the examples Its advanced features compare favorably with other assignment tools and have extensive documentation on algorithms and adequate tutorial on include: usage. The design of the web portal for MDDNMR is similar to that of • simultaneous optimization of the local and global quality of Cyana. Use of the service requires registering for use of the portal, assignment to minimize propagation of initial assignment errors presenting a valid Grid certificate. and thus providing robustness against missing chemical shift information; applicable to proteins above 15 kDa using only Ca 2.9 CcpNmr and Cb chemical shift information with connectivity thresholds The CcpNmr portal, for CcpNmr (Vranken, et al., 2005) based data as high as 0.5 ppm; conversions, is not directly related to NMR data processing, but is an • applicable to proteins with very high degeneracy such as important element within the WeNMR project. The reason for this is the partially or fully unfolded proteins; fact that the programs already ported to the Grid often have their own data • combination of the secondary structure prediction program formats, making it impossible to combine these as steps in a direct pipeline. PSIPRED (McGuffin, et al., 2000) with statistical chemical shift Establishing interoperability of such programs requires automated distributions, which were corrected for neighboring residue conversion of output from one step to match the input of a next step. This is effects (Wang and Jardetzky, 2002), to improve identification of exactly what CcpNmr was designed for. It has a comprehensive internal likely positions in the primary sequence; data model that can contain all different types of NMR related data. These • assessment of the reliability of fragment mapping by performing data can be imported from files in a large number of formats. Likewise, the multiple assignment runs with noise-disturbed chemical shifts. data stored in a CcpNmr project can be exported in any of the file format required, provided that the data are present in the model. In this way, Registration and a valid Grid certificate are needed to use the service. Its CcpNmr provides a straightforward approach in meeting one of the interface is similar to Cyana: after uploading peak lists a file containing challenges of the WeNMR project, namely establishing interoperability of assigned chemical shifts is downloaded after job has finished. A use-case the programs involved in NMR data processing and building automated example is available for download on the WIKI pages protocols for complex tasks. (http://www.wenmr.org/wenmr/mars-use-case-example). The portal for CcpNmr was developed as the program was ported to the Grid, to offer the WeNMR members an easy solution for matching program 2.7 TALOS+ output and program input during the steps involved in processing of their TALOS+ (Shen, et al., 2009), like its predecessor TALOS (Cornilescu, et data. At present, the portal allows conversion between several different file al., 1999), is a program to predict torsion angles for amino acids given formats, aimed at facilitating the use of the other portals. The file information regarding chemical shift and the probable regions in the conversions for the portal are performed locally, as these are not Ramachandran plot for each type of amino acid. TALOS+ distinguishes computationally intensive. Use of the CcpNmr portal thus does not require itself by the inclusion of a neural network component, the output of which a Grid certificate. is added as an empirical term in the conventional TALOS data base search. To prevent assignment of torsion angles to the backbone of flexible 2.10 AMBER regions, TALOS+ first identifies such regions using the flexibility AMBER (Case, et al., 2005) is a collective name for a suite of programs prediction program RCI developed by Berjanskii and Wishart (Berjanskii that allow users to carry out MD simulations on biological systems. The and Wishart, 2005). web-portal permits the creation and management of MD calculations from a web browser. The portal takes care also of all the grid accounting. A new Copyright © 2011 for the individual papers by the papers' authors. Copying permitted only for private and academic purposes. This volume is published and copyrighted by its editors. 3rd International Workshop on Science Gateways for Life Sciences (IWSG 2011), 8-10 JUNE 2011 user with her/his personal certificate installed in the browser can access to another machine. This stage includes input type checking. The next level the portal from WeNMR web page and straightforwardly create a new user involves preparation, trafficking and monitoring of jobs between the server login and password. The site permits the creation of a new single and the Grid. The third layer is the core layer involving the process(es) to calculation or the creation of a project where it will be possible to save a be run on a worker node. The tasks associated with these different levels number of different calculations that belong to a common project (e.g. MD are conceptually unrelated and allow for a component based development refinements of protein structures generated with different ensembles of approach, in which distinct tasks are programmed in a most generic form. restraints). This has the advantage that such building blocks can be easily maintained, The user can select a pre-set MD refinement protocol to energy optimize adapted and reused. NMR protein structures. The currently proposed protocol comprises four To facilitate the component-based implementation, a single, simple model, steps: illustrated in Figure 3, was designed within the WeNMR consortium for the • In the first step the protein structure to be optimized is uploaded. representation of processes. This model characterizes any process as a Here it is possible to upload one pdb file, whose format will be block with four connectors: input, output, dependencies and logging. The automatically validated and converted into the format input and output connectors allow building larger sequences or complex recognized by amber. During this step the user can add explicit workflows. The dependencies are considered static to a process and the water molecules to solvate the protein, add counter ions, insert logging is for messaging and provenance. Logging information can also be new bonds for selected atoms (e.g. protein to metal). For NMR used to check the status and react to errors. Processes are all shaped to structures, which are typically represented by a bundle of 20-40 adhere to this simple model, which can be achieved by rewriting programs different conformers, this first step is carried out only for the or by wrapping them inside a script. Doing so, a set of process modules is first conformer and then automatically applied to all the obtained that can easily be combined in a pipeline. structures in the bundle. For each structure in the bundle, an Process pipelines are often built imperatively, using one of the standard individual job is sent to the grid. scripting languages. But an imperative approach has the drawback that it is • The second step manages the NMR restraints. Four types of inflexible: e.g. a failure will cause the whole pipeline to fail and a process restraint are allowed: NOE, dihedral angles, RDCs and has to be started anew. For this reason, a partial declarative approach was pseudocontact shifts (PCS). The last restraints are the so-called designed, in which direct communication between processes from the paramagnetic restraints. For all restraints it is possible to upload different layers of operation are eliminated. Rather, output from one level Xplor, Dyana, or Cyana files. For paramagnetic restraints, it is that forms the input for another is ‘pooled’ on disk. Processes from the next possible to fit the anisotropy tensor directly in the web site. layer that depend on these data are run periodically, scanning the pool for • The third step manages the setting of MD calculations. The page data matching the input requirements. This has the advantage that the state can be visualized in a so-called basic mode and in an extended of all processes is naturally check pointed and that the use of computational mode, which allows users to view all the details of the amber resources can be better controlled. How this approach is used to connect the settings. web portals to Grid calculations is illustrated in Figure 4 and explained in • The fourth and final step allows the user to give a name to the more detail below. Note that this description is rather general and some of calculation started and submit it to the grid. After the results of a the portals do not yet adhere to this strict separation of the layers. job have been downloaded, the user can browse them and download the various files. 2.11 UNIO In addition to the portals already available, a UNIO portal has now been tested on local infrastructure and is expected to be finalized and available mid 2011. UNIO comprises elements for all major tasks involved in protein structure determination by NMR (Figure 1). The UNIO portal will allow the user to obtain backbone resonance assignment based on projection NMR spectroscopy of high-dimensional spectra. Such spectra have recently received much interest in the NMR community and presumably represent a substantial and more reliable addition to data analysis programs commonly used for backbone NMR resonance assignment (Volk, et al., 2008). The UNIO portal will be designed as a multiple component web portal, similar to the HADDOCK portal. It will offer expert systems for the subsequent computationally demanding tasks of NMR signal identification, side-chain resonance assignment (Fiorito, et al., 2008) and comprehensive collection of distance restraints (Herrmann, et al., 2002), with the latter task focusing on the primary source of NMR-based protein modeling. UNIO is compatible with powerful NMR structure calculation programs, such as CYANA and CNS, which are already operational on the WeNMR grid infrastructure, and will equip the structural biology community with all computational processing tools necessary for a complete protein structure determination by NMR. Fig. 3 Process model To facilitate implementation and management, processes are represented and, if needed, rewritten in a manner adhering to a simple five node 3 STRUCTURAL BIOLOGY ON THE GRID: model, with the process itself as the central block that has four connectors: an input DESIGN STRATEGIES AND IMPLEMENTATION and an output connector, a connector for dependencies and one for logging. The Successfully running web portals requires a proper machinery to handle output of one process can be connected to the input of another to build larger requests. This machinery involves various steps that can be categorized in sequences or workflows. Obviously, each connection can involve several components, e.g. a process’ input can consist of several files and/or option settings. The process three layers of operation: The server level involves handling of service itself may be regarded a black box, as long as the connections are well-defined. requests, either by direct human interaction or through requests from Copyright © 2011 for the individual papers by the papers' authors. Copying permitted only for private and academic purposes. This volume is published and copyrighted by its editors. 3rd International Workshop on Science Gateways for Life Sciences (IWSG 2011), 8-10 JUNE 2011 The data parsed and validated are then processed by a request specific script that combines all data and control parameters required for further processing into a self-contained job package. This package is subsequently placed into a job pool directory, which ends processing of the request at the first level. The second level: Grid job preparation, trafficking and monitoring At the second level, a daemon job is running periodically, scanning the job pool for jobs that are ready to be run on the Grid and submitting these when found. In principle, this daemon job does not require information regarding the nature of the job, although in practice different instances are run, each linked to one type of job to better control the work load associated with the different tasks. A separate daemon job is running, also periodically, checking the status of the jobs running on the Grid and retrieving the results when finished. Alternatively, this process can resubmit the job when it has failed. The results are put back, after validation, in a place where they can be accessed through a web page. Like the submission process, the polling and retrieval process is in principle independent, since all information regarding the job, such as the directory to place the results in, are contained in the job package. Submission, polling and retrieval of output are handled using a standard toolbox for Grid operation, which, in the case of WeNMR, is the gLite 3.1 / gLite 3.2 suite. Accordingly, the jobs that operate at this level require the use of a valid proxy. To facilitate proxy management, all of the processes at the second level of operation are running using an eToken-based robot certificate, in accordance with the security requirements for data portals Fig. 4 Grid job submission management using job pooling The figure shows a formulated by the Joint Security Policy Group general scheme for managing job trafficking to and from the Grid, using server side (https://www.jspg.org/wiki/VO_Portal_Policy). job pooling. This scheme is characterized by a separation of three layers of operation, between which there is no direct communication. Green boxes indicate user The third level: Primary tasks interaction, whereas yellow boxes indicate jobs that are running periodically as daemon jobs and that use an eToken-based robot certificate for generating a Grid proxy. The blue ellipses represent ‘pools’, which are used for storage of job or result The third level of operation involves the tasks running on the Grid. This packages. User service requests are processed on the server, up to the point of requires programs to be ported to the Grid, but that process is relatively generating a job package that is stored on disk. On the Grid UI (User Interface) a straightforward. The only aspect that is different from more common daemon job (grid-submission) is running on a scheduled base scanning the ‘job pool’ strategies is that the processes have to adhere to the process model for job packages and submitting these to the Grid when found. Another daemon job discussed previously, facilitating automation and provenance. (grid-polling) is periodically checking running jobs for their status, retrieving the results when ready and placing these in a result pool. Finally, results are presented back to the user, possibly after post-processing (results-processing). Currently the 4 RESULTS HADDOCK, CS-ROSETTA, UNIO, CYANA, MARS and MDD-NMR portals, which all send jobs to the Grid, are implemented following this model. 4.1 Status and Statistics Since the start of the eNMR project in fall 2007, considerable The first level: Request and data handling, invocation of the service, progress has been made, both in the deployment and the utilization reporting of an infrastructure for structural biology. Currently, the The first level of operation involves the interaction with the user, both infrastructure is distributed over three partner sites, which together processing the request, as well as presenting the results. A service request is provide a body of 272 dedicated CPUs, and 2.87 Tb of storage. made by filling in a form that is parsed by a CGI script. This script also Resources are shared with 16 other sites, giving access to about performs type checking and validation of the user input and presents the 10000 CPU cores and 37 Tb of storage. user a unique ID with which the results can be retrieved. Over the last year, more than 1 million jobs have been run on the Both the web form and the CGI script depend primarily on the data to be Grid, corresponding to about 500 years of normalized CPU time. provided, and it is possible to generate these automatically from a The overall CPU efficiency, the total CPU time divided by the total description of these data. To this purpose, the Spyder framework wall time, of all jobs was 99.0% (statistics taken from the EGI (http://www.spyderware.nl, S.J. de Vries, unpublished) was designed, initially to facilitate setting up and managing the portals for HADDOCK. accounting portal). Next to the generation of web forms, Spyder natively supports data validation for known types and can convert between data types if all Including the twelve applications that have already been made intermediary conversions are defined. Thus, Spyder offers a single available through a web portal, twenty programs have been ported framework for managing most of the elements involved in the first stage of to the Grid, several of which will be made available as web portals processing of requests. in the near future. Copyright © 2011 for the individual papers by the papers' authors. Copying permitted only for private and academic purposes. This volume is published and copyrighted by its editors. 3rd International Workshop on Science Gateways for Life Sciences (IWSG 2011), 8-10 JUNE 2011 Currently there are more than 280 users registered, several of Dominguez, C., Boelens, R. and Bonvin, A.M.J.J. (2003) HADDOCK: A protein- which use the portals on a regular basis. The most active portals protein docking approach based on biochemical or biophysical information, J Am Chem Soc, 125, 1731-1737. are the ones for HADDOCK and for CS-ROSETTA, which have Fiorito, F., et al. (2008) Automated amino acid side-chain NMR assignment of processed over 1109 and 545 requests thus far. Together these proteins using (13)C- and (15)N-resolved 3D [ (1)H, (1)H]-NOESY, J Biomol requests account for over 90% of the jobs that have been run on the Nmr, 42, 23-33. Grid, as a result of the task farming approach involved. Guntert, P., Mumenthaler, C. and Wuthrich, K. (1997) Torsion angle dynamics for NMR structure calculation with the new program DYANA, Journal of Molecular 4.2 Conclusions Biology, 273, 283-298. Herrmann, T., Guntert, P. and Wuthrich, K. (2002) Protein NMR structure Since the beginning of the eNMR project, the eNMR/WeNMR determination with automated NOE assignment using the new software CANDID consortium has managed to set up an operational Grid and the torsion angle dynamics algorithm DYANA, Journal of Molecular Biology, (http://www.wenmr.eu/wenmr/wenmr-grid-statistics), to port 319, 209-227. twenty applications and bring up twelve web portals Jaravine, V.A., et al. (2008) Hyperdimensional NMR spectroscopy with nonlinear sampling, J Am Chem Soc, 130, 3927-3936. (http://www.wenmr.eu/wenmr/nmr-services), with several others Jung, Y.S. and Zweckstetter, M. (2004) Mars - robust automatic backbone assignment being finalized. At the time of writing, WeNMR has already grown of proteins, J Biomol Nmr, 30, 11-23. to be nearly the largest virtual organization within the life sciences. Lensink, M.F., Mendez, R. and Wodak, S.J. (2007) Docking and scoring protein This successful start has been underlined by the award for the best complexes: CAPRI 3rd edition, Proteins-Structure Function and Bioinformatics, 69, 704-718. demonstration of an application, received at the EGEE (Enabling McGuffin, L.J., Bryson, K. and Jones, D.T. (2000) The PSIPRED protein structure Grids for E-sciencE) 2009 User Forum. With the present-day prediction server, Bioinformatics, 16, 404-405. momentum, the WeNMR project is rapidly evolving into a factor Mendez, R., et al. (2005) Assessment of CAPRI predictions in rounds 3-5 shows of importance within structural biology, and life sciences in progress in docking procedures, Proteins, 60, 150-169. Purcell, E.M., Torrey, H.C. and Pound, R.V. (1946) Resonance Absorption by general. As such it has been the first Virtual Research Organization Nuclear Magnetic Moments in a Solid, Physical Review, 69, 37. officially recognized by the EGI. Currently, efforts include writing Rohl, C.A., et al. (2004) Protein structure prediction using Rosetta, Methods in WSDL definitions for the portals that will allow calling services Enzymology, 383, 66-93. remotely, e.g. from a workflow-manager. At the next stage of the Schwieters, C.D., Kuszewski, J.J. and Clore, G.M. (2006) Using Xplor-NIH for NMR project the different elements will be combined, providing molecular structure determination, Progress in Nuclear Magnetic Resonance Spectroscopy, 48, 47-62. comprehensive, yet easy-to-use tools for integrated analysis of Schwieters, C.D., et al. (2003) The Xplor-NIH NMR molecular structure NMR data. Furthermore, a number of SAXS services are being determination package, Journal of Magnetic Resonance, 160, 65-73. added to support a wider user community in structural biology. Shen, Y. and Bax, A. (2007) Protein backbone chemical shifts predicted from Up-to-date information, regarding the state of the project, the searching a database for torsion angle and sequence homology, J Biomol Nmr, 38, available services, and how to join the WeNMR virtual 289-302. Shen, Y., et al. (2009) TALOS plus : a hybrid method for predicting protein backbone organization, can be found on the project web page at torsion angles from NMR chemical shifts, J Biomol Nmr, 44, 213-223. http://www.wenmr.eu. Shen, Y., et al. (2008) Consistent blind protein structure generation from NMR chemical shift data, Proceedings of the National Academy of Sciences of the United States of America, 105, 4685-4690. ACKNOWLEDGEMENTS Shen, Y., et al. (2009) De novo protein structure generation from incomplete chemical shift assignments, J Biomol Nmr, 43, 63-78. The WeNMR project is funded by the European Commission van Dijk, A.D., et al. (2005) Data-driven docking: HADDOCK's adventures in under an FP7 e-Infrastructure grant, contract no. 261572 and builds CAPRI, Proteins, 60, 232-238. on the previous FP7 e-Infrastructure project e-NMR, contract no. Volk, J., Herrmann, T. and Wuthrich, K. (2008) Automated sequence-specific protein 213010. Support from the former EGEE and the current EGI in NMR assignment using the memetic algorithm MATCH, J Biomol Nmr, 41, 127- 138. terms of expertise and recognition is also acknowledged. The Vranken, W.F., et al. (2005) The CCPN data model for NMR spectroscopy: national Grid Initiatives of Belgium, Italy, Germany, the Development of a software pipeline, Proteins-Structure Function and Netherlands (via the Dutch Big Grid project), Portugal, UK, South Bioinformatics, 59, 687-696. Africa and the Latin America Grid infrastructure via the Gisela+ Wang, Y.J. and Jardetzky, O. (2002) Investigation of the neighboring residue effects on protein chemical shifts, J Am Chem Soc, 124, 14075-14084. project is acknowledged for the use of web portals, computing and storage facilities. Finally, the authors like to thank those that have expressed their interest in and support to the project. REFERENCES Berjanskii, M.V. and Wishart, D.S. (2005) A simple method to predict protein flexibility using secondary chemical shifts, J Am Chem Soc, 127, 14970-14971. Berman, H.M., et al. (2000) The Protein Data Bank, Nucleic Acids Research, 28, 235- 242. Bloch, F. (1946) Nuclear Induction, Physical Review, 70, 460. Case, D.A., et al. (2005) The Amber biomolecular simulation programs, J Comput Chem, 26, 1668–1688. Cornilescu, G., Delaglio, F. and Bax, A. (1999) Protein backbone angle restraints from searching a database for chemical shift and sequence homology, J Biomol Nmr, 13, 289-302. De Vries, S.J., et al. (2007) HADDOCK versus HADDOCK: New features and performance of HADDOCK2.0 on the CAPRI targets, Proteins-Structure Function and Bioinformatics, 69, 726-733. Copyright © 2011 for the individual papers by the papers' authors. Copying permitted only for private and academic purposes. This volume is published and copyrighted by its editors.