3rd International Workshop on Science Gateways for Life Sciences (IWSG 2011), 8-10 JUNE 2011

A Pipeline Pilot based SOAP implementation of FlexScreen for High-Throughput Virtual Screening

Horacio Pérez-Sánchez1, Ivan Kondov2, José M. García1, Konstantin Klenin2 and Wolfgang Wenzel3,*

1 Computer Engineering and Technology Department, University of Murcia. Murcia (Spain)
2 Steinbuch Centre for Computing, Karlsruhe Institute of Technology. Karlsruhe (Germany)
3 Institute of Nanotechnology, Karlsruhe Institute of Technology. Karlsruhe (Germany)

* To whom correspondence should be addressed.

Copyright © 2011 for the individual papers by the papers' authors. Copying permitted only for private and academic purposes. This volume is published and copyrighted by its editors.

ABSTRACT

Methods for in-silico screening of large databases of molecules increasingly complement and replace experimental techniques to discover novel compounds to combat diseases. As these techniques become more complex and computationally costly, it becomes increasingly difficult to provide a community of life-science researchers with a convenient way to run complex high-throughput virtual screening (HTVS) calculations on distributed computing resources. To this end, we recently integrated the biophysics-based drug screening methodology FlexScreen into a service applicable for large-scale parallel screening and reusable in the context of scientific workflows. Our implementation, based on Pipeline Pilot and SOAP, provides an easy-to-use graphical user interface to construct complex workflows which are executed on distributed computing resources, thus accelerating the throughput by several orders of magnitude.

1 INTRODUCTION

The discovery of new drugs can be drastically accelerated with the use of high-throughput virtual screening (HTVS) methods (Friesner, et al., 2004; Halgren, et al., 2004; Meng, et al., 1992; Merlitz, et al., 2003; Merlitz and Wenzel, 2002; Merlitz and Wenzel, 2004), an ongoing trend in medical research that takes advantage of recent advances in the field. In order to identify promising candidates for new drugs, chemical compound databases with millions of ligands (Irwin and Shoichet, 2005) need to be screened using HTVS against structurally resolved receptors, and hence access to computational resources becomes a serious issue. Many research organizations have access to high performance computing (HPC) resources distributed in computing grids and clusters, which can help tremendously to overcome these constraints (Perez-Sanchez and Wenzel, 2011).

HPC resources comprise a wide range of hardware and software resources for the research group members. They are usually accessed through well-defined gateways, which are based on web services or remote-access user interface machines (UIs). However, both solutions still require in-depth knowledge of grid technologies from non-expert end users. The major drawback of this direct approach to scientific research is its complexity and difficulty of use, which make the learning curve too steep. Considerable effort is therefore needed to hide the complexity embedded in "the Grid" and to provide high-level services that allow scientists to take advantage of the distributed resources more effectively.

Science gateways are the primary solutions dedicated to bridging such knowledge gaps. A science gateway is defined as "a community developed set of tools, applications, and data that is integrated via a portal or a suite of applications, usually in a graphical user interface, that is further customized to meet the needs of a targeted community" (Catlett, 2002; Catlett, 2005). With science gateways, non-grid-aware users can use grid infrastructure to run shared, well-tested applications customized for their own research field. Generally these solutions contain a set of research-specific applications developed by (and for) the community, and provide services integrated in a unified user interface, usually a web portal or a stand-alone graphical user interface. In the context of HTVS this problem is paramount because the target user community consists of pharmacists and biologists not trained or experienced in the use of HPC/grid infrastructures.

Very often, science gateways provide special higher-level services for the construction and execution of scientific workflows, i.e., means to automate the processing of multiple steps in parallel or in sequence, including branching and loops. Thus, workflows are abstract logical maps of complex simulation protocols. Scientific workflows require each step (often a different scientific application) to provide common interfaces for execution and data exchange. Currently, several systems for workflow management are employed in different projects. For example, the UNICORE workflow engine has been used in the area of QSAR/QSPR (Sild et al., 2005) and Gridbus for brain imaging (Pandey et al., 2009). Other very widely used workflow systems are Kepler (kepler-project.org) and Taverna (taverna.org.uk). For a review of scientific workflows we refer to (Yu et al., 2005).

In order to make HTVS methods accessible to the relevant community, we must (a) integrate the screening method into an easy-to-use graphical interface, (b) make the interface reusable in different scientific workflows in combination with other applications, and (c) provide seamless access to large-scale computational resources to enable large screening campaigns. In this work we present a solution for the HTVS application FlexScreen which takes into account these three aspects.
In Section 2 we introduce the program FlexScreen as well as the methods we employed to integrate FlexScreen into workflows for HTVS. In Section 3 we describe how we adopted Pipeline Pilot and the SOAP standard to implement our concept, and present a case study using the developed machinery. In Section 4 we conclude and give an outline of future work.

2 METHODS

2.1 FlexScreen

HTVS calculations have been performed with the all-atom receptor-ligand docking program FlexScreen (Guerrero, et al., 2011; Merlitz, et al., 2003; Merlitz and Wenzel, 2002), which employs a force-field based scoring function (similar to Autodock (Morris, et al., 1996)) and a Monte Carlo search algorithm based on the stochastic tunneling method (Wenzel and Hamacher, 1999). This method has the advantage that it suffers only a comparatively small loss of efficiency when an increasing number of receptor degrees of freedom is considered.

A physical model is implemented which implicitly takes into account the influence of the solvent on the interaction between ligands and proteins. The free energy of the system includes the vacuum contribution that has been previously available in FlexScreen as well as additional solvation terms for the individual species and for the complex, expressed as a linear sum of atomic parameters (Eisenberg, et al., 1984). This latter model has the advantage that it is faster than other methods presently used and has still proven to be reasonably accurate. The solvent accessible surface area of the molecules must be determined, which is a computationally intensive task, and in this work an exact approach and an approximate but less time-consuming approach are presented. The other main contribution of this approach is the determination of the weight parameters for very different atom and bond types, which are derived from experimental octanol-water and gas-water partition coefficient data.

2.2 Pipeline Pilot

Pipeline Pilot (http://www.scitegic.com) provides services and a workflow engine based on Service Oriented Architecture (SOA) (Yang, et al., 2010), allowing very effective workflow life-cycle management, i.e., it ensures maximum reuse of already integrated modules. In addition, it supports SOAP with Web Services Description Language (WSDL) extensions for efficient decoupling of workflow management from the services' internal implementation. In this way, in addition to its built-in functionality, the architecture of Pipeline Pilot has been organized for integration and extensibility and designed to interoperate with external software objects and applications. A number of mechanisms are available to automate the execution of a remote program. Additional options are available if the screening code resides on the workflow server. In general, two mechanisms are used for remote execution. Simple integrations use Telnet and File Transfer Protocol (FTP). More complex integrations use Simple Object Access Protocol (SOAP) (Snell, et al., 2002) and web services. SOAP provides a way for applications to communicate with each other over the HPC resources. The SOAP framework is independent of any particular programming model, environment, or language. It is a structured method for sharing messages between server and client: it relies on XML to define the format of the information and then adds the necessary HTTP headers to it. Most applications do not deal directly with the underlying SOAP data structures. Instead, they use a toolkit specific to their programming language and operating system. The toolkit simplifies the process of making SOAP calls and processing the returned results.

Pipeline Pilot provides several integration methods so that applications residing on the workflow server, a remote server or a cluster can be executed automatically in a workflow. Pipeline Pilot also provides data integration tools that assist in the assembly of information from different formats and from different databases. A convenient and intuitive graphical user interface, accessed via a web browser, is provided for constructing and executing the workflows. The workflows are assembled using modules that are represented as icons in the graphical user interface. The workflows are stored in an XML format and can be easily exchanged between users. The modules, called components, include a variety of data readers, manipulators, calculators, data viewers, and data writers. For example, there are convenient data reading modules for ISIS files, SD-files, and SMILES, as well as delimited text and Excel spreadsheet files. Data viewers and writers include standard applications, such as WebLabViewerPro and Spotfire. An HTML molecular table viewer provides a convenient way to view tabular results with chemical structures. Although the pipelining provided by this software is generically applicable, the numerous (>200) specific components provided by SciTegic are heavily geared toward chemoinformatics environments. A free version of Pipeline Pilot is available for academic users.

2.3 Workflows and Data Pipelining

A workflow in Pipeline Pilot refers to the way a protocol is defined, usually in the form of several disconnected pipelines, each of which is made of components joined by pipes. A component refers to an individual operation to be performed on a set of data records. The order of execution depends on the order in which the components are joined, since the protocols are executed from left to right, top to bottom.

In the specific form of a workflow called data pipelining, records are passed individually down the pipes. Data pipelining allows the automation of the HTVS process and the integration of several related modeling and database packages. Thus, in addition to the orchestration of multiple workflow steps, data pipelining provides means for seamless data exchange between the individual application modules. The end users' work in HTVS projects can be greatly facilitated by reusing prepared collections of commonly used tasks in the form of workflows. These protocols can later be deployed on HPC resources in a simple and automated fashion. An advantage of the pipelining approach is the ability to capture and conveniently share workflows for better reuse.

3 IMPLEMENTATION AND USE CASES

3.1 Pipeline Pilot Modules for FlexScreen

FlexScreen was initially designed as a standalone command line application. In the first part of the work reported here we have implemented a set of Pipeline Pilot modules that are required to run FlexScreen within Pipeline Pilot. The required executables and template configuration files are placed on the Pipeline Pilot server.

The FlexScreen integration in Pipeline Pilot is depicted in Fig. 1. In pipelines 1 and 2 end users need to specify receptor and ligand database files in the standard molecular PDB format. If the user works with other molecular formats (smi, sdf, etc.), the protocol can be easily modified using molecular format converters included in the standard components collection of Pipeline Pilot. Afterwards the initial receptor and ligand files can be parameterized depending on the charge model used, hydrogen model, etc., and additional components (pH, tautomers, etc.) can also be easily included in the pipeline. Once the molecules are ready for the HTVS calculations, the docking parameters (degree of flexibility, simulation length, physical model, etc.)
and parallel calculation parameters (batch size, number of processors to use, etc.) are specified at the beginning of the third pipeline. In any case the protocol provides default parameters for all the components so that the user only needs to select ligand, receptor and binding site parameters.

One of the challenges in a virtual screening experiment is to analyze and organize the returned results. Again, an expert modeler will be familiar with tools available within a modeling environment to examine and filter the results. For an end user, the analysis and presentation must be automated so that they can correctly generate the information that they need for further decision making. Using a single PC as a server, a single user is thus able to design and run application workflows that link all available Pipeline Pilot modules with FlexScreen for HTVS.

Figure 1: Integration of FlexScreen into Pipeline Pilot workflows. Pipelines 1 and 2 read and format the ligand database and receptor files. In Pipeline 3 the input molecules are received and the docking simulation parameters are specified. Then the FlexScreen component performs the SOAP calls and runs the calculations on the HPC resources. Finally the results are processed and presented in an interactive table format.

3.2 SOAP Implementation of FlexScreen

The integration in Pipeline Pilot alone is, however, insufficient for really large in-silico screening campaigns. The improved accuracy of FlexScreen comes at the price of the computational cost of the underlying biophysical model. Therefore, we have implemented the FlexScreen Pipeline Pilot modules as a SOAP-based (Snell, et al., 2002) service capable of running on large distributed architectures, such as computing grids and clouds. We have developed SOAP-based web services for the remote FlexScreen application using software such as Apache / Tomcat (http://tomcat.apache.org) or the Perl SOAP::Lite module (http://soaplite.com). The SOAP wrapper contains sufficient processing functionality to perform the following tasks:

1. Receive a batch of ligands and a receptor file as a SOAP message and save them to a file. One of the advantages of using SOAP is that it allows a batch size to be specified, so that a series of individual docking requests can be collated into a single request for efficiency.
2. Receive complementary information as SOAP messages and save it to files, e.g., protein active site, configuration files related to simulation parameters, etc.
3. Execute FlexScreen on the server and HPC resources using the files previously created.
4. Read the resulting files and pass them back as a SOAP message to the calling component. A report on the results will be automatically prepared as an interactive HTML report, a PDF document, or a spreadsheet.

3.3 Examples of Use

Results from an HTVS calculation performed by an end user are shown in Figs. 2 and 3. As seen in Fig. 2, the resulting data is clearly organized in tables which are opened directly in the web browser after the screening calculations. The user can control the degree of detail in the final report by interacting with the "table parameters" component, and can easily reorganize and sort the final data with a few mouse clicks in the web browser. It is also possible to export the results to other standard formats, e.g., PDF, Word, Excel spreadsheets, CSV text files, etc.

From the perspective of the users' experience, we found that access to well-developed and validated workflows using FlexScreen encourages the user to test and explore new ideas. Informal discussions with users who have performed HTVS calculations with FlexScreen in this way confirm that the deployment of HTVS methods does not just get the same answers faster, but that scientists end up asking many more "what-if" questions and running many more experiments than they would have done when a modeler had to be involved in each case.

Figure 2: Sample of the output results in HTML format, directly from the web browser. HTVS results are presented in consecutive rows for the different ligands of the database. Different columns contain information about each ligand regarding name, energy calculations, RMSD, etc. Clicking on each ligand 2D representation opens a new window with detailed information about the 3D ligand binding mode as shown in Figure 3.

4 CONCLUSIONS AND FUTURE WORK

In this paper, we have described the implementation of an HTVS methodology in a science gateway environment making use of the workflow environment provided by Pipeline Pilot. The solution, based on SOAP and web services, enables the exploitation of distributed HPC resources (grid computing). The only drawback of Pipeline Pilot is its commercial license for non-academic users. We are now exploring several open source alternatives.

Figure 3: 3D representation of the HTVS results obtained for two different receptor-ligand pairs (PDB IDs 1gj4, 2bq6 and 2bqw).
Blue color denotes the experimental ligand binding mode, orange color the FlexScreen prediction without considering solvation and red color the prediction with the consideration of solvation.

ACKNOWLEDGEMENTS

This research was supported by a Marie Curie Intra European Fellowship within the 7th European Community Framework Programme (FP7 IEF INSILICODRUGDISCOVER), the Fundación Séneca (Agencia Regional de Ciencia y Tecnología, Región de Murcia) under grants 00001/CS/2007 and 15290/PI/2010 and a postdoctoral contract from the University of Murcia (30th December 2010 resolution). I. K. gratefully acknowledges continuous support and funding by the Programme "Supercomputing" of the Helmholtz Association.

REFERENCES

Catlett, C. (2002) The philosophy of TeraGrid: Building an open, extensible, distributed TeraScale facility, CCGrid 2002: 2nd IEEE/ACM International Symposium on Cluster Computing and the Grid, Proceedings, 479, 8-8.
Catlett, C.E. (2005) TeraGrid: A foundation for US cyberinfrastructure, Network and Parallel Computing, Proceedings, 3779, 11-1.
Eisenberg, D., et al. (1984) Analysis of membrane and surface protein sequences with the hydrophobic moment plot, Journal of molecular biology, 179, 125-142.
Friesner, R.A., et al. (2004) Glide: a new approach for rapid, accurate docking and scoring. 1. Method and assessment of docking accuracy, Journal of medicinal chemistry, 47, 1739-1749.
Guerrero, G., et al. (2011) Effective Parallelization of Non-bonded Interactions Kernel for Virtual Screening on GPUs. In Rocha, M., et al. (eds), 5th International Conference on Practical Applications of Computational Biology & Bioinformatics (PACBB 2011). Springer Berlin / Heidelberg, pp. 63-69.
Halgren, T.A., et al. (2004) Glide: a new approach for rapid, accurate docking and scoring. 2. Enrichment factors in database screening, Journal of medicinal chemistry, 47, 1750-1759.
Irwin, J.J. and Shoichet, B.K. (2005) ZINC--a free database of commercially available compounds for virtual screening, Journal of chemical information and modeling, 45, 177-182.
Meng, E.C., Shoichet, B.K. and Kuntz, I.D. (1992) Automated Docking with Grid-Based Energy Evaluation, J.Comp.Chem., 13, 505.
Merlitz, H., Burghardt, B. and Wenzel, W. (2003) Stochastic tunneling method for high throughput database screening. Nanotech 2003, Vol 1.
Merlitz, H. and Wenzel, W. (2002) Comparison of stochastic optimization methods for receptor-ligand docking, Chemical Physics Letters, 362, 271-277.
Merlitz, H. and Wenzel, W. (2004) High throughput in-silico screening against flexible protein receptors. In Lagana, A., et al. (eds), Computational Science and Its Applications - ICCSA 2004, Pt 3. pp. 465-472.
Morris, G.M., et al. (1996) Distributed automated docking of flexible ligands to proteins: parallel applications of AutoDock 2.4, J Comput Aided Mol Des, 10, 293-304.
Pandey, S., Voorsluys, W., Rahman, M., et al. (2009) A grid workflow environment for brain imaging analysis on distributed systems. In Chen, J. and Cafaro, M. (eds), Concurrency Computat.: Pract. Exper., 21(16), 2118-2139.
Perez-Sanchez, H. and Wenzel, W. (2011) Optimization Methods for Virtual Screening on Novel Computational Architectures, Current Computer-Aided Drug Design, 7, 44-52.
Sild, S., Maran, U., Romberg, M., Schuller, B. and Benfenati, E. (2005) OpenMolGRID: Using Automated Workflows in GRID Computing Environment. In Sloot, P., Hoekstra, A., Priol, T., Reinefeld, A. and Bubak, M. (eds), Advances in grid computing -- EGC 2005, Vol 3470. Springer Berlin / Heidelberg, pp. 464-473.
Snell, J., Tidwell, D. and Kulchenko, P. (2002) Programming Web services with SOAP. O'Reilly & Associates, Sebastopol, CA.
Wenzel, W. and Hamacher, K. (1999) Stochastic tunneling approach for global minimization of complex potential energy landscapes, Physical review letters, 82, 3003-3007.
Yang, X.Y., Bruin, R.P. and Dove, M.T. (2010) Developing an End-to-End Scientific Workflow. A Case Study Using a Comprehensive Workflow Platform in e-Science, Computing in Science & Engineering, 12, 52-61.
Yu, J. and Buyya, R. (2005) A taxonomy of scientific workflow systems for grid computing, SIGMOD Rec., 34, 44-49.
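
As an illustration of the request batching described in Section 3.2 (task 1 of the SOAP wrapper), the following minimal Python sketch groups ligand records into batches and wraps each batch, together with a receptor file, in a SOAP 1.1 envelope. It is not the implementation used in this work. The operation name RunFlexScreen, the service namespace urn:example:flexscreen and the element names BatchSize, Receptor and Ligand are hypothetical and chosen only for illustration; only the SOAP 1.1 envelope namespace is standard.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # standard SOAP 1.1 namespace
SERVICE_NS = "urn:example:flexscreen"  # hypothetical service namespace (assumption)


def make_batches(ligands, batch_size):
    """Split a list of ligand records into consecutive batches."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [ligands[i:i + batch_size] for i in range(0, len(ligands), batch_size)]


def build_envelope(receptor_pdb, ligand_batch):
    """Build one SOAP request carrying a receptor and a batch of ligands."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    # Hypothetical operation and element names, not taken from the paper.
    request = ET.SubElement(body, f"{{{SERVICE_NS}}}RunFlexScreen")
    ET.SubElement(request, f"{{{SERVICE_NS}}}BatchSize").text = str(len(ligand_batch))
    ET.SubElement(request, f"{{{SERVICE_NS}}}Receptor").text = receptor_pdb
    for name, pdb_text in ligand_batch:
        ligand = ET.SubElement(request, f"{{{SERVICE_NS}}}Ligand", {"name": name})
        ligand.text = pdb_text
    return ET.tostring(envelope, encoding="unicode")


def build_requests(receptor_pdb, ligands, batch_size):
    """Collate individual docking requests into one SOAP message per batch."""
    return [build_envelope(receptor_pdb, batch)
            for batch in make_batches(ligands, batch_size)]


if __name__ == "__main__":
    # Placeholder strings stand in for real PDB file contents.
    ligands = [(f"ligand_{i}", f"HETATM placeholder {i}") for i in range(5)]
    requests = build_requests("ATOM placeholder", ligands, batch_size=2)
    print(f"{len(requests)} SOAP requests built")
    print(requests[0])
```

Sending one message per batch rather than one per ligand reduces the number of round trips between the workflow server and the remote service, which is the efficiency argument made for the batch size parameter. In practice a SOAP toolkit would normally generate such envelopes from the service's WSDL description instead of building them by hand.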