10th International Workshop on Science Gateways (IWSG 2018), 13-15 June 2018


A Generic Framework and Methodology for Implementing Science Gateways for Analysing Molecular Docking Results

Damjan Temelkovski, Tamas Kiss, Gabor Terstyanszky
University of Westminster, London, UK
damjan.temelkovski@my.westminster.ac.uk, {t.kiss, g.z.terstyanszky}@westminster.ac.uk

This work was partially supported by the COLA (Cloud Orchestration at the level of Applications) project, Project No. 731574.

Abstract: Molecular docking and virtual screening experiments require large computational and data resources and high-level user interfaces in the form of science gateways. While science gateways supporting such experiments are relatively common, there is a clearly identified need to design and implement more complex environments for further analysis of docking results. This paper describes a generic framework and a related methodology that supports the efficient development of such environments. The framework is modular, enabling the reuse of already existing components. The methodology is agile and encourages the input and participation of end-users. A prototype implementation, based on the framework and methodology, of a science-gateway-based molecular docking environment for recommending a ligand-protein pair for the next docking experiment is also presented and evaluated.

Keywords: bioinformatics; modelling; molecular docking; science gateway; virtual screening.

I. INTRODUCTION

Molecular docking is a computational simulation that models biochemical interactions to predict where and how two molecules would bind. Large-scale molecular docking simulations are used in areas such as drug discovery, where they can decrease the number of wet-lab experiments required. Since molecular docking uses the structure of the receptor, large-scale molecular docking of hundreds of thousands of ligands and one receptor is called structure-based virtual screening (virtual as opposed to the robotics-based high throughput screening). Although a single docking simulation is relatively short, a typical virtual screening experiment, which may combine thousands of simulations, is computationally demanding and requires the use of Distributed Computing Infrastructures (DCIs). Utilising and accessing such computational resources adds an extra level of complexity to the task, making it increasingly difficult for biomedical scientists. Science gateways are widely utilised in this area to help bridge this gap.

Although this field has seen great advancements recently, feedback from biomedical scientists shows that there is still a significant gap to bridge. Examples of science gateways supporting molecular docking and virtual screening experiments include several WS-PGRADE/gUSE [1] based gateways, such as the MoSGrid Portal [2], the AutoDock Gateway [3], and the AMC Docking Gateway [4], as well as non-workflow-based pipelines such as the virtual screening environment for Windows Azure [5] and the supercomputer-based [6] or Linux cluster-based [7] virtual screening pipelines. However, there is still a need for more complex environments that enable scientists to access a wide range of computing, data and network resources for the further analysis of docking results. Such environments should support complex scenarios where intelligent support can be provided for the more efficient execution of large-scale molecular docking experiments.

This paper investigates such scenarios and proposes a generic conceptual framework to support the analysis of molecular docking results, and a related methodology that uses regular input from scientists when developing complex science-gateway-based environments for the storage, analysis and reuse of molecular docking results. It has been developed considering biomedical scientists’ requirements collected from semi-structured interviews and a literature review of 14 related projects, including those mentioned in the paragraph above. From this generic framework, specific architectures can be derived supporting various molecular-docking-related analytical scenarios, as shown in Section II. Additionally, a software development methodology that supports creating docking experiments based on this framework is explained in Section III. Finally, a prototype implementation of such a system is presented in Section IV.

II. GENERIC FRAMEWORK FOR THE ANALYSIS OF MOLECULAR DOCKING RESULTS

The aim of our research was to identify potential similarities in the work of biomedical scientists working with molecular docking experiments, and to investigate whether a generic framework for such application scenarios can be defined. The assumption was that, based on this generic framework, more specific science-gateway-based environments can be implemented supporting different application scenarios. As these scenarios are highly similar, deriving and implementing such specific environments can be sped up significantly. In other words, the aim was to formalise and
speed up the development of specific science gateway environments supporting various molecular docking scenarios.

In order to identify typical user requirements, several interviews were conducted with five scientists from different backgrounds and with various degrees of experience with molecular docking simulations. Since the number of interviewees was small and the population localised in London, this is not a representative sample of the world-wide population of scientists that use molecular docking simulations. However, considering its diversity, the sample was useful in producing several conclusions. The interviews aimed at identifying requirements of the scientists when performing molecular docking experiments and specifying scenarios that are not supported by currently available science gateways for molecular docking. These scenarios typically represent software systems that make a decision based on the molecular docking results, mimicking the steps that a scientist needs to take after obtaining the results. Some representative scenarios that were identified are listed below:

1. Suggest a ligand-protein pair that should be used in the next molecular docking, based on protein similarity and previous results.
2. Filter docking results which are suitable for wet laboratory experiments, based on ligand properties.
3. Find off-target drugs, based on deducing if the estimated binding is at an active site.
4. Enable verification of the docking methodology and learning from previous docking for novice users.
5. Compare results from different molecular docking tools.

Based on the conceptual similarities of these scenarios and an extended review of literature, a generic framework has been designed. The design focuses on the similar elements in the scenarios and includes the following components (see Figure 1):

Molecular Docking Environment (MDE): All scenarios include an environment where the molecular docking simulation is executed. It could range from something as simple as running a single simulation from the command line on a local computer to something more complex, such as executing a virtual screening experiment on a DCI. This environment includes the software tool used for the docking itself, and may also include additional elements to connect to a DCI or to provide a high-level user interface.

Molecular Docking Results Repository (MDRR): After the execution of the molecular docking, the results need to be stored, because previous molecular docking results are needed by various scenarios. The repository should also store information about the final decision made by the whole simulation environment.

Additional Tool (AT): The results which have been stored in the MDRR are then processed by an AT. This is a generic element that describes a tool which takes one or more molecular docking results as input and conducts a calculation. ATs can refer back to other molecular docking results stored in the MDRR, communicate with other ATs, or refer to data stored in an Additional Data Source.

Additional Data Source (ADS): It contains data that is relevant for the final decision and is usually an external database.

Decision Maker (DM): All the information processed from the various ATs is passed to a DM. This element groups and analyses the calculations performed by the ATs in order to make a decision.

The numbers in Figure 1 indicate the order of events through the different elements:

1. A scientist uses an MDE to conduct the molecular docking and the result is uploaded to the MDRR.
2. The MDRR sends the results to one or more ATs.
3. An AT may communicate with one or more other ATs.
4. An AT may look up data stored in the ADS.
5. An AT may require additional previous molecular docking results as input for its calculation.
6. An AT would provide its calculation results to the DM.
7. The MDRR may use data from the ADS directly.
8. Previous results from the MDRR may be used by the DM.
9. The DM may use data from the ADS directly.
10. Once the analysis is complete and the decision is made, it can be passed back to the MDRR.
11. Finally, the decision is passed to the MDE to visualise it.

Figure 1 – Basic diagram of the Generic Framework

From this generic framework, each specific scenario introduced earlier, and also the ones covered in the literature review, can be derived. For illustration, a basic architecture diagram for the first scenario is shown in Figure 2. Similar figures for each scenario have been designed and analysed, demonstrating that the framework is generic enough to support at least the five identified scenarios and the 14 related solutions covered in the literature. However, these figures are not presented here due to limitations in the length of the paper.
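The flow of events through the generic framework can be illustrated with a short Python sketch. This is only an illustration under stated assumptions, not the authors' implementation: the names `DockingResult`, `MDRR`, `run_analysis`, `same_receptor_results` and `best_ligand` are hypothetical, the repository is an in-memory list rather than a database, and the optional steps 3 (AT-to-AT communication) and 7 (MDRR reading the ADS directly) are not modelled.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional


@dataclass
class DockingResult:
    """Hypothetical record for one docking result (field names are assumptions)."""
    ligand: str
    receptor: str
    affinity: float  # assumed convention: lower (more negative) is better


@dataclass
class MDRR:
    """In-memory stand-in for the Molecular Docking Results Repository."""
    results: List[DockingResult] = field(default_factory=list)
    decisions: List[Any] = field(default_factory=list)

    def upload(self, result: DockingResult) -> None:
        self.results.append(result)

    def previous_results(self) -> List[DockingResult]:
        return list(self.results)

    def store_decision(self, decision: Any) -> None:
        self.decisions.append(decision)


# An Additional Tool (AT) takes the new result, previous results and the ADS.
AdditionalTool = Callable[[DockingResult, List[DockingResult], Dict], Any]
# A Decision Maker (DM) takes the AT outputs, previous results and the ADS.
DecisionMaker = Callable[[Dict[str, Any], List[DockingResult], Dict], Any]


def run_analysis(new_result: DockingResult, mdrr: MDRR,
                 tools: Dict[str, AdditionalTool], decision_maker: DecisionMaker,
                 ads: Optional[Dict] = None) -> Any:
    """Walk through steps 1, 2, 4-6 and 8-11 of the generic flow."""
    ads = ads or {}
    mdrr.upload(new_result)                        # step 1: MDE -> MDRR
    previous = mdrr.previous_results()
    outputs = {name: tool(new_result, previous, ads)   # steps 2, 4, 5: MDRR/ADS -> ATs
               for name, tool in tools.items()}
    decision = decision_maker(outputs, previous, ads)  # steps 6, 8, 9: inputs to the DM
    mdrr.store_decision(decision)                  # step 10: decision back to the MDRR
    return decision                                # step 11: returned for display in the MDE


# Toy AT: collect stored results that used the same receptor as the new result.
def same_receptor_results(result, previous, ads):
    return [r for r in previous if r.receptor == result.receptor]


# Toy DM: pick the ligand with the best (lowest) affinity among those results.
def best_ligand(outputs, previous, ads):
    return min(outputs["same_receptor"], key=lambda r: r.affinity).ligand


mdrr = MDRR()
mdrr.upload(DockingResult("ligA", "recX", -7.1))
mdrr.upload(DockingResult("ligB", "recY", -9.4))
decision = run_analysis(DockingResult("ligC", "recX", -8.3), mdrr,
                        {"same_receptor": same_receptor_results}, best_ligand)
print(decision)
```

Because the ATs and the DM are passed in as plain callables, swapping one tool for another does not change `run_analysis`, which mirrors the modular replacement of building blocks that the framework is designed to allow.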


In Scenario 1 (Figure 2) the framework would analyse previous molecular docking results and look for good docking results that have used a receptor similar to the currently used receptor. Based on this, the system suggests a new protein-ligand pair that would be an interesting candidate for docking. Two key issues here are the definitions of a good docking result and of a similar receptor.

In Figure 2 the building blocks of the Generic Framework have been replaced with concrete elements supporting this particular scenario. One of the advantages of this modular design is that these building blocks can be easily replaced with other elements if necessary. This way multiple existing tools can be integrated into the scenario design and evaluated, requiring only the implementation of components that are not currently available. The mapping of the generic framework for this particular scenario in the presented example is as follows:

The MDE is an extended version of the popular Raccoon2 [8] desktop application, a virtual screening environment. The WS-PGRADE/gUSE science gateway framework was integrated with Raccoon2 to support large-scale experiments on heterogeneous cloud computing resources, as presented in [9]. The MDRR is a custom-made repository based on a MongoDB database. Three ATs are utilised in this scenario. The structural alignment tool DeepAlign [10] is used to calculate similarities between receptors. A custom-made AT is used to assess whether the structural alignment result means that the two receptors are similar, while another custom-made AT is required to assess a docking result and categorise it as good. Finally, a custom-made DM is needed to suggest which protein-ligand pair to dock next.

Figure 2 – Basic diagram of scenario to suggest a ligand-protein pair for next docking (Scenario 1)

The flow of events is shown in Figure 2. Raccoon2 executes the molecular docking and the results are uploaded to the MDRR (1). The MDRR sends the receptor pairs to DeepAlign (2). The results of DeepAlign are assessed by the custom-made AT (3) that sends the results to the MDRR (4) and the DM (5). All past docking results of similar receptors are sent to be assessed (6) and the good results are sent to the DM. The DM combines the results from the ATs, and suggests which protein-ligand pair to dock as a next step. This suggestion is returned to the MDRR and stored as meta-data (8). Finally, it is presented to the user (9).

Based on the basic generic architecture of Figure 1, a more detailed framework has been developed that consists of a diagram, a textual description of elements and interfaces, and a formal description using Z-notation [11]. The aim of this framework is to describe, in a formalised way, the generic architecture and the way the specific scenarios are derived from it. Based on this formalism we aim to support application developers in making specific decisions when evaluating and implementing these scenarios. The designed framework is independent of the actual implementation and, indeed, of the programming language of choice.

The diagram representing the framework in Figure 3 is a generic model, showing all generic elements and all possible interfaces between them. It is based on the UML Component Diagram in the sense that the elements are drawn as components and the interfaces between them are the typical provided and required interface connections. Additionally, it features arrows pointing in the direction of the flow of data in a particular interface.

Figure 3 – Generic Framework diagram

The framework features 13 interface types between its elements. As a next step, each of these interfaces has been identified and described. For example:

1. User → MDE, provided by the MDE: allows the user to upload the correct input for the molecular docking or additional user input values needed by another element.
2. MDE → user, provided by the MDE: displays the result of the molecular docking to the user, along with other results from the MDRR.

Following this, each element and each interface has been described formally using Z-notation. As the set of descriptions is too extensive for this paper, only a representative example is
presented here, describing the MDE and its interfaces (see Figures 4 and 5).

The docking process expressed by the MDE needs a ligand, a receptor, and optionally configuration (config) files as input, and provides a docking result file as output. When there is no config file, the dockingWithoutConfig() function generates the docking result; when there is a config file, the dockingWithConfig() function does so. Furthermore, the Z-notation for dockingWithoutConfig() describes that for every ligand × receptor pair, as long as the ligand and receptor are not empty files, there exists a docking result. Similarly, dockingWithConfig() defines that for each ligand and for each receptor there exists a configuration file that can be used to produce a docking result. The corresponding Z-notation descriptions can be seen in Figure 4.

Figure 4 – MDE Described in Z-notation

Figure 5 models the MDE and its interfaces for the three types of input files. This schema explains that the ligand, receptor, and config files are input, while the docking results as well as data about the date are produced as output. The lower part of Figure 5 describes the interface that enables users to view results, as long as the results exist.

Figure 5 – Interfaces of the MDE described with Z-notation

Based on the above detailed description of the generic framework, a detailed architecture diagram of each scenario can now be derived, followed by the textual and formal descriptions of these scenarios. Figure 6 shows part of the detailed architecture diagram of Scenario 1, representing the extended Raccoon2 as an MDE, and corresponding to that part of the basic diagram of Figure 2. In Figure 7 the formal description of this module is shown. (Please note that the full diagram and description are not provided due to length limitations, but have been produced.)

Figure 6 – Extract of the detailed architecture diagram of Scenario 1

Figure 7 – Extract of the formal description of Scenario 1

III. METHODOLOGY FOR DEVELOPING ENVIRONMENTS FOR THE ANALYSIS OF MOLECULAR DOCKING RESULTS

This section describes the methodology for developing complex environments that reuse and analyse molecular docking results. This methodology complements the framework described in the previous section by explaining how this framework can be used during development. It clearly states the roles that are required and the specific sub-projects for which they need to collaborate. The methodology is based on the seven principles identified by Cockburn [12].

Based on Cockburn’s general recommendations, a role-deliverable-milestone diagram has been created to represent the methodology (Figure 8). This diagram illustrates that the modeller, biomedical scientist and bioinformatician should collaborate when creating the diagram and textual description of the scenario. Furthermore, the modeller should collaborate with the bioinformatician and the software developer when creating the formal description. Key components of this diagram, extensions to Cockburn's original model, are the dotted lines which show that the process is agile. For instance, in the top section where the life scientist works on the textual description and goes from milestone M4 to M5, there is a dotted line showing that (s)he could revisit and alter the diagram if necessary. The same logic is used for the agile development of the final system code. Figure 8 presents a high-level role-deliverable-milestone diagram where the coding section has an asterisk (*) indicating that a similar but more detailed description of this section (not presented in this paper) has also been developed in the form of a lower-level diagram.

Figure 8 – Role-deliverable-milestone diagram of the developed methodology

IV. IMPLEMENTATION OF THE SELECTED SCENARIO

In order to demonstrate how the developed framework and methodology support implementing molecular docking science gateways, an implementation of Scenario 1 (https://github.com/damjanmk/mdrr-scenarios) is presented here. All components in the implementation are accessible via a basic RESTful API. We used Bottle [13], a minimalist web framework which enables easy server setup. The MDRR and the DM have been deployed on Server 1, the DeepAlign AT and the AT to assess the DeepAlign results on Server 2, and the docking assessment AT on Server 3 (Figure 9). In order to insert results from Raccoon2, the MDRR on Server 1 expects zip files as POST parameters. It parses them and inserts information into MongoDB, which includes the collections receptors, ligands, results, and analysis. Another request is sent to continue with Scenario 1, where the MDRR selects all receptors from the database, parses and compresses them. Next, these are sent to Server 2 along with the target receptor (the receptor used in the original simulation) and a threshold value (input by the user in Raccoon2). The first AT on Server 2 executes DeepAlign to find similarities between the target receptor and each of the other receptors it received. It then calls the AT AssessDeepAlign, located on the same server, in order to select the similar receptors. In its simplest form, this AT assesses the DeepAlign results by comparing the value of DeepScore to a user input threshold. A list of these similar receptors is returned to Server 1, where the analysis collection is updated to keep track of the events so far. Then, the MDRR selects past docking results which have used one of the similar receptors, and compresses them. It sends a request to Server 3, including a threshold value of the AutoDock Vina affinity, entered by the user within Raccoon2.

Figure 9 – Architecture of implementation of Scenario 1

The AT on Server 3 searches through the Vina results for a result that has at least one model where the Vina affinity is less than the threshold, and calls this a good docking result (a Vina docking result can contain, for example, 10 models). It returns a list of good docking results to Server 1.

Upon receiving this, Server 1 inserts a document in the analysis collection before initialising the DM and sending it the similar receptors and the good results. The DM combines these two lists into one and sorts it first by the DeepScore value, then by the affinity. This enables users to view an ordered list of results that contain ligands which are suggested for a subsequent docking.

A. Designing the MongoDB database

At the core of this custom-made MDRR is a MongoDB database. There were several reasons why we chose this type of non-relational database:

1. MongoDB’s schema-less design is ideal because a single collection can be used for: input files in different formats, output files of any of the over 50 docking tools [14], or meta-data about different ATs.
2. MongoDB scales very well for large amounts of data, provided it is well designed and features such as sharding and indexing are utilised.
3. MongoDB is well-suited for prototyping because it is easier to change what is stored during development.

In this prototype implementation we have considered .pdbqt molecules and AutoDock Vina results (as used by Raccoon2).

The ligands collection contains molecular properties calculated using the OpenBabel and PyBel [15] Python modules, such as canonical_SMILES, logP, mol_weight, etc. Biomedical scientists at the University of Westminster were consulted when deciding which properties to store. Both the ligands and receptors collections include the full parsed 3D structure from the .pdbqt files. Each line of the .pdbqt file is stored as an element of an array. The structure of each molecule should be unique. However, the structure itself cannot be uniquely indexed due to size limitations, so we have introduced structure_id, an MD5 hash of the structure. In practice this uniquely identifies the structure and allows for a MongoDB index to be created.

The results collection contains references to the ligand and receptor used, specific properties extracted from the result files (e.g. CPUs, random_seed), and a list of the result models, each model containing affinity, rmsd_from_best, and the parsed model segment of the Vina result. The parsing process is simple: it stores all lines between MODEL and ENDMDL as elements of an array.

B. Use of the framework and methodology

The framework was followed as described in Section II. A list of documented meetings and events is not presented with this paper, but serves as supporting evidence of following the methodology. The required roles were taken up by different researchers at the University of Westminster (with some taking on multiple roles). The presented implementation shows that, by following the methodology, such a molecular docking framework can be implemented. Work is currently ongoing to quantify the advantages when compared to a more ad-hoc implementation.

C. Limitations of the prototype implementation

Due to the Global Interpreter Lock (GIL), Python is not the optimal language for multi-threading without additional optimisations. Furthermore, Bottle uses a non-threading server by default, so using a different, specialised server would improve performance for simultaneous users. The number of items in the collections may become too big to be included in the single zip file that is used to transfer data between servers, and sending large files through the network could be a bottleneck. Finally, the current DM joins and sorts two lists without specific performance optimisations.

V. CONCLUSION AND FUTURE WORK

This paper presented a generic framework and a corresponding methodology to implement complex science-gateway-based environments for the execution of molecular docking experiments extended with the intelligent analysis and utilisation of docking results. The framework incorporates a diagram, and textual and formal descriptions, enabling a modular design and the replacement and reuse of components. The methodology involves multiple stakeholders and requires their collaboration in an agile manner. In order to demonstrate the usability of the above, a scenario for suggesting a ligand-protein pair for the next docking was also presented.

Future work includes the implementation and detailed evaluation of multiple scenarios to identify, and where possible quantify, the advantages provided by the framework and methodology. In order to achieve this, the implemented solutions will be compared to state-of-the-art methods and environments to demonstrate the added value of our research.

REFERENCES

[1] P. Kacsuk et al., “WS-PGRADE/gUSE generic DCI gateway framework for a large variety of user communities”, J. Grid Comput., vol. 10, no. 4, pp. 601-630, Dec. 2012.
[2] J. Krüger et al., “The MoSGrid Science Gateway – A Complete Solution for Molecular Simulations”, J. Chem. Theory Comput., vol. 10, no. 6, pp. 2232-2245, May 2014.
[3] Z. Farkas et al., “AutoDock gateway for user friendly execution of molecular docking simulations in cloud systems”, in Cloud Computing with E-science Applications, O. Terzo and L. Mossucca, Eds. Boca Raton, FL: CRC Press/Taylor & Francis, 2015, pp. 217-236.
[4] M. Jaghoori et al., “A multi-infrastructure gateway for virtual drug screening”, Concurr. Comp. Pract. E., vol. 27, no. 16, pp. 4478-4490, Nov. 2015.
[5] T. Kiss et al., “Large-scale virtual screening experiments on Windows Azure-based cloud resources”, Concurr. Comp. Pract. E., vol. 26, no. 10, pp. 1760-1770, Jul. 2014.
[6] X. Zhang, S. E. Wong, and F. C. Lightstone, “Toward fully automated high performance computing drug discovery: a massively parallel virtual screening pipeline for docking and molecular mechanics/generalized born surface area rescoring to improve enrichment”, J. Chem. Inf. Model., vol. 54, no. 1, pp. 324-337, 2014.
[7] P. D’Ursi et al., “Virtual screening pipeline and ligand modelling for H5N1 neuraminidase”, Biochem. Biophys. Res. Commun., vol. 383, no. 4, pp. 445-449, Jun. 2009.
[8] S. Forli et al., “Computational protein-ligand docking and virtual drug screening with the AutoDock suite”, Nat. Protoc., vol. 11, no. 5, pp. 905-919, Apr. 2016.
[9] D. Temelkovski, T. Kiss, and G. Terstyanszky, “Molecular docking with Raccoon2 on clouds: extending desktop applications with cloud computing”, in 9th International Workshop on Science Gateways, Poznań, Poland, 2017.
[10] S. Wang, J. Ma, J. Peng, and J. Xu, “Protein structure alignment beyond spatial proximity”, Sci. Rep., vol. 3, no. 1448, Mar. 2013.
[11] J. M. Spivey, The Z Notation - A Reference Manual, 2nd ed. Oxford, UK: Oriel College, 1998.
[12] A. Cockburn, Agile Software Development: The Cooperative Game, 2nd ed. Boston, MA: Addison Wesley, 2006.
[13] M. Hellkamp. Bottle: Python Web Framework [Online]. Available: https://bottlepy.org/docs/dev/. [Accessed: 6 Mar 2018]
[14] Swiss Institute of Bioinformatics. Directory of in-silico Drug Design tools - Docking [Online]. Available: https://www.click2drug.org/directory_Docking.html. [Accessed: 6 Mar 2018]
[15] N. M. O'Boyle, C. Morley, and G. R. Hutchison, “Pybel: a Python wrapper for the OpenBabel cheminformatics toolkit”, Chem. Cent. J., vol. 2, no. 5, Mar. 2008.