=Paper= {{Paper |id=Vol-2363/paper6 |storemode=property |title=Efficient Mass Spectra Prediction through Container Orchestration with a Scientific Workflow |pdfUrl=https://ceur-ws.org/Vol-2363/paper6.pdf |volume=Vol-2363 |dblpUrl=https://dblp.org/rec/conf/iwsg/HanussekB0K17 }} ==Efficient Mass Spectra Prediction through Container Orchestration with a Scientific Workflow== https://ceur-ws.org/Vol-2363/paper6.pdf
                         9th International Workshop on Science Gateways (IWSG 2017), 19-21 June 2017



          Efficient Mass Spectra Prediction
                       through
  Container Orchestration with a Scientific Workflow


     Maximilian Hanussek1,2,3, Felix Bartusch1,2,3, Jens Krüger1*, Oliver Kohlbacher2,3,4,5

     1 High-Performance and Cloud Computing Group, Zentrum für Datenverarbeitung, University of Tübingen, Tübingen, Germany
     2 Center for Bioinformatics, 3 Dept. of Computer Science, 4 Quantitative Biology Center, University of Tübingen, Tübingen, Germany
     5 Biomolecular Interactions, Max Planck Institute for Developmental Biology, Tübingen, Germany
     * jens.krueger@uni-tuebingen.de


    Abstract—The mass spectroscopic fragmentation of small              Most installation and computing environment problems can
molecules such as metabolites can be simulated with QCEIMS. In      be solved by providing a container for any desired tool. Almost
this paper we present our work dealing with the containerization    every required program can be wrapped into such a container,
of the complex and interdependent software stack. The               which saves time for installation and does not require special
simulation protocol has been mapped to a UNICORE workflow           permissions. The container is a self-contained environment, no
enabling convenient access to powerful computing resources. To      matter on which system it runs. This fact leads to a good
offer a maximum of convenience to the users a simple portal was     reproducibility of already achieved results and is especially
deployed hiding the complexity of technical details.                important in the natural sciences. Using Docker for distributing
                                                                    software stacks could be one approach to solve installation and
    Keywords—containerization; workflows; reproducibility; science
gateway; mass spectrometry; quantum mechanics
                                                                    computing environment problems. However, Docker alone does
                                                                    not make a complex tool easier to operate.
                       I. INTRODUCTION
                                                                        Another technology that can increase user-friendliness is the
    In the natural sciences, there are many software
                                                                    workflow, which represents a specific scientific protocol. By
applications which are commonly used, but which are not easy
                                                                    now, many workflow platforms are available, such as KNIME
to install or to apply. Installation problems can originate from
                                                                    [9], TAVERNA [10], Pipeline Pilot
special computing environments being required or the number         [11], [12], Galaxy [13], [14], and UNICORE [15]. The
of interdependent additional software packages that need to be      Uniform Interface to Computing Resources (UNICORE) is a
installed. Furthermore, many programs require the usage of          mature so-called middleware solution to create workflows and
command line interaction by the user. This knowledge is not         in addition, get access to distributed computing resources.
always present and should not be a prerequisite. But over time,
                                                                    UNICORE is used in many research fields and settings from
technologies have emerged that allow an easy installation and
                                                                    small projects up to large transnational projects like MoSGrid
operation of complex tools [1].
                                                                    [16], [17], the European PRACE infrastructure [18], the US
    One particular technology that gained popularity in recent      XSEDE Initiative [19] or the Human Brain Project [20]. An
years is container virtualization. Representatives of container     advantage of UNICORE is that it provides access to high-
virtualization methods based on the Linux system are Linux-         performance computing (HPC) clusters and file systems and
VServer [2], Docker [3], OpenVZ [4], Linux Container (LXC)          offers the possibility to generate workflows suitable for HPC
[5] and Singularity [6]. Among all these representatives,           environments.
Docker is the most prominent. Docker and its container-
                                                                       The interaction between Docker and UNICORE makes it
technology are a lightweight alternative to full virtual
                                                                    possible to simplify both the installation process and the use of
machines. Since the virtualization is running on the host OS, it
                                                                    complex tools. In the following sections, Docker and
is possible to run multiple applications in parallel without
                                                                    UNICORE are explained in more detail.
establishing a new kernel for each application, which makes
the container-based technique more lightweight than a
hypervisor-based approach [7], [8].


                             II. DOCKER                                      as well as a workflow editor, a web interface (UNICORE-
There are two major concepts in the field of software                        portal) and a more advanced graphical user interface, the
                                                                             UNICORE rich client (URC). One advantage that UNICORE
virtualization: container-based virtualization and hypervisor-
                                                                             offers is that it is designed to abstract computing resource
based virtualization. Both multilayered approaches are                       specific details and thereby simplifies the user experience.
illustrated in Figure 1. Examples of hypervisor-based                        Furthermore, it is extensible due to the use of standardized
virtualization software are VMware [21] and Xen [22]. A major                APIs, which for example makes it possible to run KNIME
aspect is that hypervisor-based virtualization establishes a                 nodes on an HPC cluster via UNICORE [24]. No special operating
full virtual machine on top of the host operating system. Such               system is required as UNICORE is completely written in the
a virtual machine has its own operating system (Guest OS)                    platform independent programming languages Java and
and its own kernel. This virtualization technology provides                  Python. Another important aspect is the need to meet a
virtualization at the hardware level.                                        certain security standard to prevent the loss of sensitive data. Due
                                                                             to different security and authorization methods offered by
                                                                             UNICORE, the connection between client and server is
                                                                             considered to be safe [15].




Fig. 1. Container-based approach including applications and the necessary
binaries and libraries building up on the Docker engine (left). Hypervisor
based approach with an additional guest OS on top of the hypervisor layer
(right).


    In contrast to hypervisor-based virtualization, Docker
establishes a virtualization based on the host OS. The virtual
environments, usually called containers, run directly on the
host kernel [8]. It is possible to create custom Docker images,
which serve as templates for Docker containers. An image is
created from a so-called Dockerfile, a plain text file that
specifies how the image is built and how containers started
from it are run. Docker images are built upon a base image,
which can be any operating system that suits the host OS, on a
Linux system for example Ubuntu or CentOS. Images consist
of a series of data layers on top of the base image. Notably,                Fig. 2. Overview of the different UNICORE components and their interaction
many containers can be started from a single image; each                     with each other [15].
container does not need its own image. Existing images can
serve as a new base image and be extended further [7]. This is                   The UNICORE architecture consists of five layers (user-
done by adding a new data layer, which is more efficient than                layer, gateway-layer, UNICORE/X-layer, TSI-layer, resource-
rebuilding the whole image from scratch. To work with these                  layer). The user-layer provides end-user clients and
layers, Docker uses a union file system, one of its underlying               applications but also other UNICORE servers and web portals.
techniques, to merge them into a single, consistent file                     The gateway-layer serves mostly as a firewall traversal point
system. A Docker container provides a virtual environment                    and forwards information such as IP addresses or SSL
for its contained applications by leveraging two Linux kernel                certificates via the connecting client to the following servers.
features: control groups (cgroups) for resource accounting of                The UNICORE/X is the central component of UNICORE. It
the container's processes and namespaces for providing                       receives the client requests, which have been submitted via the
isolated instances of host resources [8].                                    gateway, authenticates the request, checks the authorization
                                                                             and finally invokes the appropriate service. The Target
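
The layering of Docker images described in Section II can be illustrated with a small toy model. This is only a sketch and not Docker's actual storage driver: each layer is modelled as a Python dictionary from file path to content, the paths and contents are hypothetical, and the use of None as a deletion marker is an assumption made for the example.

```python
from collections import ChainMap

# Each layer maps file paths to contents (all entries are hypothetical).
# None marks a file that an upper layer deletes.
base_layer = {"/etc/os-release": "Ubuntu", "/usr/bin/python": "python-2.7"}
tools_layer = {"/opt/qceims/qceims": "binary", "/usr/bin/python": "python-2.7.10"}
cleanup_layer = {"/etc/os-release": None}


def union_view(*layers):
    """Merge layers bottom-to-top into one consistent file-system view."""
    # ChainMap searches its maps left to right, so the topmost layer goes first.
    merged = ChainMap(*reversed(layers))
    return {path: data for path, data in merged.items() if data is not None}


# Two "images" share the same lower layers; nothing is copied or rebuilt.
image = union_view(base_layer, tools_layer)
extended_image = union_view(base_layer, tools_layer, cleanup_layer)
```

In this model, extending an image only adds one more dictionary on top, which mirrors why adding a data layer is cheaper than rebuilding the whole image.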
                                                                             System Interface (TSI) is connected with the local operating
                       III. UNICORE                                          system, file system and usually a batch system for the resource
                                                                             management. The tasks of the TSI are, for example, to submit
    The UNICORE software package is developed at the
                                                                             the jobs sent by the client, check the status of the jobs, or
research center in Jülich and by further partners [15], [23]. It
                                                                             perform the I/O operations. A schematic illustration of the
provides different components for handling HPC environments
                                                                             different layers is shown in Figure 2.
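
The path of a request through the layers described above can be sketched as follows. This is a toy model and does not use any UNICORE API: the class Request, the table ALLOWED and all function names are hypothetical and only mirror the order of steps given in the text (gateway, then authentication and authorization in UNICORE/X, then the TSI).

```python
from dataclasses import dataclass


@dataclass
class Request:
    user: str
    certificate_valid: bool
    service: str


# Hypothetical authorization table; not part of UNICORE.
ALLOWED = {"alice": {"submit_job", "job_status"}}


def gateway(request):
    """Gateway layer: single entry point that forwards the request."""
    return request


def unicore_x(request, tsi):
    """UNICORE/X layer: authenticate, authorize, then invoke the service."""
    if not request.certificate_valid:
        raise PermissionError("authentication failed")
    if request.service not in ALLOWED.get(request.user, set()):
        raise PermissionError("not authorized")
    return tsi(request.service)


def target_system_interface(service):
    """TSI layer: stands in for the batch system of the resource layer."""
    return f"{service} handed to batch system"


def handle(request):
    """Pass a request from the user layer down to the resource layer."""
    return unicore_x(gateway(request), target_system_interface)
```

A call such as handle(Request("alice", True, "submit_job")) passes all checks, whereas an invalid certificate or an unknown user is rejected before the TSI is reached.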


                             IV. QCEIMS                                           UNICORE workflow. Furthermore, Docker is used to encapsulate
    The fragmentation of small molecules as it occurs within a                    and execute all QCEIMS calculations in Docker containers.
mass spectrometry experiment can be simulated with quantum                        The created Docker image contains all necessary software to
chemical simulations. The method called QCEIMS (Quantum                           execute the QCEIMS calculations. These tools are listed in
Chemistry Electron Ionization Mass Spectrometry) developed                        Table 1. Only the UNICORE-specific software components and
by Grimme et al. [25]–[27] initially creates a trajectory for the                 MOPAC (because of its required license) are not included.
molecule of interest and extracts a set of starting conformers
for further calculations. Each ionized conformer gets
fragmented at high temperature resembling the conditions
within a mass spectrometer. The resulting fragmentation
distribution over several hundred individual fragmentation runs
resembles a mass spectrum and can be compared to
experimental data. Such simulated spectra may be used in
metabolomics to facilitate the identification of compounds.

                                                                                  Tab. 1. Software included in QCEIMS Docker image.

                                                                                  Program                             Version
                                                                                  Python                              2.7.10
                                                                                  R (with Sweave)                     3.2.3
                                                                                  QCEIMS                              2.26I
                                                                                  MNDO99                              7.0
                                                                                  DFTB+                               1.2.2
                                                                                  ORCA                                3.0.3
                                                                                  InChI version 1                     1.04
                                                                                  PubChemPy                           1.0.3
                                                                                  Tex Live                            2016
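
How a workflow step might execute a tool from such an image can be sketched with the standard Docker command line. The sketch below only builds the command; the image name qceims:example, the directory job_0001 and the executable name qceims are assumptions for illustration and are not taken from the presented workflow.

```python
import shlex
from pathlib import Path


def docker_run_command(image, job_dir, command):
    """Build a `docker run` call that executes `command` inside `job_dir`."""
    job_dir = Path(job_dir).resolve()
    return [
        "docker", "run",
        "--rm",                    # remove the container once it exits
        "-v", f"{job_dir}:/work",  # bind-mount the job directory
        "-w", "/work",             # use it as the working directory
        image,
        *command,
    ]


# Hypothetical image, directory and executable names.
cmd = docker_run_command("qceims:example", "job_0001", ["qceims"])
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) in a job script. Mounting the job directory keeps all input and output files on the host, so the container itself stays disposable.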



                                                                                  B. Workflow description
                                                                                      The implemented workflow accepts structure data files
                                                                                  (.sdf) as input, which can contain the structure of one molecule
                                                                                  or more. Due to the UNICORE characteristic that every job is
                                                                                  executed in a single directory, with no subdirectories, it is
                                                                                  necessary to encapsulate each molecular structure in its own
                                                                                  job directory. This is achieved by using the for-each loop
                                                                                  concept of the UNICORE workflow editor and represented as
                                                                                  the outer for loop in Figure 3. After the necessary format
                                                                                  conversion from the .sdf file format into the .tmol format an
                                                                                  open shell check is performed with MOPAC. If the molecule is
                                                                                  not an open shell molecule the configuration file containing the
Fig. 3. The workflow for the prediction of mass spectra based on quantum          QCEIMS parameters is automatically generated. After these
chemical fragmentation calculations is shown. It takes advantage of multiple
control structures to efficiently process even larger numbers of molecules. The   preparation steps, the quantum chemical calculations are
different nesting levels are highlighted by the distinct colors.                  started for the first time. All necessary programs for this step
                                                                                  are already installed in a Docker container. After the
                                                                                  calculations have finished the second encapsulation with a
                                                                                  second for-each loop for the fragmentation calculations is
             V. WORKFLOW AND SCIENCE GATEWAY
                                                                                  performed. This is necessary due to the structure of the
                                                                                  QCEIMS tool. QCEIMS assumes that the files, required for the
A. Overview
                                                                                  subsequent calculations, are available in separate directories
    The first application using both Docker and UNICORE                           and can be processed within these directories. But this
together is the UNICORE QCEIMS workflow for mass spectra                          characteristic does not fit the workflow implementation of
prediction. The implemented UNICORE workflow embeds the                           UNICORE. Further, QCEIMS uses the same file names in the
QCEIMS tool [25]. QCEIMS is very well suited for execution                        subdirectories, which is not a problem if the files remain
on HPC clusters, as it makes heavy use of quantum chemistry                       separate, but this is not the case for UNICORE. If the different
programs that require high computing power and because                            files were not encapsulated into single jobs with the for-
the QCEIMS calculations are easy to parallelize. Furthermore,                     each loop construct, it would not be possible to distinguish
installation and handling are quite complex. An overview of                       them later. After the fragmentation calculation, it is necessary
the workflow is shown in Figure 3. UNICORE is used to                             to merge the single spectra of each fragmentation into the final
encapsulate the whole mass spectra prediction process into a


simulated spectrum. In the end, the generated results are                   with their experimental spectra [28]. Every molecule of the
automatically integrated into a report and stored in a database.            small test set we used showed a score close to 0.6 or
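
The encapsulation of each molecule in its own job directory and the final merging of the single spectra can be sketched as follows. This is a minimal illustration and not the UNICORE implementation: the directory and file names are hypothetical, and merging by summing intensities and scaling the base peak to 100 is an assumption, since the text does not specify the merging rule. The only fact relied on is that records in an SD file are terminated by a line containing $$$$.

```python
from collections import defaultdict
from pathlib import Path


def split_sdf(sdf_text):
    """Split a multi-molecule SDF string into one record per molecule."""
    records, current = [], []
    for line in sdf_text.splitlines():
        current.append(line)
        if line.strip() == "$$$$":  # SDF record terminator
            records.append("\n".join(current) + "\n")
            current = []
    return records  # trailing lines without a terminator are ignored


def write_job_dirs(records, root):
    """Give every molecule its own flat job directory (names are hypothetical)."""
    job_dirs = []
    for index, record in enumerate(records, start=1):
        job_dir = Path(root) / f"molecule_{index:04d}"
        job_dir.mkdir(parents=True, exist_ok=True)
        (job_dir / "input.sdf").write_text(record)
        job_dirs.append(job_dir)
    return job_dirs


def merge_spectra(spectra):
    """Sum per-run spectra (m/z -> intensity) and scale the base peak to 100."""
    total = defaultdict(float)
    for spectrum in spectra:
        for mz, intensity in spectrum.items():
            total[mz] += intensity
    base_peak = max(total.values(), default=0.0)
    if base_peak == 0.0:
        return {}
    return {mz: 100.0 * value / base_peak for mz, value in sorted(total.items())}
```

Because every record ends up in a separate directory, identical file names produced by different fragmentation runs can no longer collide.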
                                                                            higher. The similarity represented by this score is high enough
C. Workflow execution through web interface                                 to find the simulated compounds among the top 3 hits if
    The implemented QCEIMS workflow can be used by                          matched against a mass spectral database. In most cases
importing it into the UNICORE rich client (URC) which is                    it will be the first hit. When comparing Docker containers with
recommended for users who are already familiar with the                     bare-metal execution no differences in run time were
UNICORE environment. With the URC it is also possible to                    observed. Due to the use of the for-each loop construct,
export the implemented workflow as an .xml file and use it in               provided by the UNICORE workflow, it is possible to
the UNICORE portal if the user wants to modify the workflow.                sequentially parallelize the computation of each fragmentation
Another option to execute the workflow is to use the                        step. Furthermore, each quantum chemical simulation can be
UNICORE portal. The UNICORE portal is a further                             distributed over more than a single CPU but this would not be
component of the UNICORE tool set and works seamlessly                      efficient.
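
The comparison of a simulated with an experimental spectrum can be sketched as follows. The exact formula behind the reported score is not given in the text; the sketch assumes one common form of an absolute value distance similarity, 1 / (1 + sum of absolute intensity differences), applied to spectra that are each normalized to a total intensity of 1. Both the normalization and the function name are assumptions.

```python
def absolute_value_similarity(simulated, experimental):
    """Similarity in [1/3, 1]; 1.0 means identical normalized spectra."""

    def normalize(spectrum):
        total = sum(spectrum.values())
        if total <= 0:
            raise ValueError("spectrum has no intensity")
        return {mz: value / total for mz, value in spectrum.items()}

    sim, exp = normalize(simulated), normalize(experimental)
    peaks = set(sim) | set(exp)  # peaks missing in one spectrum count as 0
    distance = sum(abs(sim.get(mz, 0.0) - exp.get(mz, 0.0)) for mz in peaks)
    return 1.0 / (1.0 + distance)
```

Under these assumptions two spectra without any common peak yield 1/3, so the scale of the value depends on the chosen normalization and should not be compared directly with the scores reported above.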
with the UNICORE server and the UNICORE workflow
engine. The portal offers a straightforward way to make high                                      VI. FUTURE WORK
level computing resources and complicated workflows                            Due to the special licenses required for the quantum
available to a wide range of users, who are not familiar with               chemistry tools it is not possible to distribute the created
these techniques, and hide the complex underlying structures.               Docker image on Docker Hub, which motivated us to create a
    To use the portal to its full extent it is necessary to have a          portal solution instead.
valid User-Grid certificate imported in the browser that has to                 An important future task is the improvement of the login
be registered once.                                                         procedure into the UNICORE portal. It is quite a considerable
                                                                            effort to appear personally at an authentication center and apply
                                                                            for a personal grid certificate. With the software Unity, which
                                                                            is already integrated in UNICORE, it would be possible to
                                                                            simplify the registration process. Unity would enable the
                                                                            authentication via user credentials provided by the user’s home
                                                                            organization and corresponding Shibboleth entitlements.
                                                                               Another area where enhancements are anticipated is the
                                                                            provision of input options for the user to modify the default
                                                                            parameters of the QCEIMS tool.
                                                                                The automatically generated report at the end of the
                                                                            calculations includes the basic results and therefore could be
                                                                            extended with further graphics and more detailed statistics.
                                                                               Currently the containerized workflow is a stable prototype
                                                                            and is already being used by first experimental groups.

                                                                                                  VII. CONCLUSION
                                                                               The use of Docker in combination with UNICORE made it
Fig. 4. UNICORE portal job submission screen, allowing the selection of a   possible to simplify a complex tool such as QCEIMS with
workflow, its configuration and specification of input data.                regards to its installation process as well as to its handling. The
                                                                            presented QCEIMS Docker image is the first of its kind to
    In order to run the UNICORE QCEIMS workflow it is                       cover the topic of mass spectra prediction and also the first
necessary to create a new job in the "Create job" screen (Figure            publication using Docker with UNICORE workflows.
4). The "Select application field" parameter has to be set to               Furthermore, providing Docker images as a complete execution
"Workflow Template". Now it is possible to "Select a                        environment results in a good reproducibility of results for
template" from the file system, in our case the QCEIMS                      other users. The Docker image clearly states which
workflow XML file. After the template has been uploaded to an               parameters of the various additional tools were used; these cannot
HPC instance, the input data can be set and uploaded too.                   change as long as no update of the image is carried out.
                                                                                The use of Docker also revealed weaknesses concerning
D. Evaluation of QCEIMS                                                     security and a missing garbage collection, in particular for the
    Overall, the results of the QCEIMS tool, the simulated                  application of Docker containers in a massively parallel way on
mass spectra, show a sufficient similarity with the                         HPC clusters. In its present version, Docker can only be used
experimentally generated spectra. We used the absolute value                with drawbacks in parallel environments on HPC systems. This
distance as a quality measure to compare the simulated spectra              holds until the problem of the accumulation of data and
                                                                            metadata files is solved. A valid approach to clean up the


exited and sometimes dead containers would be the                               [4]    “OpenVZ.” [Online]. Available: http://openvz.org/. [Accessed: 30-
implementation of a Cron job that searches for containers in the                       Mar-2017].
described states and deletes them. A housekeeping strategy to
                                                                                [5]    “LXC.” [Online]. Available: https://linuxcontainers.org/. [Accessed:
solve the metadata accumulation could be a central Docker
                                                                                       30-Mar-2017].
image repository. A further Cron job, that deletes the directory
of the metadata files and loads all images from the repository                  [6]    G. M. Kurtzer, “Singularity 2.1.2 - Linux application and
back into the system would be a possible workaround.                                   environment containers for science,” 01-Jan-2016. [Online].
                                                                                       Available: https://zenodo.org/record/60736#.WOErqqJBrZs.
    Despite these developed workarounds, it would be desirable
                                                                                       [Accessed: 02-Apr-2017].
if Docker provided a garbage collection tool, both for exited
or dead containers and, even more importantly, for the metadata                 [7]    C. Boettiger, “An introduction to Docker for reproducible
accumulation due to container snapshots. If these problems                             research,” ACM SIGOPS Oper. Syst. Rev., vol. 49, no. 1, pp. 71–79,
were solved, the popularity of Docker would rise even                                  Jan. 2015.
further.                                                                        [8]    T. Bui, “Analysis of Docker Security,” arXiv.org,
    Another outcome is that UNICORE is a very useful piece                             http://arxiv.org/abs/1501.02967, Jan. 2015.
of software. The installation is not difficult considering                      [9]    M. R. Berthold et al., “KNIME - The Konstanz Information Miner,”
what the components do. By installing only three packages, one                         SIGKDD Explor., vol. 11, no. 1, pp. 26–31, Nov. 2009.
easily gets access to an HPC system, a graphical user interface with a
                                                                                [10]   T. Oinn et al., “Taverna/myGrid: Aligning a Workflow System with
versatile workflow environment and a web interface. The
wrapping of the QCEIMS tool into a UNICORE workflow has                                the Life Sciences Community,” in Workflows for e-Science, I. J.
been successfully achieved which shows that UNICORE is                                 Taylor, E. Deelman, D. B. Gannon, and M. Shields, Eds. Springer
generally applicable to this kind of problem and could be                              London, 2007, pp. 300–319.
used for future projects. To date, UNICORE does                                 [11]   “Pipeline Pilot.” [Online]. Available:
not support Docker directly, but the developers are aware of this                      http://accelrys.com/products/collaborative-science/biovia-pipeline-
virtualization technique.                                                              pilot/. [Accessed: 30-Mar-2017].
    Consequently, the combination of Docker and UNICORE                         [12]   W. A. Warr, “Scientific workflow systems: Pipeline Pilot and
represents an excellent setup to carry out fragmentation                               KNIME,” J. Comput. Aided. Mol. Des., vol. 26, no. 7, pp. 801–4,
calculations using QCEIMS. Large libraries of small molecules                          Jul. 2012.
can be processed conveniently and their simulated spectra can
                                                                                [13]   E. Afgan et al., “The Galaxy platform for accessible, reproducible
be used to help with the identification of metabolites.
                                                                                       and collaborative biomedical analyses: 2016 update,” Nucleic Acids
                                                                                       Res., p. gkw343, May 2016.
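
The Cron-based cleanup of exited and dead containers proposed in the conclusion could look roughly like the following sketch. It relies only on the standard Docker commands docker ps (with status filters) and docker rm. The function names are hypothetical, and the injectable run parameter is a design choice of this example that allows testing without a Docker daemon.

```python
import subprocess


def list_stale_containers(run=subprocess.run):
    """Return the IDs of containers in the 'exited' or 'dead' state."""
    result = run(
        ["docker", "ps", "--all", "--quiet",
         "--filter", "status=exited", "--filter", "status=dead"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.split()


def remove_containers(container_ids, run=subprocess.run):
    """Delete the given containers; do nothing if the list is empty."""
    if container_ids:
        run(["docker", "rm", *container_ids], check=True)


def cleanup(run=subprocess.run):
    """Find and delete all exited or dead containers; return their IDs."""
    stale = list_stale_containers(run)
    remove_containers(stale, run)
    return stale
```

A small script that calls cleanup() could then be scheduled by Cron on each compute node. The accumulation of image metadata mentioned above is not addressed by this sketch.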
                        ACKNOWLEDGMENT
                                                                                [14]   A. K. Hildebrandt et al., “ballaxy: web services for structural
    The authors acknowledge support by the High Performance                            bioinformatics,” Bioinformatics, vol. 31, no. 1, pp. 121–122, Sep.
and Cloud Computing Group at the Zentrum für
                                                                                       2014.
Datenverarbeitung of the University of Tübingen, the state of
Baden-Württemberg through bwHPC and the German                                  [15]   K. Benedyczak, B. Schuller, M. Petrova-El Sayed, J. Rybicki, and
Research Foundation (DFG) through grant no INST 37/935-1                               R. Grunzke, “UNICORE 7 — Middleware services for distributed
FUGG. Part of the work presented here was also supported                               and federated computing,” in 2016 International Conference on
through BMBF funded project de.NBI (031 A 534A) and                                    High Performance Computing & Simulation (HPCS), 2016, pp.
MWK Baden-Württemberg funded project CiTAR (“Zitierbare                                613–620.
wissenschaftliche Methoden”). We thank Bernd Schuller for
                                                                                [16]   J. Krüger et al., “The MoSGrid Science Gateway – A Complete
invaluable support with UNICORE, and especially Christoph
                                                                                       Solution for Molecular Simulations,” J. Chem. Theory Comput., vol.
Bauer and Stefan Grimme for the help with QCEIMS.
                                                                                       10, no. 6, pp. 2232–2245, Jun. 2014.
                                                                                [17]   L. Zimmermann, R. Grunzke, and J. Krüger, “Maintaining a Science
                             REFERENCES
                                                                                       Gateway – Lessons Learned from MoSGrid,” in Hawaii
[1]     J. Krüger and O. Kohlbacher, “Containerization and Wrapping of a
                                                                                       International Conference on System Sciences (HICSS), 2017, p.
        Mass Spectra Prediction Workflow,” PeerJ Preprints, pp. 8–10,
                                                                                       http://hdl.handle.net/10125/41918.
        2016.
                                                                                [18]   “PRACE.” [Online]. Available: http://www.prace-ri.eu/. [Accessed:
[2]     S. Soltesz et al., “Container-based operating system virtualization,”
                                                                                       30-Mar-2017].
        in Proceedings of the 2nd ACM SIGOPS/EuroSys European
                                                                                [19]   “UNICORE in the XSEDE infrastructure.” [Online]. Available:
        Conference on Computer Systems 2007 - EuroSys ’07, 2007, vol.
                                                                                       https://portal.xsede.org/software/unicore/. [Accessed: 30-Mar-
        41, no. 3, p. 275.
                                                                                       2017].
[3]     “Docker.”     [Online].  Available:       https://www.docker.com/.
                                                                                [20]   “Human        Brain     Project.”       [Online].     Available:
        [Accessed: 30-Mar-2017].
                                                                                       https://www.humanbrainproject.eu/. [Accessed: 30-Mar-2017].
[21]   “VMware.” [Online]. Available: http://www.vmware.com.
       [Accessed: 30-Mar-2017].
[22]   “Xen.” [Online]. Available: https://www.xenproject.org. [Accessed:
       30-Mar-2017].
[23]   A. Streit et al., “UNICORE 6 - Recent and Future Advancements,”
       JUEL-4319, 2010.
[24]   R. Grunzke, F. Jug, B. Schuller, R. Jäkel, G. Myers, and W. E.
       Nagel, “Seamless HPC Integration of Data-intensive KNIME
       Workflows via UNICORE,” in 4th International Workshop on
       Parallelism in Bioinformatics (PBio 2016), 2016, p. (accepted).
[25]   S. Grimme, “Towards First Principles Calculation of Electron
       Impact Mass Spectra of Molecules,” Angew. Chemie Int. Ed., vol.
       52, no. 24, pp. 6306–6312, Jun. 2013.
[26]   C. A. Bauer and S. Grimme, “How to Compute Electron Ionization
       Mass Spectra from First Principles,” J. Phys. Chem. A, vol. 120, no.
       21, pp. 3755–3766, Jun. 2016.
[27]   V. Ásgeirsson et al., “Unimolecular decomposition pathways of
       negatively charged nitriles by ab initio molecular dynamics,” Phys.
       Chem. Chem. Phys., vol. 18, no. 45, pp. 31017–31026, 2016.
[28]   S. Stein and D. Scott, “Optimization and testing of mass spectral
       library search algorithms for compound identification,” Journal of
       the American Society for Mass Spectrometry, vol. 5, no. 9, pp.
       859–866, 1994.