=Paper=
{{Paper
|id=Vol-2363/paper6
|storemode=property
|title=Efficient Mass Spectra Prediction through Container Orchestration with a Scientific Workflow
|pdfUrl=https://ceur-ws.org/Vol-2363/paper6.pdf
|volume=Vol-2363
|dblpUrl=https://dblp.org/rec/conf/iwsg/HanussekB0K17
}}
==Efficient Mass Spectra Prediction through Container Orchestration with a Scientific Workflow==
9th International Workshop on Science Gateways (IWSG 2017), 19-21 June 2017

Maximilian Hanussek<sup>1,2,3</sup>, Felix Bartusch<sup>1,2,3</sup>, Oliver Kohlbacher<sup>2,3,4,5</sup>, Jens Krüger<sup>1*</sup>

<sup>1</sup> High-Performance and Cloud Computing Group, Zentrum für Datenverarbeitung, University of Tübingen, Tübingen, Germany

<sup>2</sup> Center for Bioinformatics, <sup>3</sup> Dept. of Computer Science, <sup>4</sup> Quantitative Biology Center, University of Tübingen, Tübingen, Germany

<sup>5</sup> Biomolecular Interactions, Max Planck Institute for Developmental Biology, Tübingen, Germany

<sup>*</sup> jens.krueger@uni-tuebingen.de

Abstract: The mass spectrometric fragmentation of small molecules such as metabolites can be simulated with QCEIMS. In this paper we present our work on the containerization of its complex and interdependent software stack. The simulation protocol has been mapped to a UNICORE workflow, enabling convenient access to powerful computing resources. To offer maximum convenience to users, a simple portal was deployed that hides the complexity of the technical details.

Keywords: containerization; workflows; reproducibility; science gateway; mass spectrometry; quantum mechanics

===I. INTRODUCTION===
In the natural sciences, many commonly used software applications are not easy to install or to apply. Installation problems can originate from special computing environments being required or from the number of interdependent additional software packages that need to be installed. Furthermore, many programs require command line interaction by the user. This knowledge is not always present and should not be a prerequisite. Over time, however, technologies have emerged that allow an easy installation and operation of complex tools [1].

One particular technology that has gained popularity in recent years is container virtualization. Representatives of container virtualization methods based on the Linux system are Linux-VServer [2], Docker [3], OpenVZ [4], Linux Container (LXC) [5] and Singularity [6]. Among all these representatives, Docker is the most prominent. Docker and its container technology are a lightweight alternative to full virtual machines. Since the virtualization runs on the host OS, it is possible to run multiple applications in parallel without establishing a new kernel for each application, which makes the container-based technique more lightweight than a hypervisor-based approach [7], [8].

Most installation and computing environment problems can be solved by providing a container for any desired tool. Almost every required program can be wrapped into such a container, which saves installation time and does not require special permissions. The container is a self-contained environment, no matter on which system it runs. This leads to good reproducibility of already achieved results, which is especially important in the natural sciences. Using Docker for distributing software stacks is therefore one approach to solving installation and computing environment problems. It cannot, however, make a complex tool easier to operate.

Another technology that can be used to increase user-friendliness is workflows representing specific scientific protocols. Many workflow platforms are now available, such as KNIME [9], TAVERNA [10], Pipeline Pilot [11], [12], Galaxy [13], [14], and UNICORE [15]. The Uniform Interface to Computing Resources (UNICORE) is a mature middleware solution for creating workflows and, in addition, for accessing distributed computing resources. UNICORE is used in many research fields and settings, from small projects up to large transnational projects like MoSGrid [16], [17], the European PRACE infrastructure [18], the US XSEDE Initiative [19] or the Human Brain Project [20]. An advantage of UNICORE is that it provides access to high-performance computing (HPC) clusters and file systems and offers the possibility to generate workflows suitable for HPC environments.

The interaction between Docker and UNICORE makes it possible to simplify both the installation process and the use of complex tools. In the following sections, Docker and UNICORE are explained in more detail.

===II. DOCKER===
There are two major concepts in the field of software virtualization: container-based virtualization and hypervisor-based virtualization. Both multilayered approaches are illustrated in Figure 1. Examples of hypervisor-based virtualization software are VMware [21] and Xen [22]. A major aspect is that hypervisor-based virtualization establishes a full virtual machine on top of the host operating system. Such a virtual machine has its own operating system (guest OS) and its own kernel. This technology provides virtualization at the hardware level.

Fig. 1. Container-based approach including applications and the necessary binaries and libraries building on the Docker engine (left). Hypervisor-based approach with an additional guest OS on top of the hypervisor layer (right).

In contrast to hypervisor-based virtualization, Docker establishes a virtualization based on the host OS. The virtual environments, usually called containers, run directly on the host kernel [8]. Users can create their own Docker images, which serve as templates for Docker containers. Images are created from a so-called Dockerfile, a plain text file that specifies how the containers are created and run. Docker images are built upon a base image, which can be any operating system compatible with the host OS, on a Linux system for example Ubuntu or CentOS. Images consist of a series of data layers on top of the base image. It is worth mentioning that many containers can be started from a single image; each container does not need its own image. Already available images can be used as a new base image and extended further [7]. This is done simply by adding a new data layer, which is more efficient than building the whole image from scratch. To work with these multiple layers, Docker uses the Union File System, one of its underlying techniques, to merge the different layers into a single, consistent file system. A Docker container provides a virtual environment for its contained applications by leveraging two Linux kernel features: control groups (cgroups) for accounting the processes of the container, and namespaces for providing isolated instances of host resources [8].

===III. UNICORE===
The UNICORE software package is developed at the research center in Jülich and by further partners [15], [23]. It provides different components for handling HPC environments as well as a workflow editor, a web interface (UNICORE portal) and a more advanced graphical user interface, the UNICORE rich client (URC). One advantage of UNICORE is that it is designed to abstract away details specific to a computing resource and thereby simplifies the user experience. Furthermore, it is extensible due to the use of standardized APIs, which for example makes it possible to run KNIME nodes on an HPC system via UNICORE [24]. No special operating system is required, as UNICORE is completely written in the platform-independent programming languages Java and Python. Another important aspect is the need to meet a certain security standard to prevent the loss of sensitive data. Due to the different security and authorization methods offered by UNICORE, the connection between client and server is considered to be safe [15].

Fig. 2. Overview of the different UNICORE components and their interaction with each other [15].

The UNICORE architecture consists of five layers (user layer, gateway layer, UNICORE/X layer, TSI layer, resource layer). The user layer provides end-user clients and applications, but also other UNICORE servers and web portals. The gateway layer serves mostly as a firewall traversal point and forwards information such as the IP address or SSL certificate of the connecting client to the following servers. UNICORE/X is the central component of UNICORE. It receives the client requests that have been submitted via the gateway, authenticates each request, checks the authorization and finally invokes the appropriate service. The Target System Interface (TSI) is connected to the local operating system, the file system and usually a batch system for resource management. The tasks of the TSI include submitting the jobs sent from the client, checking the status of the jobs, and performing the I/O operations. A schematic illustration of the different layers is shown in Figure 2.

===IV. QCEIMS===
The fragmentation of small molecules as it occurs within a mass spectrometry experiment can be simulated with quantum chemical simulations. The method called QCEIMS (Quantum Chemistry Electron Ionization Mass Spectrometry), developed by Grimme et al. [25]–[27], initially creates a trajectory for the molecule of interest and extracts a set of starting conformers for further calculations. Each ionized conformer is fragmented at high temperature, resembling the conditions within a mass spectrometer. The resulting fragmentation distribution over several hundred individual fragmentation runs resembles a mass spectrum and can be compared to experimental data. Such simulated spectra may be used in metabolomics to facilitate the identification of compounds.

===V. WORKFLOW AND SCIENCE GATEWAY===
====A. Overview====
The first application using both Docker and UNICORE together is the UNICORE QCEIMS workflow for mass spectra prediction. The implemented UNICORE workflow embeds the QCEIMS tool [25]. QCEIMS is very well suited for execution on HPC clusters, as it makes heavy use of quantum chemistry programs that require high computing power and its calculations are easy to parallelize. Furthermore, its installation and handling are quite complex. An overview of the workflow is shown in Figure 3. UNICORE is used to encapsulate the whole mass spectra prediction process into a UNICORE workflow. In addition, Docker is used to encapsulate and execute all QCEIMS calculations in Docker containers. The created Docker image contains all software necessary to execute the QCEIMS calculations. These tools are listed in Table 1. Only the UNICORE-specific software components are not included, and neither is MOPAC, due to the license it requires.

Fig. 3. The workflow for the prediction of mass spectra based on quantum chemical fragmentation calculations. It takes advantage of multiple control structures to efficiently process even larger numbers of molecules. The different nesting levels are highlighted by distinct colors.

{| class="wikitable"
|+ Tab. 1. Software included in QCEIMS Docker image.
! Program !! Version
|-
| Python || 2.7.10
|-
| R (with Sweave) || 3.2.3
|-
| QCEIMS || 2.26I
|-
| MNDO99 || 7.0
|-
| DFTB+ || 1.2.2
|-
| ORCA || 3.0.3
|-
| InChI version 1 || 1.04
|-
| PubChemPy || 1.0.3
|-
| TeX Live || 2016
|}

====B. Workflow description====
The implemented workflow accepts structure data files (.sdf) as input, which can contain the structure of one or more molecules. Because UNICORE executes every job in a single directory with no subdirectories, it is necessary to encapsulate each molecular structure in its own job directory. This is achieved with the for-each loop concept of the UNICORE workflow editor and is represented as the outer for loop in Figure 3. After the necessary format conversion from the .sdf file format into the .tmol format, an open shell check is performed with MOPAC. If the molecule is not an open shell molecule, the configuration file containing the QCEIMS parameters is generated automatically. After these preparation steps, the quantum chemical calculations are started for the first time. All programs necessary for this step are already installed in a Docker container.

After these calculations have finished, a second encapsulation with a second for-each loop is performed for the fragmentation calculations. This is necessary due to the structure of the QCEIMS tool. QCEIMS assumes that the files required for the subsequent calculations are available in separate directories and can be processed within these directories, which does not fit the workflow implementation of UNICORE. Further, QCEIMS uses the same file names in all subdirectories. This is not a problem as long as the files remain in separate directories, but that is not the case in UNICORE. If the different files were not encapsulated into single jobs with the for-each loop construct, it would not be possible to distinguish them later. After the fragmentation calculations, the single spectra of each fragmentation have to be merged into the final simulated spectrum. In the end, the generated results are automatically integrated into a report and stored in a database.

====C. Workflow execution through web interface====
The implemented QCEIMS workflow can be used by importing it into the UNICORE rich client (URC), which is recommended for users who are already familiar with the UNICORE environment. With the URC it is also possible to export the implemented workflow as an .xml file and use it in the UNICORE portal if the user wants to modify the workflow.

Another option to execute the workflow is to use the UNICORE portal. The UNICORE portal is a further component of the UNICORE tool set and works seamlessly with the UNICORE server and the UNICORE workflow engine. The portal offers a straightforward way to make high-level computing resources and complicated workflows available to a wide range of users who are not familiar with these techniques, and it hides the complex underlying structures. To use the portal to its full extent, it is necessary to have a valid user grid certificate imported in the browser, which has to be registered once.

Fig. 4. UNICORE portal job submission screen, allowing the selection of a workflow, its configuration and specification of input data.

In order to run the UNICORE QCEIMS workflow, a new job has to be created in the "Create job" screen (Figure 4). The "Select application field" parameter has to be set to "Workflow Template". Now it is possible to "Select a template" from the file system, in our case the QCEIMS workflow XML file. After the template has been uploaded to an HPC instance, the input data can be set and uploaded too.

====D. Evaluation of QCEIMS====
Overall, the results of the QCEIMS tool, the simulated mass spectra, show a sufficient similarity to the experimentally generated spectra. We used the absolute value distance as the quality measure to compare the simulated spectra with their experimental counterparts [28]. Every molecule of the small test set we used showed a score close to 0.6 or higher. The similarity represented by this score is high enough to find the simulated compounds among the top 3 hits when matched against a mass spectral database. In most cases it will be the first hit. When comparing Docker containers with bare-metal execution, no differences in run time were observed. Due to the for-each loop construct provided by the UNICORE workflow, the individual fragmentation calculations can be run in parallel. Furthermore, each quantum chemical simulation can be distributed over more than a single CPU, but this would not be efficient.

===VI. FUTURE WORK===
Due to the special licenses required for the quantum chemistry tools, it is not possible to distribute the created Docker image on Docker Hub, which motivated us to create a portal solution instead.

An important future task is the improvement of the login procedure for the UNICORE portal. Appearing personally at an authentication center and applying for a personal grid certificate is a considerable effort. With the software Unity, which is already integrated in UNICORE, it would be possible to simplify the registration process. Unity would enable authentication via user credentials provided by the user's home organization and corresponding Shibboleth entitlements.

Another area where enhancements are anticipated is the provision of input options that let the user modify the default parameters of the QCEIMS tool. The automatically generated report at the end of the calculations currently includes only the basic results and could be extended with further graphics and more detailed statistics. Currently the containerized workflow is a stable prototype and is already being used by the first experimental groups.

===VII. CONCLUSION===
The use of Docker in combination with UNICORE made it possible to simplify a complex tool such as QCEIMS with regard to both its installation process and its handling. The presented QCEIMS Docker image is the first of its kind to cover the topic of mass spectra prediction, and this is also the first publication using Docker with UNICORE workflows. Furthermore, providing Docker images as a complete execution environment results in good reproducibility of results for other users. The Docker image clearly states which parameters of the various additional tools were used, and these cannot change as long as the image is not updated.

The use of Docker also revealed weaknesses concerning security and a missing garbage collection, in particular when Docker containers are applied in a massively parallel way on HPC clusters. In its present version, Docker can only be used with drawbacks in parallel environments on HPC systems. This holds until the problem of the accumulation of data and metadata files is solved. A valid approach to clean up the exited and sometimes dead containers would be a Cron job that searches for containers in these states and deletes them. A housekeeping strategy to solve the metadata accumulation could be a central Docker image repository. A further Cron job that deletes the directory of the metadata files and loads all images from the repository back into the system would be a possible workaround.

Despite these workarounds, it would be desirable for Docker to provide a garbage collection tool, first for the exited or dead containers and, even more importantly, for the metadata accumulation due to the container snapshots. If these problems can be solved, the popularity of Docker will rise even further.

Another outcome is that UNICORE is a very useful piece of software. The installation is not that difficult considering what the components do. By installing only three packages, one easily gets access to an HPC system, a graphical user interface with a versatile workflow environment and a web interface. The QCEIMS tool has been successfully wrapped into a UNICORE workflow, which shows that UNICORE is generally applicable to this kind of problem and could be used for future projects. To the present day, UNICORE does not support Docker directly, but the developers are aware of this virtualization technique.

Consequently, the combination of Docker and UNICORE represents an excellent setup to carry out fragmentation calculations using QCEIMS. Large libraries of small molecules can be processed conveniently, and their simulated spectra can be used to help with the identification of metabolites.

===ACKNOWLEDGMENT===
The authors acknowledge support by the High Performance and Cloud Computing Group at the Zentrum für Datenverarbeitung of the University of Tübingen, the state of Baden-Württemberg through bwHPC and the German Research Foundation (DFG) through grant no INST 37/935-1 FUGG. Part of the work presented here was also supported through the BMBF-funded project de.NBI (031 A 534A) and the MWK Baden-Württemberg-funded project CiTAR ("Zitierbare wissenschaftliche Methoden"). We thank Bernd Schuller for invaluable support with UNICORE, and especially Christoph Bauer and Stefan Grimme for the help with QCEIMS.

===REFERENCES===
[1] J. Krüger and O. Kohlbacher, "Containerization and Wrapping of a Mass Spectra Prediction Workflow," PeerJ Preprints, pp. 8–10, 2016.

[2] S. Soltesz et al., "Container-based operating system virtualization," in Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007 - EuroSys '07, 2007, vol. 41, no. 3, p. 275.

[3] "Docker." [Online]. Available: https://www.docker.com/. [Accessed: 30-Mar-2017].

[4] "OpenVZ." [Online]. Available: http://openvz.org/. [Accessed: 30-Mar-2017].

[5] "LXC." [Online]. Available: https://linuxcontainers.org/. [Accessed: 30-Mar-2017].

[6] G. M. Kurtzer, "Singularity 2.1.2 - Linux application and environment containers for science," 01-Jan-2016. [Online]. Available: https://zenodo.org/record/60736#.WOErqqJBrZs. [Accessed: 02-Apr-2017].

[7] C. Boettiger, "An introduction to Docker for reproducible research," ACM SIGOPS Oper. Syst. Rev., vol. 49, no. 1, pp. 71–79, Jan. 2015.

[8] T. Bui, "Analysis of Docker Security," arXiv.org, Jan. 2015. [Online]. Available: http://arxiv.org/abs/1501.02967.

[9] M. R. Berthold et al., "KNIME - The Konstanz Information Miner," SIGKDD Explor., vol. 11, no. 1, pp. 26–31, Nov. 2009.

[10] T. Oinn et al., "Taverna/myGrid: Aligning a Workflow System with the Life Sciences Community," in Workflows for e-Science, I. J. Taylor, E. Deelman, D. B. Gannon, and M. Shields, Eds. Springer London, 2007, pp. 300–319.

[11] "Pipeline Pilot." [Online]. Available: http://accelrys.com/products/collaborative-science/biovia-pipeline-pilot/. [Accessed: 30-Mar-2017].

[12] W. A. Warr, "Scientific workflow systems: Pipeline Pilot and KNIME," J. Comput. Aided. Mol. Des., vol. 26, no. 7, pp. 801–4, Jul. 2012.

[13] E. Afgan et al., "The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2016 update," Nucleic Acids Res., p. gkw343, May 2016.

[14] A. K. Hildebrandt et al., "ballaxy: web services for structural bioinformatics," Bioinformatics, vol. 31, no. 1, pp. 121–122, Sep. 2014.

[15] K. Benedyczak, B. Schuller, M. Petrova-El Sayed, J. Rybicki, and R. Grunzke, "UNICORE 7 - Middleware services for distributed and federated computing," in 2016 International Conference on High Performance Computing & Simulation (HPCS), 2016, pp. 613–620.

[16] J. Krüger et al., "The MoSGrid Science Gateway – A Complete Solution for Molecular Simulations," J. Chem. Theory Comput., vol. 10, no. 6, pp. 2232–2245, Jun. 2014.

[17] L. Zimmermann, R. Grunzke, and J. Krüger, "Maintaining a Science Gateway – Lessons Learned from MoSGrid," in Hawaii International Conference on System Sciences (HICSS), 2017. [Online]. Available: http://hdl.handle.net/10125/41918.

[18] "PRACE." [Online]. Available: http://www.prace-ri.eu/. [Accessed: 30-Mar-2017].

[19] "UNICORE in the XSEDE infrastructure." [Online]. Available: https://portal.xsede.org/software/unicore/. [Accessed: 30-Mar-2017].

[20] "Human Brain Project." [Online]. Available: https://www.humanbrainproject.eu/. [Accessed: 30-Mar-2017].

[21] "VMware." [Online]. Available: http://www.vmware.com. [Accessed: 30-Mar-2017].

[22] "Xen." [Online]. Available: https://www.xenproject.org. [Accessed: 30-Mar-2017].

[23] A. Streit et al., "UNICORE 6 - Recent and Future Advancements," JUEL-4319, 2010.

[24] R. Grunzke, F. Jug, B. Schuller, R. Jäkel, G. Myers, and W. E. Nagel, "Seamless HPC Integration of Data-intensive KNIME Workflows via UNICORE," in 4th International Workshop on Parallelism in Bioinformatics (PBio 2016), 2016, p. (accepted).

[25] S. Grimme, "Towards First Principles Calculation of Electron Impact Mass Spectra of Molecules," Angew. Chemie Int. Ed., vol. 52, no. 24, pp. 6306–6312, Jun. 2013.

[26] C. A. Bauer and S. Grimme, "How to Compute Electron Ionization Mass Spectra from First Principles," J. Phys. Chem. A, vol. 120, no. 21, pp. 3755–3766, Jun. 2016.

[27] V. Ásgeirsson et al., "Unimolecular decomposition pathways of negatively charged nitriles by ab initio molecular dynamics," Phys. Chem. Chem. Phys., vol. 18, no. 45, pp. 31017–31026, 2016.

[28] S. Stein and D. Scott, "Optimization and testing of mass spectral library search algorithms for compound identification," Journal of the American Society for Mass Spectrometry, 5(9):859–866, 1994.
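
The Cron-based housekeeping for exited and dead containers that the Conclusion proposes could look roughly like the following. This is a minimal sketch and not the authors' implementation: the function names (<code>run_cli</code>, <code>find_stale_containers</code>, <code>remove_stale_containers</code>) are hypothetical, and it assumes the <code>docker</code> command line client is on the path of the user running the job. It covers only container cleanup, not the metadata accumulation discussed alongside it.

```python
import subprocess
from typing import Callable, List, Sequence

# A runner takes an argument vector and returns the command's standard output.
# Passing it in as a parameter lets the logic be exercised without Docker.
Runner = Callable[[Sequence[str]], str]


def run_cli(argv: Sequence[str]) -> str:
    """Run a command and return its standard output; raises if it fails."""
    result = subprocess.run(list(argv), check=True, capture_output=True, text=True)
    return result.stdout


def find_stale_containers(run: Runner = run_cli) -> List[str]:
    """Collect the IDs of all containers whose status is 'exited' or 'dead'."""
    ids: List[str] = []
    for status in ("exited", "dead"):
        # One query per status: list all containers, print IDs only.
        out = run(["docker", "ps", "--all", "--quiet", "--filter", f"status={status}"])
        for container_id in out.split():
            if container_id not in ids:
                ids.append(container_id)
    return ids


def remove_stale_containers(run: Runner = run_cli) -> List[str]:
    """Delete every exited or dead container and return the removed IDs."""
    ids = find_stale_containers(run)
    if ids:
        # Only call 'docker rm' when there is something to remove.
        run(["docker", "rm", *ids])
    return ids
```

To use this from Cron, a script would call <code>remove_stale_containers()</code> at its end and be scheduled at whatever interval suits the cluster; the interval is a site-specific choice that the paper does not specify.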