A feedback guided interface for elastic computing

Sebastian Schönherr1,2,∗, Lukas Forer1,2,∗, Hansi Weißensteiner1,2, Florian Kronenberg2, Günther Specht1, Anita Kloss-Brandstätter2
∗ contributed equally
1 Databases and Information Systems, Institute of Computer Science, University of Innsbruck, Austria, sebastian.schoenherr@uibk.ac.at
2 Division of Genetic Epidemiology, Department of Medical Genetics, Molecular and Clinical Pharmacology, Innsbruck Medical University, Austria, lukas.forer@i-med.ac.at

ABSTRACT
Computer Science plays an important role in today's Genetics. New sequencing methods produce an enormous amount of data, pushing genetic laboratories to storage and computational limits. New approaches are needed to eliminate these shortcomings and to make current solutions and algorithms in the area of Bioinformatics reproducible. In this paper a system is proposed which simplifies access to computational resources and the associated computational models of cluster architectures, assists end users in executing and monitoring developed algorithms via a web interface, and provides an interface to add future developments or any other kind of program. We demonstrate on existing algorithms how an integration can be done with little effort, making the system especially useful for the evaluation and simplified usage of current algorithms.

Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous

General Terms
Distributed System, Experimentation, Application

Keywords
Bioinformatics, Hadoop, MapReduce, Cloud computing

1. INTRODUCTION
In recent years Computer Science has become an essential part of the field of Genetics. Especially through the advent of Next Generation Sequencing (NGS), whereby a human genome (3 billion base pairs per chromosome set) can be sequenced in acceptable time, the amount of data is growing significantly, exceeding all previously known dimensions in Genetics. Figure 1 shows a comparison between the falling DNA sequencing costs and Moore's law. Moore's law is used as a reference to show that computer hardware currently cannot keep pace with the progress in DNA sequencing. Furthermore, the number of completely sequenced individuals is growing exponentially from year to year [11], making new models necessary. For instance, storing one complete human DNA (deoxyribonucleic acid) in raw format with 30-times coverage produces 30 terabytes of data.

In the area of Copy Number Variations, a possible cause of many complex genetic disorders, high-throughput algorithms are needed to process and analyze several hundred gigabytes of raw input data [16], [6], yielding a wall time of up to one week for a typical study size [18]. This remarkable increase in data and runtime causes genetic departments to consider new ways of importing and storing data as well as improving the performance of current algorithms. Cluster architectures in connection with associated programming models have the potential to solve this issue, but they are often unprofitable and unaffordable, especially for small departments. Using clusters on demand, also referred to as Infrastructure as a Service (IaaS), therefore offers a good opportunity to circumvent these issues. To exploit the full potential of IaaS, a combination with distribution models like MapReduce [5] is, for specific applications, both possible and obvious. Several isolated applications [9], [10], [14] already exist that use a distributed approach for storing data and processing algorithms. But since no general system is available to execute those solutions, evaluation and reproducibility are often not feasible.
Scientists need to set up a cluster on their own or use a provided remote cluster architecture to evaluate a published algorithm, which is both time-wasting and insecure for sensitive data.

23rd GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 31.05.2011 - 03.06.2011, Obergurgl, Austria.
Copyright is held by the author/owner(s).

Figure 1: Comparison of DNA sequencing cost with Moore's law; data from [17]

In this paper we present the idea to build an integrated system for scientists in the area of Bioinformatics to (1) get access to distributed cluster architectures and execute existing algorithms, (2) build maintainable and reproducible workflows and (3) provide an interface to add future developments or any kind of program to the system without detailed IT knowledge. The remainder of this paper is structured as follows: Section 2 gives an overview of the related work. In Section 3 the architecture of our suggested system is explained in more detail, with potential case studies in Section 4. Section 5 shows necessary future work and the paper ends with a conclusion in Section 6.

2. RELATED WORK
Cluster solutions guided by a web interface to execute distributed algorithms, like Myrna [9], Crossbow [10] or CloudBurst [13], already exist. Unfortunately, the user must log in to the Amazon Web Services (AWS) console to monitor the progress of executed jobs or to shut down the cluster after execution. Additionally, data storage in S3 buckets is often required and a custom web interface needs to be implemented for every single approach. Galaxy [7] is a software system which facilitates the creation, execution and maintainability of pipelines in a fast and user-friendly way. The platform itself executes the scripts and the user has the possibility to monitor the progress. Galaxy's extension CloudMan [1] provides the possibility to install and execute Galaxy on Amazon EC2 (Elastic Compute Cloud). However, the user needs to start the master node manually by using the AWS console, and Galaxy does not provide native support for Hadoop programs; it executes modules step by step and distributes only whole jobs among the cluster.

3. ARCHITECTURE
A modular architecture is suggested in Figure 2, separating the process of instantiating and setting up a cluster (Cloudgene) from the process of monitoring and running a program (EMI). Based on open source frameworks like Apache Hadoop [2] and Apache Whirr [3], we implemented a prototype to verify our approach. The user utilizes Cloudgene to set up a cluster architecture to his needs through XML configuration files. This allows adding new algorithms dynamically without digging too deep into Cloudgene. A fully operable and customized cluster is then provided, including all necessary user data. In a subsequent step, EMI (Elastic MapReduce Interface) is launched on the master node of the cluster. EMI can be seen as an abstraction of the underlying system architecture for the end user; it lies on top of the integrated programs and allows the user to communicate and interact with the cluster as well as receive feedback on currently executed workflows (see Figure 3). EMI can be disabled in case a program already includes an interface of its own, yielding the most general approach to executing any kind of developed solution. Both parts can be operated separately via configuration files with clearly defined input and output variables.

3.1 Cloudgene
Amazon provides with EC2 the currently most developed service for public clouds in the area of IaaS.
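Under the hood, Cloudgene drives Apache Whirr to provision such clusters. For comparison, a stand-alone Whirr launch of a comparable Hadoop cluster on EC2 is described by a properties recipe; the property names below are standard Whirr configuration keys, while the concrete values are merely illustrative:

```
# Illustrative Whirr recipe for a small Hadoop cluster on EC2.
whirr.cluster-name=cloudgene-demo
# One master (namenode + jobtracker) and four workers (datanode + tasktracker).
whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,4 hadoop-datanode+hadoop-tasktracker
whirr.provider=aws-ec2
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
# SSH key pair used to reach the provisioned nodes.
whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
whirr.public-key-file=${sys:user.home}/.ssh/id_rsa.pub
```

Such a cluster is started with `whirr launch-cluster --config hadoop.properties` and torn down with `whirr destroy-cluster`; Cloudgene automates exactly this provisioning step behind a web interface.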
Cloudgene supports, besides EC2, also Rackspace [12] to provide access to cluster infrastructure. As mentioned in the introduction, a combination with MapReduce is useful: in this paradigm, the master node chops the data into chunks and distributes them over all active worker nodes (map step). Subsequently, the master node reassigns coherent map results to worker nodes (sort and shuffle) to calculate the final result (reduce step). For this project, Apache Hadoop's implementation of MapReduce and its distributed file system (HDFS) are used.

Using Whirr as a connector, Cloudgene is able to instantiate a fully working EC2 or Rackspace cluster for end users with various defined properties and copies the necessary program data and configuration files to the cluster. Examples of defined variables are the desired image, the number and kind of instances, HDFS options, MapReduce properties and the user's SSH public key. Amazon already provides several predefined images for all sorts of use cases, which can be used with Cloudgene (e.g. http://www.cloudbiolinux.com). Cloudgene takes over the customization of predefined images and installs services like MapReduce, in our case included in Cloudera's distribution of Apache Hadoop [4]. The cluster configuration is defined in an XML-based file format, including all necessary information for a successful cluster boot. Cloudgene routinely checks if new configurations have been added and offers the possibility to execute newly defined programs. Since EC2 uses a pay-per-use model, end users must provide their Amazon Access ID and Secret Key, which is transferred via Cloudgene to Amazon in a secure way.
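The division of work just described (map, sort and shuffle, reduce) can be illustrated with a small, self-contained word-count example. This is a single-process didactic sketch, not Hadoop code: on a real cluster the map and reduce tasks run on the worker nodes and the data lives in HDFS.

```java
import java.util.*;
import java.util.stream.*;

public class MapReduceSketch {

    // Map step: each input chunk is turned into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String chunk) {
        return Arrays.stream(chunk.toLowerCase().split("\\s+"))
                .filter(w -> !w.isEmpty())
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // Sort and shuffle: group all intermediate values by key (sorted here via TreeMap).
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        return pairs.stream().collect(Collectors.groupingBy(
                Map.Entry::getKey, TreeMap::new,
                Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
    }

    // Reduce step: aggregate the values of each key into the final result.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> result = new TreeMap<>();
        grouped.forEach((k, v) ->
                result.put(k, v.stream().mapToInt(Integer::intValue).sum()));
        return result;
    }

    public static void main(String[] args) {
        // The "master" chops the input into chunks for the workers.
        List<String> chunks = List.of("to be or", "not to be");
        List<Map.Entry<String, Integer>> intermediate = chunks.stream()
                .flatMap(c -> map(c).stream())
                .collect(Collectors.toList());
        System.out.println(reduce(shuffle(intermediate))); // {be=2, not=1, or=1, to=2}
    }
}
```

Hadoop adds to this skeleton exactly the parts that are hard to do by hand: distribution of chunks over nodes, fault tolerance, and the network move of intermediate results during the shuffle.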
Alternatively, Cloudgene can also be launched on any machine having Java installed, eliminating the transfer via our server.

Figure 2: Architecture of the suggested system including Cloudgene and EMI

Cloudgene solves one important issue and gives genetic departments access to computational power and storage. A still unresolved problem is the lack of a graphical user interface to control jobs derived from command-line based applications. Especially the need to put enormous amounts of local data into HDFS has to be considered. To overcome these shortcomings, a user interface (EMI) was designed.

3.2 Efficient MapReduce Interface (EMI)
Running Hadoop MapReduce programs on a cluster requires the execution of several non-trivial steps: first, the user must upload all input data to the master node, copy the data into the proprietary HDFS, run the Hadoop MapReduce job, export the results from the file system and finally download them to the local workstation. For researchers without expertise in Computer Science these tasks turn out to be very challenging. For this purpose we developed EMI, which facilitates the execution, monitoring and evaluation of MapReduce jobs. A web interface, which runs on the master node of the cluster, enables the execution of jobs through well-structured wizards, setting all required parameters step by step. As several studies have shown, reproducibility of data analysis is one of the greatest problems in biomedical publications [15]. For this reason the execution of a MapReduce job with its parameters and input data is logged, so that a fast comparison of experiments with different settings is possible. Moreover, the user always has full control over the execution of each job and can monitor its current progress and status. All running jobs are listed, whereby the progress of the map and reduce phase is displayed separately.

Figure 3: Workflow of the system including Cloudgene and EMI
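The reproducibility logging described above, where each job is stored together with its parameters and input data so that runs can be compared, can be sketched as follows. This is a hypothetical illustration of the idea only; the class, method and parameter names are invented for the example and not taken from EMI.

```java
import java.time.Instant;
import java.util.*;

/**
 * Hypothetical sketch of job bookkeeping for reproducibility:
 * every execution is recorded with its program, parameters and
 * input files, so two runs can be diffed quickly.
 */
public class JobLog {

    record JobRecord(String program, Map<String, String> parameters,
                     List<String> inputs, Instant submitted) {}

    private final List<JobRecord> records = new ArrayList<>();

    /** Store one executed job with everything needed to reproduce it. */
    public void log(String program, Map<String, String> parameters, List<String> inputs) {
        records.add(new JobRecord(program, parameters, inputs, Instant.now()));
    }

    /** Parameters whose values differ between two logged runs. */
    public static Set<String> changedParameters(JobRecord a, JobRecord b) {
        Set<String> keys = new TreeSet<>(a.parameters().keySet());
        keys.addAll(b.parameters().keySet());
        keys.removeIf(k -> Objects.equals(a.parameters().get(k), b.parameters().get(k)));
        return keys;
    }

    public List<JobRecord> records() { return records; }
}
```

Comparing two logged runs of the same program then reduces to a set difference over their parameter maps, which is what makes "a fast comparison of experiments with different settings" possible.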
Since using resources from Amazon costs money, EMI informs the user about the uptime of the cluster and the number of rented instances (Figure 4).

The modular architecture enables a fast integration of any Hadoop job which would normally be executed through the command line. A simple and clear XML configuration file describes the input and output parameters of the program and contains other relevant information that is necessary to start the job (see Section 4). In addition to this file, a zip archive exists which contains all software-relevant data (e.g. jar file, meta data, configuration files). With those files, EMI automatically generates a web interface which provides the possibility to set each defined parameter through wizards and to run the defined job with a single click. As mentioned earlier, all input data must be put into the robust and fault-tolerant HDFS. As this process is very time-intensive and error-prone, EMI supports the user by providing a wizard which enables the import of data from different sources (FTP, HTTP, Amazon S3 buckets or local file uploads). In addition, files defined as output parameters can be exported and downloaded as a zip archive or can be uploaded to Amazon S3 or FTP servers. EMI supports a multi-user mode whereby all data of a certain user are password protected and executed jobs are scheduled through a queue system. Overall, EMI is fully independent from Cloudgene and can be installed on a local Hadoop cluster too.

4. CASE STUDIES
In this section we explain how new programs can be integrated into Cloudgene and EMI. Based on two different biomedical software solutions we demonstrate the diversity and simplicity of our approach. For CloudBurst (Section 4.1), for example, the EMI configuration specifies the command to run together with named input and output parameters:

CloudBurst
hadoop jar emi/cloudburst/CloudBurst.jar \
  $input1 $input2 $output1 36 36 3 0 1 240 \
  48 24 24 128 16
Reference Genome: data/cloudburst/s_suis.br
Reads: data/cloudburst/100k.br
Results: data/cloudburst/results
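The CloudBurst job definition used in Section 4 maps a command-line call to named input and output parameters. Rendered as an XML file of the kind EMI reads, it could look roughly as follows; the tag and attribute names are assumptions for illustration, since only the listing's values appear in the text:

```xml
<!-- Hypothetical EMI job definition; tag and attribute names are illustrative. -->
<program name="CloudBurst">
  <command>
    hadoop jar emi/cloudburst/CloudBurst.jar \
      $input1 $input2 $output1 36 36 3 0 1 240 \
      48 24 24 128 16
  </command>
  <input id="input1" label="Reference Genome" value="data/cloudburst/s_suis.br"/>
  <input id="input2" label="Reads" value="data/cloudburst/100k.br"/>
  <output id="output1" label="Results" value="data/cloudburst/results"/>
</program>
```

From such a description, a web interface can generate one wizard page per input parameter and register the output path for the later export step.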
4.1 CloudBurst
CloudBurst is a parallel read-mapping algorithm to map NGS data to the human genome and other reference genomes [13]. It is implemented as a MapReduce program using Hadoop and can be executed with the following command:

hadoop jar emi/cloudburst/CloudBurst.jar \
  reference_genome reads results 36 36 3 0 1 240 \
  48 24 24 128 16

In order to execute CloudBurst, we create a configuration file for Cloudgene which starts a Hadoop cluster on Amazon EC2 with a standard Ubuntu Linux image and the Hadoop ports 50030 and 50070 open. The corresponding XML registers the program under the name CloudBurst. As CloudBurst has no graphical user interface, we install EMI on the Amazon EC2 cluster and use it for user interactions. For this purpose the command above with its arguments is translated into an EMI configuration file which maps each argument to a named input or output parameter. After the XML file is uploaded to the Cloudgene server, the user starts a web browser to (1) log in to Cloudgene, (2) start up a cluster preconfigured with CloudBurst and (3) run and monitor jobs with EMI (Figure 4). Compared to a standard manual approach, this eliminates error-prone and time-consuming tasks such as (1) setting up a cluster and connecting via the command line to the master node, (2) uploading and importing data into HDFS, (3) exporting final results from HDFS and downloading them and (4) executing and reproducing MapReduce jobs with different configurations via a web interface. This shows that an easy integration can be done using a simple XML configuration, supporting and guiding researchers as far as possible.

4.2 HaploGrep
HaploGrep is a reliable algorithm implemented in a web application to determine the haplogroup affiliation of thousands of mitochondrial DNA profiles [8]. As a web application, HaploGrep uses the ports 80 (http) and 443 (https); therefore these ports are marked as open. The configuration file for Cloudgene with all requirements registers the program under the name Haplogrep. After the cluster setup is finalized, Cloudgene returns a web address which points to the installed instance of HaploGrep.

5. FUTURE WORK
One of the biggest advantages of IaaS is the ability to change the number of needed datanodes on demand. Thus, the next version of Cloudgene is conceived to provide functions for adding and removing instances during runtime. Currently, clusters started with Cloudgene are not data persistent, which leads to data loss after a shutdown. For this purpose we plan to store all results on persistent Amazon EBS volumes. Furthermore, a simple user interface for Hadoop is not only useful for the end user but also for developers. It supports them during the whole prototyping and testing process of novel MapReduce algorithms by highlighting performance bottlenecks. Thus, we plan to implement time measurements of the map, reduce and shuffle phases and to visualize them in an intuitive chart. Additionally, Hadoop plans in its next generation to support alternative programming paradigms to MapReduce, which is particularly important for applications (e.g. K-Means) where custom frameworks outperform MapReduce by an order of magnitude.

6. CONCLUSION
We presented a software system for running and maintaining elastic computer clusters. Our approach combines the individual steps of setting up a cluster into a user-friendly system. Its modular architecture enables a fast integration of any Hadoop job which previously could only be executed through the command line. By hiding the low-level informatics, it is an ideal system for researchers without deeper knowledge in Computer Science. Moreover, our system is not constricted to the life sciences and can be used in nearly every application area. Overall, it is a first approach to narrowing the gap between cloud computing and usability.

7. ACKNOWLEDGMENTS
Sebastian Schönherr was supported by a scholarship from the University of Innsbruck (Doktoratsstipendium aus der Nachwuchsförderung, MIP10/2009/3). Hansi Weißensteiner was supported by a scholarship from the Autonomous Province of Bozen/Bolzano (South Tyrol). The project was supported by the Amazon Research Grant. We thank the Whirr mailing list, especially Tom White and Andrei Savu, for their assistance.

8. REFERENCES
[1] E. Afgan, D. Baker, N. Coraor, B. Chapman, A. Nekrutenko, and J. Taylor. Galaxy CloudMan: delivering cloud compute clusters. BMC Bioinformatics, 11 Suppl 12:S4, 2010.
[2] Apache Hadoop. http://hadoop.apache.org.
[3] Apache Whirr. http://incubator.apache.org/whirr/.
[4] Cloudera. http://www.cloudera.com/.
[5] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. In OSDI'04: Proceedings of the 6th Symposium on Operating Systems Design & Implementation, Berkeley, CA, USA, 2004. USENIX Association.
[6] L. Forer, S. Schönherr, H. Weißensteiner, F. Haider, T. Kluckner, C. Gieger, H. E. Wichmann, G. Specht, F. Kronenberg, and A. Kloss-Brandstätter. CONAN: copy number variation analysis software for genome-wide association studies. BMC Bioinformatics, 11:318, 2010.
[7] J. Goecks, A. Nekrutenko, J. Taylor, E. Afgan, G. Ananda, D. Baker, D. Blankenberg, R. Chakrabarty, N. Coraor, G. Von Kuster, R. Lazarus, K. Li, and K. Vincent. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol., 11:R86, 2010.
[8] A. Kloss-Brandstätter, D. Pacher, S. Schönherr, H. Weißensteiner, R. Binna, G. Specht, and F. Kronenberg. HaploGrep: a fast and reliable algorithm for automatic classification of mitochondrial DNA haplogroups. Hum. Mutat., 32:25–32, Jan 2011.
[9] B. Langmead, K. D. Hansen, and J. T. Leek. Cloud-scale RNA-sequencing differential expression analysis with Myrna. Genome Biol., 11:R83, 2010.
[10] B. Langmead, M. C. Schatz, J. Lin, M. Pop, and S. L. Salzberg. Searching for SNPs with cloud computing.
Genome Biol., 10:R134, 2009.
[11] R. E. Mills et al. Mapping copy number variation by population-scale genome sequencing. Nature, 470:59–65, Feb 2011.
[12] Rackspace. http://www.rackspace.com.
[13] M. C. Schatz. CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics, 25:1363–1369, Jun 2009.
[14] M. C. Schatz. The missing graphical user interface for genomics. Genome Biol., 11:128, 2010.
[15] L. Shi et al. The balance of reproducibility, sensitivity, and specificity of lists of differentially expressed genes in microarray studies. BMC Bioinformatics, 9 Suppl 9:S10, 2008.
[16] K. Wang, M. Li, D. Hadley, R. Liu, J. Glessner, S. F. A. Grant, H. Hakonarson, and M. Bucan. PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Research, 17(11):1665–1674, Nov. 2007.
[17] K. A. Wetterstrand. DNA sequencing costs: data from the NHGRI Large-Scale Genome Sequencing Program. Available: http://www.genome.gov/sequencingcosts; accessed 04/11/11.
[18] H. E. Wichmann, C. Gieger, and T. Illig. KORA-gen – resource for population genetics, controls and a broad spectrum of disease phenotypes. Gesundheitswesen, 67 Suppl 1:26–30, Aug 2005.