A feedback guided interface for elastic computing

Sebastian Schönherr1,2,∗, Lukas Forer1,2,∗, Hansi Weißensteiner1,2, Florian Kronenberg2, Günther Specht1, Anita Kloss-Brandstätter2
∗ contributed equally
1 Databases and Information Systems, Institute of Computer Science, University of Innsbruck, Austria, sebastian.schoenherr@uibk.ac.at
2 Division of Genetic Epidemiology, Department of Medical Genetics, Molecular and Clinical Pharmacology, Innsbruck Medical University, Austria, lukas.forer@i-med.ac.at

ABSTRACT
Computer Science plays an important role in today's Genetics. New sequencing methods produce an enormous amount of data, pushing genetic laboratories to storage and computational limits. New approaches are needed to eliminate these shortcomings and to make current solutions and algorithms in the area of Bioinformatics reproducible. In this paper a system is proposed which simplifies access to computational resources and the associated computational models of cluster architectures, assists end users in executing and monitoring developed algorithms via a web interface, and provides an interface to add future developments or any other kind of program. We demonstrate on existing algorithms how an integration can be done with little effort, making the system especially useful for the evaluation and simplified usage of current algorithms.

Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous

General Terms
Distributed System, Experimentation, Application

Keywords
Bioinformatics, Hadoop, MapReduce, Cloud computing

1. INTRODUCTION
In recent years Computer Science has become an essential part of the field of Genetics. Especially through the advent of Next Generation Sequencing (NGS), whereby a human genome (3 billion base pairs per chromosome set) can be sequenced in acceptable time, the amount of data is growing significantly, exceeding all previously known dimensions in Genetics. Figure 1 shows a comparison between the falling DNA sequencing costs and Moore's law. Moore's law is used as a reference to show that computer hardware currently cannot keep pace with the progress in DNA sequencing. Furthermore, the number of completely sequenced individuals is growing exponentially from year to year [11], making new models necessary. For instance, storing one complete human DNA (deoxyribonucleic acid) in raw format with 30-times coverage produces 30 terabytes of data.

In the area of Copy Number Variations, a possible cause of many complex genetic disorders, high-throughput algorithms are needed to process and analyze several hundred gigabytes of raw input data [16], [6], yielding a wall time of up to one week for a typical study size [18]. This remarkable increase in data and runtime causes genetic departments to consider new ways of importing and storing data as well as improving the performance of current algorithms. Cluster architectures in connection with associated programming models have the potential to solve this issue, but they are often unprofitable and unaffordable, especially for small departments. Using clusters on demand, also referred to as Infrastructure as a Service (IaaS), therefore offers a good opportunity to circumvent these issues. To exploit the full potential of IaaS, a combination with distribution models like MapReduce [5] is, for specific applications, both possible and obvious. Several isolated applications [9], [10], [14] already exist that use a distributed approach for storing data and processing algorithms. But since no general system is available to execute those solutions, evaluation and reproducibility are often not feasible.
Scientists need to set up a cluster on their own or use a provided remote cluster architecture to evaluate a published algorithm, which is both time-wasting and insecure for sensitive data.

23rd GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 31.05.2011 - 03.06.2011, Obergurgl, Austria.
Copyright is held by the author/owner(s).

Figure 1: Comparison of DNA sequencing cost with Moore's law; data from [17]

In this paper we present the idea to build an integrated system for scientists in the area of Bioinformatics to (1) get access to distributed cluster architectures and execute existing algorithms, (2) build maintainable and reproducible workflows and (3) provide an interface to add future developments or any kind of program to the system without detailed IT knowledge. The remainder of this paper is structured as follows: Section 2 gives an overview of the related work. In Section 3 the architecture of our suggested system is explained in more detail, with potential case studies in Section 4. Section 5 shows necessary future work and the paper ends with a conclusion in Section 6.

2. RELATED WORK
Cluster solutions guided by a web interface to execute distributed algorithms, like Myrna [9], Crossbow [10] or CloudBurst [13], already exist. Unfortunately, the user must log in to the Amazon Web Services (AWS) console to monitor the progress of executed jobs or to shut down the cluster after execution. Additionally, data storage in S3 buckets is often required and a custom web interface needs to be implemented for every single approach. Galaxy [7] is a software system which facilitates the creation, execution and maintainability of pipelines in a fast and user-friendly way. The platform itself executes the scripts and the user has the possibility to monitor the progress. Galaxy's extension CloudMan [1] provides the possibility to install and execute Galaxy on Amazon EC2 (Elastic Compute Cloud). However, the user needs to start the master node manually by using the AWS console, and Galaxy does not provide native support for Hadoop programs; it executes modules step by step and distributes only whole jobs among the cluster.

3. ARCHITECTURE
A modular architecture is suggested in Figure 2, separating the process of instantiating and setting up a cluster (Cloudgene) from the process of monitoring and running a program (EMI). Based on open source frameworks like Apache Hadoop [2] and Apache Whirr [3], we implemented a prototype to verify our approach. The user utilizes Cloudgene to set up a cluster architecture to his needs through XML configuration files. This allows adding new algorithms dynamically without digging too deep into Cloudgene. A fully operable and customized cluster is then provided, including all necessary user data. In a subsequent step, EMI (Elastic MapReduce Interface) is launched on the master node of the cluster. EMI can be seen as an abstraction of the underlying system architecture for the end user; it lies on top of the integrated programs and allows the user to communicate and interact with the cluster as well as receive feedback on currently executed workflows (see Figure 3). EMI can be disabled in case a program already includes an interface of its own, yielding the most general approach to executing any kind of developed solution. Both parts can be operated separately via configuration files with clearly defined input and output variables.

3.1 Cloudgene
Amazon provides with EC2 the currently most developed service for public clouds in the area of IaaS.
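Under the hood, Cloudgene drives Apache Whirr to provision such clusters. For comparison, a stand-alone Whirr launch of a comparable Hadoop cluster on EC2 is described by a properties recipe; the property names below are standard Whirr configuration keys, while the concrete values are merely illustrative:

```
# Illustrative Whirr recipe for a small Hadoop cluster on EC2.
whirr.cluster-name=cloudgene-demo
# One master (namenode + jobtracker) and four workers (datanode + tasktracker).
whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,4 hadoop-datanode+hadoop-tasktracker
whirr.provider=aws-ec2
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
# SSH key pair used to reach the provisioned nodes.
whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
whirr.public-key-file=${sys:user.home}/.ssh/id_rsa.pub
```

Such a cluster is started with `whirr launch-cluster --config hadoop.properties` and torn down with `whirr destroy-cluster`; Cloudgene automates exactly this provisioning step behind a web interface.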
Cloudgene supports, besides EC2, also Rackspace [12] to provide access to cluster infrastructure. As mentioned in the introduction, a combination with MapReduce is useful: in this paradigm, the master node chops the data into chunks and distributes them over all active worker nodes (map step). Subsequently, the master node reassigns coherent map results to worker nodes (sort and shuffle) to calculate the final result (reduce step). For this project, Apache Hadoop's implementation of MapReduce and its distributed file system (HDFS) are used.

Using Whirr as a connector, Cloudgene is able to instantiate a fully working EC2 or Rackspace cluster for end users with various defined properties and copies the necessary program data and configuration files to the cluster. Examples of defined variables are the desired image, the number and kind of instances, HDFS options, MapReduce properties and the user's SSH public key. Amazon already provides several predefined images for all sorts of use cases, which can be used with Cloudgene (e.g. http://www.cloudbiolinux.com). Cloudgene takes over the customization of predefined images and installs services like MapReduce, in our case included in Cloudera's distribution of Apache Hadoop [4]. The cluster configuration is defined in an XML-based file format, including all necessary information for a successful cluster boot. Cloudgene routinely checks if new configurations have been added and offers the possibility to execute newly defined programs. Since EC2 uses a pay-per-use model, end users must provide their Amazon Access ID and Secret Key, which is transferred via Cloudgene to Amazon in a secure way.
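The division of work just described (map, sort and shuffle, reduce) can be illustrated with a small, self-contained word-count example. This is a single-process didactic sketch, not Hadoop code: on a real cluster the map and reduce tasks run on the worker nodes and the data lives in HDFS.

```java
import java.util.*;
import java.util.stream.*;

public class MapReduceSketch {

    // Map step: each input chunk is turned into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String chunk) {
        return Arrays.stream(chunk.toLowerCase().split("\\s+"))
                .filter(w -> !w.isEmpty())
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // Sort and shuffle: group all intermediate values by key (sorted here via TreeMap).
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        return pairs.stream().collect(Collectors.groupingBy(
                Map.Entry::getKey, TreeMap::new,
                Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
    }

    // Reduce step: aggregate the values of each key into the final result.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> result = new TreeMap<>();
        grouped.forEach((k, v) ->
                result.put(k, v.stream().mapToInt(Integer::intValue).sum()));
        return result;
    }

    public static void main(String[] args) {
        // The "master" chops the input into chunks for the workers.
        List<String> chunks = List.of("to be or", "not to be");
        List<Map.Entry<String, Integer>> intermediate = chunks.stream()
                .flatMap(c -> map(c).stream())
                .collect(Collectors.toList());
        System.out.println(reduce(shuffle(intermediate))); // {be=2, not=1, or=1, to=2}
    }
}
```

Hadoop adds to this skeleton exactly the parts that are hard to do by hand: distribution of chunks over nodes, fault tolerance, and the network move of intermediate results during the shuffle.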
Alternatively, Cloudgene can also be launched on any machine having Java installed, eliminating the transfer via our server.

Figure 2: Architecture of the suggested system including Cloudgene and EMI

Cloudgene solves one important issue and gives genetic departments access to computational power and storage. A still unresolved problem is the lack of a graphical user interface to control jobs derived from command-line based applications. Especially the need to put enormous amounts of local data into HDFS has to be considered. To overcome these shortcomings, a user interface (EMI) was designed.

3.2 Efficient MapReduce Interface (EMI)
Running Hadoop MapReduce programs on a cluster requires the execution of several non-trivial steps: first, the user must upload all input data to the master node, copy the data into the proprietary HDFS, run the Hadoop MapReduce job, export the results from the file system and finally download them to the local workstation. For researchers without expertise in Computer Science these tasks turn out to be very challenging. For this purpose we developed EMI, which facilitates the execution, monitoring and evaluation of MapReduce jobs. A web interface, which runs on the master node of the cluster, enables the execution of jobs through well-structured wizards, setting all required parameters step by step. As several studies have shown, reproducibility of data analysis is one of the greatest problems in biomedical publications [15]. For this reason the execution of a MapReduce job with its parameters and input data is logged, so that a fast comparison of experiments with different settings is possible. Moreover, the user always has full control over the execution of each job and can monitor its current progress and status. All running jobs are listed, whereby the progress of the map and reduce phase is displayed separately.

Figure 3: Workflow of the system including Cloudgene and EMI
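The reproducibility logging described above, where each job is stored together with its parameters and input data so that runs can be compared, can be sketched as follows. This is a hypothetical illustration of the idea only; the class, method and parameter names are invented for the example and not taken from EMI.

```java
import java.time.Instant;
import java.util.*;

/**
 * Hypothetical sketch of job bookkeeping for reproducibility:
 * every execution is recorded with its program, parameters and
 * input files, so two runs can be diffed quickly.
 */
public class JobLog {

    record JobRecord(String program, Map<String, String> parameters,
                     List<String> inputs, Instant submitted) {}

    private final List<JobRecord> records = new ArrayList<>();

    /** Store one executed job with everything needed to reproduce it. */
    public void log(String program, Map<String, String> parameters, List<String> inputs) {
        records.add(new JobRecord(program, parameters, inputs, Instant.now()));
    }

    /** Parameters whose values differ between two logged runs. */
    public static Set<String> changedParameters(JobRecord a, JobRecord b) {
        Set<String> keys = new TreeSet<>(a.parameters().keySet());
        keys.addAll(b.parameters().keySet());
        keys.removeIf(k -> Objects.equals(a.parameters().get(k), b.parameters().get(k)));
        return keys;
    }

    public List<JobRecord> records() { return records; }
}
```

Comparing two logged runs of the same program then reduces to a set difference over their parameter maps, which is what makes "a fast comparison of experiments with different settings" possible.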
Since using resources from Amazon costs money, EMI informs the user about the uptime of the cluster and the number of rented instances (Figure 4).

The modular architecture enables a fast integration of any Hadoop job which would normally be executed through the command line. A simple and clear XML configuration file describes the input and output parameters of the program and contains other relevant information that is necessary to start the job (see Section 4). In addition to this file, a zip archive exists which contains all software-relevant data (e.g. jar file, meta data, configuration files). With those files, EMI automatically generates a web interface which provides the possibility to set each defined parameter through wizards and to run the defined job with a single click. As mentioned earlier, all input data must be put into the robust and fault-tolerant HDFS. As this process is very time-intensive and error-prone, EMI supports the user by providing a wizard which enables the import of data from different sources (FTP, HTTP, Amazon S3 buckets or local file uploads). In addition, files defined as output parameters can be exported and downloaded as a zip archive or can be uploaded to Amazon S3 or FTP servers. EMI supports a multi-user mode whereby all data of a certain user are password protected and executed jobs are scheduled through a queue system. Overall, EMI is fully independent from Cloudgene and can be installed on a local Hadoop cluster too.

4. CASE STUDIES
In this section we explain how new programs can be integrated into Cloudgene and EMI. Based on two different biomedical software solutions we demonstrate the diversity and simplicity of our approach. For CloudBurst (Section 4.1), for example, the EMI configuration specifies the command to run together with named input and output parameters:

CloudBurst
hadoop jar emi/cloudburst/CloudBurst.jar \
  $input1 $input2 $output1 36 36 3 0 1 240 \
  48 24 24 128 16
Reference Genome: data/cloudburst/s_suis.br
Reads: data/cloudburst/100k.br
Results: data/cloudburst/results
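The CloudBurst job definition used in Section 4 maps a command-line call to named input and output parameters. Rendered as an XML file of the kind EMI reads, it could look roughly as follows; the tag and attribute names are assumptions for illustration, since only the listing's values appear in the text:

```xml
<!-- Hypothetical EMI job definition; tag and attribute names are illustrative. -->
<program name="CloudBurst">
  <command>
    hadoop jar emi/cloudburst/CloudBurst.jar \
      $input1 $input2 $output1 36 36 3 0 1 240 \
      48 24 24 128 16
  </command>
  <input id="input1" label="Reference Genome" value="data/cloudburst/s_suis.br"/>
  <input id="input2" label="Reads" value="data/cloudburst/100k.br"/>
  <output id="output1" label="Results" value="data/cloudburst/results"/>
</program>
```

From such a description, a web interface can generate one wizard page per input parameter and register the output path for the later export step.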
4.1 CloudBurst
CloudBurst is a parallel read-mapping algorithm to map NGS data to the human genome and other reference genomes [13]. It is implemented as a MapReduce program using Hadoop and can be executed with the following command:

hadoop jar emi/cloudburst/CloudBurst.jar \
  reference_genome reads results 36 36 3 0 1 240 \
  48 24 24 128 16

In order to execute CloudBurst, we create a configuration file for Cloudgene which starts a Hadoop cluster on Amazon EC2 with a standard Ubuntu Linux image and the Hadoop ports 50030 and 50070 open. The corresponding XML registers the program under the name CloudBurst. As CloudBurst has no graphical user interface, we install EMI on the Amazon EC2 cluster and use it for user interactions. For this purpose the command above with its arguments is translated into an EMI configuration file which maps each argument to a named input or output parameter. After the XML file is uploaded to the Cloudgene server, the user starts a web browser to (1) log in to Cloudgene, (2) start up a cluster preconfigured with CloudBurst and (3) run and monitor jobs with EMI (Figure 4). Compared to a standard manual approach, this eliminates error-prone and time-consuming tasks such as (1) setting up a cluster and connecting via the command line to the master node, (2) uploading and importing data into HDFS, (3) exporting final results from HDFS and downloading them and (4) executing and reproducing MapReduce jobs with different configurations via a web interface. This shows that an easy integration can be done using a simple XML configuration, supporting and guiding researchers as far as possible.

4.2 HaploGrep
HaploGrep is a reliable algorithm implemented in a web application to determine the haplogroup affiliation of thousands of mitochondrial DNA profiles [8]. As a web application, HaploGrep uses the ports 80 (http) and 443 (https); therefore these ports are marked as open. The configuration file for Cloudgene with all requirements registers the program under the name Haplogrep. After the cluster setup is finalized, Cloudgene returns a web address which points to the installed instance of HaploGrep.

5. FUTURE WORK
One of the biggest advantages of IaaS is the ability to change the number of needed datanodes on demand. Thus, the next version of Cloudgene is conceived to provide functions for adding and removing instances during runtime. Currently, clusters started with Cloudgene are not data persistent, which leads to data loss after a shutdown. For this purpose we plan to store all results on persistent Amazon EBS volumes. Furthermore, a simple user interface for Hadoop is not only useful for the end user but also for developers. It supports them during the whole prototyping and testing process of novel MapReduce algorithms by highlighting performance bottlenecks. Thus, we plan to implement time measurements of the map, reduce and shuffle phases and to visualize them in an intuitive chart. Additionally, Hadoop plans in its next generation to support alternative programming paradigms to MapReduce, which is particularly important for applications (e.g. K-Means) where custom frameworks outperform MapReduce by an order of magnitude.

6. CONCLUSION
We presented a software system for running and maintaining elastic computer clusters. Our approach combines the individual steps of setting up a cluster into a user-friendly system. Its modular architecture enables a fast integration of any Hadoop job which previously could only be executed through the command line. By hiding the low-level informatics, it is an ideal system for researchers without deeper knowledge in Computer Science. Moreover, our system is not constricted to the life sciences and can be used in nearly every application area. Overall, it is a first approach to narrowing the gap between cloud computing and usability.

7. ACKNOWLEDGMENTS
Sebastian Schönherr was supported by a scholarship from the University of Innsbruck (Doktoratsstipendium aus der Nachwuchsförderung, MIP10/2009/3). Hansi Weißensteiner was supported by a scholarship from the Autonomous Province of Bozen/Bolzano (South Tyrol). The project was supported by the Amazon Research Grant. We thank the Whirr mailing list, especially Tom White and Andrei Savu, for their assistance.

8. REFERENCES
[1] E. Afgan, D. Baker, N. Coraor, B. Chapman, A. Nekrutenko, and J. Taylor. Galaxy CloudMan: delivering cloud compute clusters. BMC Bioinformatics, 11 Suppl 12:S4, 2010.
[2] Apache Hadoop. http://hadoop.apache.org.
[3] Apache Whirr. http://incubator.apache.org/whirr/.
[4] Cloudera. http://www.cloudera.com/.
[5] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. In OSDI'04: Proceedings of the 6th Symposium on Operating Systems Design & Implementation, Berkeley, CA, USA, 2004. USENIX Association.
[6] L. Forer, S. Schönherr, H. Weißensteiner, F. Haider, T. Kluckner, C. Gieger, H. E. Wichmann, G. Specht, F. Kronenberg, and A. Kloss-Brandstätter. CONAN: copy number variation analysis software for genome-wide association studies. BMC Bioinformatics, 11:318, 2010.
[7] J. Goecks, A. Nekrutenko, J. Taylor, E. Afgan, G. Ananda, D. Baker, D. Blankenberg, R. Chakrabarty, N. Coraor, G. Von Kuster, R. Lazarus, K. Li, and K. Vincent. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol., 11:R86, 2010.
[8] A. Kloss-Brandstätter, D. Pacher, S. Schönherr, H. Weißensteiner, R. Binna, G. Specht, and F. Kronenberg. HaploGrep: a fast and reliable algorithm for automatic classification of mitochondrial DNA haplogroups. Hum. Mutat., 32:25–32, Jan 2011.
[9] B. Langmead, K. D. Hansen, and J. T. Leek. Cloud-scale RNA-sequencing differential expression analysis with Myrna. Genome Biol., 11:R83, 2010.
[10] B. Langmead, M. C. Schatz, J. Lin, M. Pop, and S. L. Salzberg. Searching for SNPs with cloud computing.
Genome Biol., 10:R134, 2009.
[11] R. E. Mills et al. Mapping copy number variation by population-scale genome sequencing. Nature, 470:59–65, Feb 2011.
[12] Rackspace. http://www.rackspace.com.
[13] M. C. Schatz. CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics, 25:1363–1369, Jun 2009.
[14] M. C. Schatz. The missing graphical user interface for genomics. Genome Biol., 11:128, 2010.
[15] L. Shi et al. The balance of reproducibility, sensitivity, and specificity of lists of differentially expressed genes in microarray studies. BMC Bioinformatics, 9 Suppl 9:S10, 2008.
[16] K. Wang, M. Li, D. Hadley, R. Liu, J. Glessner, S. F. A. Grant, H. Hakonarson, and M. Bucan. PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Research, 17(11):1665–1674, Nov. 2007.
[17] K. A. Wetterstrand. DNA sequencing costs: data from the NHGRI Large-Scale Genome Sequencing Program. Available: http://www.genome.gov/sequencingcosts; accessed 04/11/11.
[18] H. E. Wichmann, C. Gieger, and T. Illig. KORA-gen – resource for population genetics, controls and a broad spectrum of disease phenotypes. Gesundheitswesen, 67 Suppl 1:26–30, Aug 2005.