=Paper= {{Paper |id=Vol-2363/paper7 |storemode=property |title=Secure Genome Processing in Public Cloud and HPC Environments |pdfUrl=https://ceur-ws.org/Vol-2363/paper7.pdf |volume=Vol-2363 |dblpUrl=https://dblp.org/rec/conf/iwsg/BrinkmannKL0SS17 }} ==Secure Genome Processing in Public Cloud and HPC Environments== https://ceur-ws.org/Vol-2363/paper7.pdf
                       9th International Workshop on Science Gateways (IWSG 2017), 19-21 June 2017



     Secure Genome Processing in Public Cloud and
                 HPC Environments
              André Brinkmann∗ , Jürgen Kaiser∗ , Martin Löwer† , Lars Nagel∗ , Ugur Sahin† , Tim Süß∗
                                    ∗ Zentrum für Datenverarbeitung at JGU Mainz, Germany

                                     Email: {brinkman, kaiserj, nagell, suesst}@uni-mainz.de
                               † Translational Oncology at JGU Mainz (TRON gGmbH), Germany

                                        Email: {martin.loewer, ugur.sahin}@tron-mainz.de


   Abstract—Aligning next generation sequencing data requires        the transfer of sensitive genome data and their processing
significant compute resources. HPC and cloud systems can             secure – without impacting performance too much.
provide sufficient compute capacity, but do not offer the required      In this paper we present a solution to this problem which
data security guarantees. HPC environments are typically de-
signed for many groups of trusted users and often only include       outsources the computations to public multi-user HPC or cloud
minimal security enforcement, while Cloud environments are           facilities while ensuring data security and fast processing. The
mostly under the control of untrusted entities and companies.        techniques applied include adding noise, cryptography, and a
   In this work we present a scalable pipeline approach that         novel method for the parallel processing of genome data of
enables the use of public Cloud and HPC environments, while          many patients. We argue that the system is not only inexpen-
improving the patients’ privacy. The applied techniques include
adding noisy data, cryptography, and a MapReduce program for         sive, but also secure and show that the system’s performance
the parallel processing of data.                                     is only slightly degraded by our security measures.
   Keywords—genome sequencing, data security, MapReduce,                The remainder of the paper is structured as follows: In
Hadoop, pipeline architecture                                        Section II we describe the scenario in more detail and derive
                                                                     five requirements. In Section III we discuss tools and solutions
                      I. INTRODUCTION                                from the literature. In Section IV we outline our approach
                                                                     before we give a more detailed description of our pipeline
   High-throughput sequencing, also known as next-generation         architecture in Section V. In the evaluation in Section VI
sequencing (NGS), is a revolutionary technology that enables         we check whether the requirements are fulfilled. Finally, we
the sequencing of entire genomes in a matter of hours.               conclude the paper in Section VII.
Common techniques like sequencing-by-synthesis require a lot
of computational power for post-processing the genome data,                          II. SCENARIO & REQUIREMENTS
that is, for aligning them to a reference sequence / genome             Every day in biological and medical research, large numbers
and assembling them according to this mapping.                       of genomes are processed to determine their nucleotide se-
   NGS techniques produce large amounts of genome data in            quences. Besides research, genome sequencing is vital for
the form of short reads (about 50 to 300 bases) which usually        the personalized medicine of the present and future [10], for
need to be aligned to a reference genome [20]. NGS has not           example, for individualized immunotherapies which were con-
only revolutionized biological and medical research [19], but        sidered in the CI3 project1 . The techniques of high-throughput
also offers opportunities for the treatment of diseases. The         sequencing or next-generation sequencing have accelerated the
implementation of personalized medicine [10], for example,           process and made it possible to sequence the entire human
requires the analysis of human genomes at a large scale              genome in one day for less than $1000 [8][6].
and involves sequencing genome data from thousands of                   An important part of the process is performed on high-
individuals per year at one facility [7]. The alignment of such      performance computers. Since sequencers produce their (dig-
an amount of data is only possible by utilizing the processing       italized) output in the form of short reads (i.e., sequences),
power of large computing centers.                                    the reads need to be arranged and interpreted. This is done
   Current solutions typically do not involve public cloud           by aligning the reads to a reference genome which is a very
computing or academic high-performance computing (HPC)               costly operation requiring a lot of computational power. At
systems, but rely on expensive in-house facilities to ensure the     the same time, the data are highly confidential and must
privacy of the data. The reason is that the patients’ data could     not be vulnerable to attackers. It is therefore not possible to
be hijacked in public computing environments. HPC systems            simply transmit the data to a public cloud or high-performance
are shared by many groups of trusted users, and data security        computing environment. Yet, on the other hand, it is very / too
has a low priority. Cloud environments, on the other hand,           costly for many institutes to abandon such cheap solutions and
may provide slightly better data security, but they are mostly       install a high-performance computing (HPC) facility in-house.
under the control of untrusted entities and companies. Hence,
to exploit external computing clusters, it is necessary to make        1 http://www.ci-3.de/en


                                wp1         Cluster       wp1_chr1                          P1_wp1_chr1
probe_1.fastq                   wp2                       wp1_chr2                          P1_wp1_chr2                    probe_1.bam
     …                                                      …                Decrypt &          …                               …
                Anonymizer      wp3       MapReduce                                                            Merger
                                                                              Unmix                                        probe_n.bam
probe_n.fastq                    …           bwa          wp1_chrx                          P1_wp1_chrx
                                           samtools         …                                   …
                                wpm
                                                          wpm_chrx                          Pn_wpm_chrx

                                                 Fig. 1: Overview of the pipeline.


   For this reason, we consider the scenario in which a public          distributed file system HDFS [22]. The HDFS file system
HPC cluster is used, but under the condition that the data are          is distributed over the complete Hadoop cluster and stores
secure at any time in that an attacker might steal information,         the job-related data. The resource manager YARN runs
but is not able to identify the person that these data belong to.       applications distributed across a Hadoop cluster and
We assume that the sequencing has already been performed                reserves the resources they require.
and that the short reads are provided as files. We regard the              There are many alternatives to Hadoop. In HPC systems,
environment where the data originate (e.g., some institute) as          one usually uses MPI for job parallelization [24]. Big vendors
a secure environment with strict access rights enforcement.             like Microsoft, Amazon or Google offer infrastructures for
Thus, we only consider the transmission to and the processing           processing big data in the cloud [21][23][11].
at the HPC cluster as vulnerable. It can be accessed by                    In recent years many tools for short-read alignment have
potentially malicious users, and no data being processed or             been developed. The first ones focused on reads that consist of
stored there can be deemed secure. So, malicious users may              at most 100 nucleotide bases. Due to the small read size it was
read the data and try to identify patients, either by reading the       assumed that the quality / correctness of the sequences was
identifiers of the sequences or by aligning the data to a (not          high. Examples of these first-generation tools are Bowtie [13],
necessarily complete) genome of the patient that they obtained          Burrows-Wheeler Aligner [14] and CUSHAW [18]. In our envi-
prior to the attack.                                                    ronment we use the Burrows-Wheeler Aligner in combination
   Altogether we have the following requirements:                       with SAMtools [16].
   1) The computationally expensive sequence alignment and                 New NGS sequencers are able to generate reads with longer
       assembly must be performed on a shared, potentially              base sequences. One drawback of these reads is that they con-
       vulnerable HPC cluster.                                          tain more errors, for which the alignment tool must compensate.
   2) The data should not be stored on disk in the cluster              Aligners that follow the seed-and-extend paradigm can deal
       and must be completely deleted once the computation is           with an increased number of incorrect bases. Examples are
       finished.                                                        BWA-SW [15], Bowtie2 [12] or CUSHAW2 [17].
   3) Wherever possible, the data must be secured using                    There are some approaches that use big data techniques for
       strong encryption algorithms.                                    their short-read alignments. Similar to our approach, Abuin
   4) While it cannot be ruled out that data are intercepted            et al. use the MapReduce framework Apache Spark [27]
       during computation in a public environment, it must not          for parallelizing the Burrows-Wheeler Aligner [2]. However,
       be possible to identify persons or probes based on the           they do not secure the processed data, which rules out the
       data.                                                            use of insecure shared environments. Chen et al. presented
   5) There must be no significant degradation of the compu-            another approach based on the seed-and-extend paradigm that
       tational performance, for example, by using encryption           uses public and private computing clouds for their computa-
       or other means to provide data security.                         tions [3]. The public cloud is used to reduce the number of
                                                                        potential alignment positions while the reads’ final positions
   In our use case we considered a particular system where
                                                                        are determined in the private cloud. In contrast to our ap-
the data are generated by Illumina sequencers which produce
                                                                        proach, this method only aligns the reads of a single patient,
reads of roughly 50 to 300 nucleotides. The design and
                                                                        whereas our approach processes the data of many patients in
implementation of the system were influenced by this use case,
                                                                        parallel.
but most of the descriptions and results are valid for other
types of sequencers as well.                                                                   IV. APPROACH
                                                                           In this section we briefly describe the main ideas and
                    III. RELATED WORK
                                                                        techniques of our approach before we give a more detailed
   For processing big data, Hadoop [26] is one of the best              description of our architecture in the next section. As we
known MapReduce frameworks. Embarrassingly parallel jobs                stated in the requirements, the new system must guarantee
can be easily mapped onto this framework. In a MapReduce                data security and perform fast parallel alignment of sequencing
framework tasks are mapped onto nodes for processing, and               data.
the intermediate results are then reduced to final results [5].            We use a pipeline architecture which we divide into four
Hadoop consists of three main components: the MapReduce                 phases. The pre- and postprocessing Phases I, III and IV in
framework itself, the resource manager YARN [25], and the               the secure environment package and sort the data, but they





mostly serve data security in Phase II. Phase II is the critical         in the Decrypt & Unmix phase. Finally, the Merger merges all
one in which the data are sent to and processed on the shared            files of a patient or probe into one final output file.
HPC cluster. During this complete phase the patient / probe                 A detailed explanation of the individual steps is given in
identifiers of the reads are encrypted which is possible as              the following subsections. The last subsection describes what
they are not needed for the alignment. The rest of the data              the user has to do to trigger the pipeline process.
is only encrypted during transmission, but vulnerable when it
is processed.                                                            A. Phase I: Anonymizer
   Since it is possible to identify patients based on parts of
                                                                            In the first phase, the Anonymizer reads the input data –
their DNA even if the respective genome sequences are in a
                                                                         consisting of single or paired-ended reads of many patients –
larger pool of reads [3][9], we additionally confuse potential
                                                                         from local storage in the secure area, adds encrypted identi-
attackers by mixing large amounts of reads, salting the mix
                                                                         fiers, salts the input with fake data, and creates work packages
with additional fake ones and randomizing the order of all
                                                                         (wp) of randomly mixed reads that are written back to local
reads. If the noisy sequences are chosen cleverly, it is much
                                                                         storage. In the next paragraphs we describe these tasks in more
more difficult for attackers to identify patients by applying
                                                                         detail.
statistical methods. In our pipeline, fake data is randomly
picked from a large, fixed set of genome data.                              The input is given as files in FASTQ format which can
   For data security, we finally ensure that data are never stored       be gzip-compressed. If compressed, the file names must end
in persistent memory in the cluster to reduce the number of              with .gz. The Anonymizer can handle single-ended as well as
side channels that could be used by attackers.                           paired-ended reads. It looks for single-ended input files using
                                                                         the regular expression .+.fastq.* and for paired-ended
   With respect to performance, we exploit the fact that se-             input files using .+_[1|2].fastq.*. Two paired-ended
quences can be aligned independent of each other and use                 input files are assumed to belong together if their prefixes
a parallel MapReduce program. The principle of MapReduce                 before _1 and _2, respectively, are the same. The shared prefix
fits very well to our scenario [5]: the “Map” step aligns the            is then used as the identifier or probe name. The probe name of
sequences and sorts the aligned reads, the “Reduce” step                 an input file with single-ended reads is the file name without
collects and outputs the data chromosome-wise. We chose                  the .fastq ending.
Hadoop (with YARN) as MapReduce framework. For the                          For later processing stages, the patient / probe of each read
alignment we use the Burrows-Wheeler Aligner (bwa) [14]                  must be retraceable because the Anonymizer mixes reads from
and for sorting the SAMtools tool suite [16].                            several patients / probes into the same work package. If this
   The file formats that we use are a result of the design /             information was not added, the later stages could not map
software decisions. FASTQ is the format of the files pro-                the reads back to their respective patients / probes. For this
duced by Illumina sequencers. The Sequence Alignment/Map                 reason, the Anonymizer augments each read (pair) with a
(SAM) format “is a generic alignment format for storing                  patient / probe identifier, which is then piggybacked through
read alignments against reference sequences” and the Binary              the following stages. This is done before encryption. The
Alignment/Map (BAM) format is the binary representation of               Anonymizer adds the identifier to the first line of a read entry.
it [16]. Both formats are supported by the Burrows-Wheeler               More precisely, it prepends the patient / probe identifier to
Aligner and the SAMtools.                                                the existing read identifier separating both by a #. If, for
                                                                         example, the first line reads @foo and if the patient ID is
                V. PIPELINE ARCHITECTURE                                 Pa, the concatenation becomes @Pa#foo.
   This section describes our pipeline architecture which we                The anonymization is achieved by combining the following
restrict to the computational part; i.e., we ignore the data             techniques:
acquisition by the sequencers and only consider the process                1) Salting: The actual data consisting of many different
between the point at which the input data in the form of short                probes are salted by throwing dummy reads into the
reads are gathered on computers in the safe environment and                   mix. These dummy reads contain, for example, rare
the point at which the aligned reads are written back to the safe             SNPs so that individual patients cannot be identified
environment. We assume that the short reads are contained in                  based on special genome segments. An attacker simply
FASTQ files and that the results are returned in BAM format.                  cannot distinguish between real and fake data which are
Hence, our solution consists of a pipeline that takes FASTQ                   processed like real data, but eventually filtered out in
files as input and outputs the aligned reads as BAM files.                    Phase III.
   The pipeline is shown in Figure 1 and can be divided into               2) Mixing: Real and fake data are randomly assigned to the
four phases. Each phase takes the output data of the previous                 work packages which are sent to the cluster and which
one as input and creates a new output. For security and                       form the input of Phase II. This way, the attacker cannot
performance reasons, the input data are first anonymized and                  gather the relationship of two reads from their positions
divided into work packages. In Phase II, each work package                    in the work packages.
is processed in its own Hadoop job in an HPC system. Back                  3) Encryption: The identifier is encrypted using the Ad-
in the secure area, the output is de-anonymized and reordered                 vanced Encryption Standard (AES) as defined in the





      U.S. Federal Information Processing Standards Publica-             output file named “header”. The ordering of the header lines
      tion 197 [1][4]. As we cannot maintain the encryption              is not changed during this process. The output files are
      ordering in the later decryption step, the electronic code         also in SAM format (but without header lines). The name
      book (ECB) mode is used. To prevent equal plain text               of an output file specifies a patient / probe name, a work
      blocks from being encrypted into identical ciphertext blocks, we   package name and a chromosome and follows the pattern
      salt the identifiers by adding random bits to them at fixed        __.
      positions. The Anonymizer supports AES-128, AES-192                   The decrypt & unmix tool iterates over the input files and,
      and AES-256 encryption. The respective key of 16, 24               in each file, over all reads in the given order. For each read,
      or 32 bytes can be provided in the form of an input file. If       it decrypts its identifier and simply appends the read to the
      no (valid) keystore file is given, the Anonymizer reads 32         output file of the respective patient / probe, work package and
      bytes from /dev/random and uses them as the key. Note              chromosome. If there is no such file, it is created.
      that the encrypted identifier is not directly written to the
      output file as it may include characters which render              D. Phase IV: Merger
      it invalid for later processing. Therefore, the encrypted             The final step in the pipeline is the Merger which merges
      data are base64-encoded before they are written to the             the output files of Phase III such that there is one single file
      output file.                                                       for each patient / probe.
B. Phase II: Hadoop                                                         The input has one file for each combination of patient /
                                                                         probe, work package and chromosome. Each file is a list of
   The second phase computes the alignment of the reads with             reads sorted by their position within the chromosome. Hence,
respect to a given reference genome. It is the only one executed         it remains to join the data for each patient / probe in one file
in a shared computing center. When the data are sent to the              using a merge operation similar to the one of the merge sort
computing center and back, OpenSSH is used for encryption.               algorithm. The Merger uses the regular expression .+_wp.+
The computation is organized according to the MapReduce model            to identify input files. The prefix before the underscore is the
and uses the Hadoop framework with YARN.                                 identifier of the patient / probe.
   Each work package generated in Phase I is processed by                   The Merger must wait for Phase III to finish and can
a separate Hadoop job. As explained in Section IV, the                   therefore not run in parallel with it. It iterates over the input files
Map function computes the alignments for all reads using                 in chromosome order and, in each file, over all reads in the
the Burrows-Wheeler Aligner and sorts the output reads by                given order. For each patient, it first merges the data from
their position in the reference genome using the SAMtools                all her files belonging to chromosome 1, then from all files
utilities. After that, the Hadoop partitioner partitions the reads       belonging to chromosome 2 and so on. In each step, it selects
for the Reduce function. Our partitioner partitions them by              the read with the lowest position and adds it to the output file
their chromosomes and sends them to the reducers where                   of the corresponding patient.
each reducer represents one chromosome. The actual Reduce
                                                                            While the entries in a chromosome file are already sorted
function outputs the data without further processing. For each
                                                                         by their mapping position in the reference genome, the chro-
chromosome one output file is created.
                                                                         mosome order must be established using the file “header”,
   Considered in more detail, a map task runs three instances
                                                                         provided that it was created in Phase III, or the reference
of the Burrows-Wheeler Aligner bwa. Assuming paired-ended
                                                                         genome’s annotation file (*.ann) or sequence dictionary
(single-ended) reads, it performs two bwa aln calls which
                                                                         (*.dict). If none of these files exist, the tool aborts.
align the reads and one bwa sampe (bwa samse) call
                                                                            The Merger uses heuristics and parameters to estimate the
which transforms the results to SAM format.
                                                                         number of merges that can be performed in parallel and
   To further improve security, the YARN environment is
                                                                         reserves CPU cores and memory accordingly.
configured to use only local RAM disks of the compute
nodes. Hence, no read is stored on persistent storage during             E. Pipeline Usage
computation.
                                                                           Before the pipeline process is started, the user has to define
C. Phase III: Decrypt & Unmix                                            input and output directory and provide the data. Once started,
   The Decrypt & Unmix stage processes the output (chromo-               the pipeline does not require any interventions from the user.
some) files of the Hadoop jobs. Our tool generates one output            So, the steps are:
file for every patient / probe and work package, loops through             1) Define workspace directory () and input
all reads in the input, decrypts their identifiers and sorts them              directory (e.g., /input).
in the right output files. Fake reads are discarded.                       2) Copy the configuration file to /input
   Input files are identified by the regular expression                        and modify it if necessary.
.+_chr.+, and all other files in the input directory are                   3) Call the starter script: starter -i /
ignored. It is assumed that the input files are in SAM format,                 input -w 
and missing header lines are tolerated. If the tool finds header           4) Wait for the pipeline to finish.
lines in one of the input files, it copies them to an additional           5) Obtain results from /results.
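
The tagging, salting and mixing steps of Phase I can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the in-memory representation of a FASTQ record as a list of four lines, and the reserved identifier FAKE for dummy reads are assumptions made for this example, and the AES encryption and base64 encoding of the identifier are omitted.

```python
import random


def tag_read(record, probe_id):
    """Prepend the probe identifier to the read identifier: '@foo' -> '@Pa#foo'."""
    header, rest = record[0], record[1:]
    if not header.startswith("@"):
        raise ValueError("a FASTQ header line must start with '@'")
    return ["@" + probe_id + "#" + header[1:]] + rest


def build_work_packages(reads_by_probe, fake_reads, n_fake, package_size, rng):
    """Tag real reads, salt them with fake reads, shuffle, and cut into work packages."""
    pool = [tag_read(rec, probe)
            for probe, recs in reads_by_probe.items()
            for rec in recs]
    # Assumption: fake reads carry a reserved identifier so that a later
    # stage can filter them out; in the pipeline this identifier would be
    # encrypted like any other and thus not be visible on the cluster.
    pool += [tag_read(rec, "FAKE") for rec in rng.sample(fake_reads, n_fake)]
    # Randomizing the order hides which reads belong together.
    rng.shuffle(pool)
    return [pool[i:i + package_size] for i in range(0, len(pool), package_size)]
```

Passing the random number generator explicitly keeps the sketch reproducible in tests; a production tool would draw from a cryptographically secure source instead.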

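
The routing performed in Phase III can be sketched in the same spirit. The example assumes that the identifier and the original read name are separated by # in the first SAM field, that the reference name in the third SAM field denotes the chromosome, and that output names look like P1_wp1_chr1 as in Fig. 1. The decrypt callback stands in for the base64 decoding and AES decryption, and both FAKE and the restoring of the original read name are assumptions of this sketch.

```python
from collections import defaultdict

FAKE_ID = "FAKE"  # hypothetical identifier of salted dummy reads


def unmix(sam_lines, work_package, decrypt=lambda token: token):
    """Route aligned reads to one output list per probe, work package and chromosome."""
    outputs = defaultdict(list)
    for line in sam_lines:
        if line.startswith("@"):
            continue  # SAM header lines are not reads
        fields = line.rstrip("\n").split("\t")
        token, _, read_name = fields[0].partition("#")
        probe = decrypt(token)
        if probe == FAKE_ID:
            continue  # fake reads are discarded
        fields[0] = read_name  # assumption: restore the original read name
        key = f"{probe}_{work_package}_{fields[2]}"
        # Appending in input order preserves the position ordering of the input file.
        outputs[key].append("\t".join(fields))
    return dict(outputs)
```

Because reads are appended in the order in which they are encountered, each output stays sorted by position whenever the input file is sorted, which is what the Merger relies on.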

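
The merge operation of Phase IV resembles the merge step of merge sort and can be sketched with Python's heapq.merge. The sketch assumes that each input file is available as a list of SAM lines already sorted by the position in the fourth field, and that the chromosome order has been read beforehand from the header or the reference annotation. The function names are chosen for this example only.

```python
import heapq


def sam_position(line):
    """Return the 1-based mapping position stored in the fourth SAM field."""
    return int(line.split("\t")[3])


def merge_probe(files_by_chrom, chrom_order):
    """Merge all position-sorted files of one probe, chromosome by chromosome."""
    merged = []
    for chrom in chrom_order:
        sorted_files = files_by_chrom.get(chrom, [])
        # heapq.merge repeatedly selects the read with the lowest position
        # among the heads of all input files of this chromosome.
        merged.extend(heapq.merge(*sorted_files, key=sam_position))
    return merged
```

Since heapq.merge consumes its inputs lazily, the same approach would work with open file handles instead of lists, which keeps memory use low for large work packages.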



  [Figure 2: two plots. Left panel "Abs. times pipeline": time in seconds (0 to 40000) for anonymize,
  processing, and merge on 1, 4, and 16 nodes. Right panel "Avg. processing times of the workp.":
  relative time (0.0 to 1.2) for upload, mapred, cpy back, and DU on 1, 4, and 16 nodes.]
                                                                                                                            Concerning the quality of the encryption, we pointed out
                                                                                                                         before that we have to use AES in ECB mode, but that we
                                                                                                                         avoid identical ciphertext blocks in the case of identical plain
                                                                                                                         text blocks by adding random bits to the plain text. For this
                                                                                                                         reason, the encryption techniques, AES and OpenSSH, can be
                                                                                                                         assumed to be strong.
                                                                                                                            For analyzing the fifth requirement, we ran tests in an HPC
                                                                                                                         cluster. The “secure environment” is a remote computer, but
                                                                                                                         connected via a high-speed network link (10 GB/s). Figure 2
                                                                                                                         shows processing times of the pipeline for an input data set
                                                                                                                         of 237M short reads (100 GB) and different cluster sizes,
                                                                                                                         each node having 64 cores and 128 GB memory. The left
  Fig. 2: Left: absolute runtimes of Anonymizer, Merger, and the
                                                                                                                         plot indicates that the processing scales extremely well. The
  Hadoop processing in between. Right: runtime for processing
                                                                                                                         running time is roughly quartered when the number of nodes
  the work packages relative to other operations (DU: decrypt
                                                                                                                         is quadrupled.
  & unmix).
                                                                                                                            The right figure shows the relative processing times of the
                                                                                                                         work packages. For a small number of nodes, the upload to the
                                                                                                                         cluster dominates the total time because the cluster nodes can
     The configuration specified in the configuration file is
                                                                                                                         store only a limited number of work packages in their memory
  related to the cluster side and can be determined by the
                                                                                                                         and the packages must remain in the secure area until space
  provider: Pipeline / cluster settings (size of work packages,
                                                                                                                         in the cluster nodes is freed. For a large number of nodes,
  #nodes, #threads, #pseudo samples, host name, user group
                                                                                                                         however, almost all the time is spent on processing, and the
  on host, batch queue on host etc.), tool settings (Hadoop,
                                                                                                                         overhead of using an outside cluster is neglectable.
  BWA, SAMtools and Java) and the location of additional data
                                                                                                                            In both figures, one can see that the runtimes of Phase I
  (reference genome, pseudo genome data).
                                                                                                                         (Anonymizer) and III (Decrypt & Unmix) are relatively short
     During the run, the pipeline creates a heavy IO workload
                                                                                                                         so that the overhead due to anonymizing / encryption and
  in the workspace directory. Hence, it is advisable to put the
                                                                                                                         deanonymizing / decryption is acceptable. Note that since
  workspace on a strong storage backend.
                                                                                                                         work package processing starts as soon as the first work
                                            VI. E VALUATION                                                              package arrives at the cluster, the runtimes of the Anonymizer
                                                                                                                         and the work package processing also overlap.
     In this section we evaluate whether the system meets the five
  requirements of Section II. The first and the second one are                                                                                     VII. C ONCLUSION
  obviously fulfilled because we perform all costly operations                                                              In this work we have presented a system for processing sen-
  on a public HPC cluster and we do this without storing any                                                             sitive genome data in a public environment without harming
  data on disk.                                                                                                          the privacy of patients. Our secure genome processing pipeline
     The third requirement is met because the data are during                                                            is already used for processing data of real cancer patients in
  transmission to and from the cluster. On the cluster nodes                                                             a university’s HPC cluster which is shared with many other
  themselves the sequences are readable, but for processing they                                                         users. As shown in our evaluation, the system’s processing
  have to be.                                                                                                            performance is very good and scales excellently.
     The identifiers of the patients / probes, however, remain
  encrypted even on the cluster nodes which helps with the                                                                                        ACKNOWLEDGMENT
  fourth requirement which is a bit more tricky. While it is not                                                            This work was supported by the German Federal Ministry
  possible to read the identifier without breaking the encryption,                                                       of Education and Research (BMBF) under grant 131A029D
  it might be possible to identify a person in the mix by reading                                                        (Project “CI3”).
  sufficiently many sequences, mapping them to an old reference
  genome of that person (provided that the attacker has one) and                                                                                      R EFERENCES
  computing the statistical probability for that person to be in                                                          [1] “Announcing the advanced encryption standard (aes),” Federal Informa-
  the mix. Chen et al. [3] and Homer et al. [9] showed that                                                                   tion Processing Standards Publication 197, Tech. Rep., 2001.
                                                                                                                          [2] J. M. Abuín, J. C. Pichel, T. F. Pena, and J. Amigo, “Sparkbwa: Speeding
  this is possible to a certain degree utilizing the existence of                                                             up the alignment of high-throughput dna sequencing data,” PLoS ONE,
  rare SNPs, but that it becomes more and more difficult the                                                                  vol. 11, no. 5, 2016.
  more other reads are in the mix. Unfortunately, there is no                                                             [3] Y. Chen, B. Peng, X. Wang, and H. Tang, “Large-scale privacy-
                                                                                                                              preserving mapping of human genomic sequences on hybrid clouds,”
  precise analysis of how many and what data the mix has to be                                                                in NDSS, 2012.
  salted with to render this attack ineffective. In our approach                                                          [4] J. Daemen and V. Rijmen, The Design of Rijndael. Secaucus, NJ, USA:
  we usually at least double the amount of data and pick reads                                                                Springer-Verlag New York, Inc., 2002.
                                                                                                                          [5] J. Dean and S. Ghemawat, “Mapreduce: Simplified data processing on
  with rare SNPs so that the patients’ peculiarities do not stand                                                             large clusters,” Commun. ACM, vol. 51, no. 1, pp. 107–113, Jan. 2008.
  out.                                                                                                                        [Online]. Available: http://doi.acm.org/10.1145/1327452.1327492




 [6] B. J. Fikes. (2017) New Machines Can Sequence Human Genome in One Day.
     [Online]. Available:
     http://www.sci-tech-today.com/news/Genome-Sequencing-in-One-Day/story.xhtml?story_id=023001ATQM02
 [7] G. S. Ginsburg and H. F. Willard, Eds., Genomic and Personalized
     Medicine, 2nd ed. Academic Press, 2013. [Online]. Available:
     http://www.sciencedirect.com/science/article/pii/B978012382227700121X
 [8] J. M. Heather and B. Chain, “The sequence of sequencers: The history
     of sequencing DNA,” Genomics, vol. 107, no. 1, pp. 1–8, 2016.
     [Online]. Available:
     http://www.sciencedirect.com/science/article/pii/S0888754315300410
 [9] N. Homer, S. Szelinger, M. Redman, D. Duggan, W. Tembe, J. Muehling,
     J. V. Pearson, D. A. Stephan, S. F. Nelson, and D. W. Craig,
     “Resolving individuals contributing trace amounts of DNA to highly
     complex mixtures using high-density SNP genotyping microarrays,” PLoS
     Genet, vol. 4, no. 8, pp. 1–9, Aug. 2008.
[10] S. Kreiter, M. Vormehr, N. van de Roemer, M. Diken, M. Löwer,
     J. Diekmann, S. Boegel, B. Schrörs, F. Vascotto, J. C. Castle,
     A. D. Tadmor, S. P. Schoenberger, C. Huber, Ö. Türeci, and U. Sahin,
     “Mutant MHC class II epitopes drive therapeutic immune responses to
     cancer,” Nature, vol. 520, pp. 692–696, Mar. 2015.
[11] S. T. Krishnan and J. U. Gonzalez, Building Your Next Big Thing with
     Google Cloud Platform: A Guide for Developers and Enterprise
     Architects, 1st ed. Berkeley, CA, USA: Apress, 2015.
[12] B. Langmead and S. L. Salzberg, “Fast gapped-read alignment with
     Bowtie 2,” Nature Methods, vol. 9, no. 4, pp. 357–359, Mar. 2012.
[13] B. Langmead, C. Trapnell, M. Pop, and S. L. Salzberg, “Ultrafast and
     memory-efficient alignment of short DNA sequences to the human
     genome,” Genome Biology, vol. 10, no. 3, p. R25, 2009.
[14] H. Li and R. Durbin, “Fast and accurate short read alignment with
     Burrows-Wheeler transform,” Bioinformatics, vol. 25, no. 14,
     pp. 1754–1760, Jul. 2009. [Online]. Available:
     http://dx.doi.org/10.1093/bioinformatics/btp324
[15] H. Li and R. Durbin, “Fast and accurate long-read alignment with
     Burrows–Wheeler transform,” Bioinformatics, vol. 26, no. 5,
     pp. 589–595, Mar. 2010.
[16] H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer,
     G. Marth, G. Abecasis, and R. Durbin, “The Sequence Alignment/Map
     Format and SAMtools,” Bioinformatics, vol. 25, no. 16, pp. 2078–2079,
     Aug. 2009. [Online]. Available:
     http://dx.doi.org/10.1093/bioinformatics/btp352
[17] Y. Liu and B. Schmidt, “Long read alignment based on maximal exact
     match seeds,” Bioinformatics, vol. 28, no. 18, pp. i318–i324, 2012.
[18] Y. Liu, B. Schmidt, and D. L. Maskell, “CUSHAW: a CUDA compatible
     short read aligner to large genomes based on the Burrows–Wheeler
     transform,” Bioinformatics, vol. 28, no. 14, p. 1830, 2012.
[19] S. Marguerat and J. Bähler, “RNA-seq: from technology to biology,”
     Cell Mol Life Sci, vol. 67, no. 4, pp. 569–579, Feb. 2010. [Online].
     Available: http://dx.doi.org/10.1007/s00018-009-0180-6
[20] S. Moorthie, C. J. Mattocks, and C. F. Wright, “Review of massively
     parallel DNA sequencing technologies,” Hugo Journal, vol. 5, no. 1-4,
     pp. 1–12, 2011.
[21] R. Nadipalli, HDInsight Essentials, 2nd ed. Packt Publishing, 2015.
[22] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, “The Hadoop
     distributed file system,” in Proceedings of the 2010 IEEE 26th
     Symposium on Mass Storage Systems and Technologies (MSST), ser.
     MSST ’10. Washington, DC, USA: IEEE Computer Society, 2010,
     pp. 1–10. [Online]. Available:
     http://dx.doi.org/10.1109/MSST.2010.5496972
[23] A. Singh and V. Rayapati, Learning Big Data with Amazon Elastic
     MapReduce. Packt Publishing, 2014.
[24] M. Snir, S. W. Otto, D. W. Walker, J. Dongarra, and S. Huss-Lederman,
     MPI: The Complete Reference. Cambridge, MA, USA: MIT Press, 1995.
[25] V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar,
     R. Evans, T. Graves, J. Lowe, H. Shah, S. Seth, B. Saha, C. Curino,
     O. O’Malley, S. Radia, B. Reed, and E. Baldeschwieler, “Apache Hadoop
     YARN: Yet another resource negotiator,” in Proceedings of the 4th
     Annual Symposium on Cloud Computing, ser. SOCC ’13. New York, NY,
     USA: ACM, 2013, pp. 5:1–5:16. [Online]. Available:
     http://doi.acm.org/10.1145/2523616.2523633
[26] T. White, Hadoop: The Definitive Guide. O’Reilly Media, Inc., 2012.
[27] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica,
     “Spark: Cluster computing with working sets,” in Proceedings of the
     2nd USENIX Conference on Hot Topics in Cloud Computing, ser.
     HotCloud’10. Berkeley, CA, USA: USENIX Association, 2010,
     pp. 10–10. [Online]. Available:
     http://dl.acm.org/citation.cfm?id=1863103.1863113