=Paper=
{{Paper
|id=Vol-2363/paper7
|storemode=property
|title=Secure Genome Processing in Public Cloud and HPC Environments
|pdfUrl=https://ceur-ws.org/Vol-2363/paper7.pdf
|volume=Vol-2363
|dblpUrl=https://dblp.org/rec/conf/iwsg/BrinkmannKL0SS17
}}
==Secure Genome Processing in Public Cloud and HPC Environments==
9th International Workshop on Science Gateways (IWSG 2017), 19-21 June 2017
André Brinkmann∗, Jürgen Kaiser∗, Martin Löwer†, Lars Nagel∗, Ugur Sahin†, Tim Süß∗
∗ Zentrum für Datenverarbeitung at JGU Mainz, Germany
Email: {brinkman, kaiserj, nagell, suesst}@uni-mainz.de
† Translational Oncology at JGU Mainz (TRON gGmbH), Germany
Email: {martin.loewer, ugur.sahin}@tron-mainz.de
Abstract—Aligning next generation sequencing data requires significant compute resources. HPC and cloud systems can provide sufficient compute capacity, but do not offer the required data security guarantees. HPC environments are typically designed for many groups of trusted users and often only include minimal security enforcement, while Cloud environments are mostly under the control of untrusted entities and companies. In this work we present a scalable pipeline approach that enables the use of public Cloud and HPC environments, while improving the patients’ privacy. The applied techniques include adding noisy data, cryptography, and a MapReduce program for the parallel processing of data.
Keywords—genome sequencing, data security, MapReduce, Hadoop, pipeline architecture
I. INTRODUCTION
High-throughput sequencing, also known as next-generation sequencing (NGS), is a revolutionary technology that enables the sequencing of entire genomes in a matter of hours. Common techniques like sequencing-by-synthesis require a lot of computational power for post-processing the genome data, that is, for aligning them to a reference sequence / genome and assembling them according to this mapping.
NGS techniques produce large amounts of genome data in the form of short reads (about 50 to 300 bases) which usually need to be aligned to a reference genome [20]. NGS has not only revolutionized biological and medical research [19], but also offers opportunities for the treatment of diseases. The implementation of personalized medicine [10], for example, requires the analysis of human genomes at a large scale and involves sequencing genome data from thousands of individuals per year at one facility [7]. The alignment of such an amount of data is only possible by utilizing the processing power of large computing centers.
Current solutions typically do not involve public cloud computing or academic high-performance computing (HPC) systems, but rely on expensive in-house facilities to ensure the privacy of the data. The reason is that the patients’ data could be hijacked in public computing environments. HPC systems are shared by many groups of trusted users, and data security has a low priority. Cloud environments, on the other hand, may provide slightly better data security, but they are mostly under the control of untrusted entities and companies. Hence, to exploit external computing clusters, it is necessary to make the transfer of sensitive genome data and their processing secure without impacting performance too much.
In this paper we present a solution to this problem which outsources the computations to public multi-user HPC or cloud facilities while ensuring data security and fast processing. The techniques applied include adding noise, cryptography, and a novel method for the parallel processing of genome data of many patients. We argue that the system is not only inexpensive, but also secure and show that the system’s performance is only slightly degraded by our security measures.
The remainder of the paper is structured as follows: In Section II we describe the scenario in more detail and derive five requirements. In Section III we discuss tools and solutions from the literature. In Section IV we outline our approach before we give a more detailed description of our pipeline architecture in Section V. In the evaluation in Section VI we check whether the requirements are fulfilled. Finally, we conclude the paper in Section VII.
II. SCENARIO & REQUIREMENTS
Every day in biological and medical research, large numbers of genomes are processed to determine their nucleotide sequences. Besides research, genome sequencing is vital for the personalized medicine of the present and future [10], for example, for individualized immunotherapies which were considered in the CI3 project (http://www.ci-3.de/en). The techniques of high-throughput sequencing or next-generation sequencing have accelerated the process and made it possible to sequence the entire human genome in one day for less than $1000 [8][6].
An important part of the process is performed on high-performance computers. Since sequencers produce their (digitized) output in the form of short reads (i.e., sequences), the reads need to be arranged and interpreted. This is done by aligning the reads to a reference genome, which is a very costly operation requiring a lot of computational power. At the same time, the data are highly confidential and must not be vulnerable to attackers. It is therefore not possible to simply transmit the data to a public cloud or high-performance computing environment. Yet, on the other hand, it is too costly for many institutes to forgo such inexpensive solutions and install a high-performance computing (HPC) facility in-house.
Fig. 1: Overview of the pipeline. The input files probe_1.fastq … probe_n.fastq pass through the Anonymizer, which creates the work packages wp1 … wpm. The cluster (MapReduce with bwa and samtools) produces the per-chromosome files wp1_chr1 … wpm_chrx, Decrypt & Unmix produces the files P1_wp1_chr1 … Pn_wpm_chrx, and the Merger writes probe_1.bam … probe_n.bam.
For this reason, we consider the scenario in which a public HPC cluster is used, but under the condition that the data are secure at any time in the sense that an attacker might steal information, but is not able to identify the person that these data belong to.
We assume that the sequencing has already been performed and that the short reads are provided as files. We regard the environment where the data originate (e.g., some institute) as a secure environment with strict access rights enforcement. Thus, we only consider the transmission to and the processing at the HPC cluster as vulnerable. It can be accessed by potentially malicious users, and no data being processed or stored there can be deemed secure. So, malicious users may read the data and try to identify patients, either by reading the identifiers of the sequences or by aligning the data to a (not necessarily complete) genome of the patient that they obtained prior to the attack.
Altogether we have the following requirements:
1) The computationally expensive sequence alignment and assembly must be performed on a shared, potentially vulnerable HPC cluster.
2) The data should not be stored on disk in the cluster and must be completely deleted once the computation is finished.
3) Wherever possible, the data must be secured using strong encryption algorithms.
4) While it cannot be guaranteed that data are not intercepted during computation in a public environment, it must not be possible to identify persons or probes based on the data.
5) There must be no significant degradation of the computational performance, for example, by using encryption or other means to provide data security.
In our use case we considered a particular system where the data are generated by Illumina sequencers which produce reads of roughly 50 to 300 nucleotides. The design and implementation of the system were influenced by this use case, but most of the descriptions and results are valid for other types of sequencers as well.
III. RELATED WORK
For processing big data, Hadoop [26] is one of the best known MapReduce frameworks. Embarrassingly parallel jobs can be easily mapped onto this framework. In a MapReduce framework tasks are mapped onto nodes for processing, and the intermediate results are then reduced to final results [5]. Hadoop consists of three main components: the MapReduce framework itself, the resource manager YARN [25], and the distributed file system HDFS [22]. HDFS is distributed over the complete Hadoop cluster and stores the job-related data. The resource manager YARN makes it possible to execute distributed applications on a Hadoop cluster and to reserve the required resources.
There are many alternatives to Hadoop. In HPC systems, one usually uses MPI for job parallelization [24]. Big vendors like Microsoft, Amazon or Google offer infrastructures for processing big data in the cloud [21][23][11].
In recent years many tools for short-read alignment have been developed. The first ones focused on reads that consist of at most 100 nucleotide bases. Due to the small read size it was assumed that the quality / correctness of the sequences was high. Examples of these first-generation tools are Bowtie [13], Burrows-Wheeler Aligner [14] and CUSHAW [18]. In our environment we use the Burrows-Wheeler Aligner in combination with SAMtools [16].
New NGS sequencers are able to generate reads with longer base sequences. One drawback of these reads is that they contain more errors, which the alignment tool must compensate for. Aligners that follow the seed-and-extend paradigm can deal with an increased number of incorrect bases. Examples are BWA-SW [15], Bowtie2 [12] or CUSHAW2 [17].
There are some approaches that use big data techniques for their short-read alignments. Similar to our approach, Abuín et al. use the MapReduce framework Apache Spark [27] for parallelizing the Burrows-Wheeler Aligner [2]. However, they do not secure the processed data, which rules out the use of insecure shared environments. Chen et al. presented another approach based on the seed-and-extend paradigm that uses public and private computing clouds for their computations [3]. The public cloud is used to reduce the number of potential alignment positions while the reads’ final positions are determined in the private cloud. In contrast to our approach, this method only aligns the reads of a single patient, whereas our approach processes the data of many patients in parallel.
IV. APPROACH
In this section we briefly describe the main ideas and techniques of our approach before we give a more detailed description of our architecture in the next section. As we stated in the requirements, the new system must guarantee data security and perform fast parallel alignment of sequencing data.
We use a pipeline architecture which we divide into four phases. The pre- and postprocessing Phases I, III and IV in the secure environment package and sort the data, but they
mostly serve data security in Phase II. Phase II is the critical one in which the data are sent to and processed on the shared HPC cluster. During this complete phase the patient / probe identifiers of the reads are encrypted, which is possible as they are not needed for the alignment. The rest of the data is only encrypted during transmission, but vulnerable when it is processed.
Since it is possible to identify patients based on parts of their DNA even if the respective genome sequences are in a larger pool of reads [3][9], we additionally hinder potential attackers by mixing large amounts of reads, salting the mix with additional fake ones and randomizing the order of all reads. If the noisy sequences are chosen cleverly, it is much more difficult for attackers to identify patients by applying statistical methods. In our pipeline, fake data is randomly picked from a large, fixed set of genome data.
For data security, we finally ensure that data are never stored on persistent storage in the cluster to reduce the number of side channels that could be used by attackers.
With respect to performance, we exploit the fact that sequences can be aligned independently of each other and use a parallel MapReduce program. The principle of MapReduce fits our scenario very well [5]: the “Map” step aligns the sequences and sorts the aligned reads, the “Reduce” step collects and outputs the data chromosome-wise. We chose Hadoop (with YARN) as MapReduce framework. For the alignment we use the Burrows-Wheeler Aligner (bwa) [14] and for sorting the SAMtools tool suite [16].
The file formats that we use are a result of the design / software decisions. FASTQ is the format of the files produced by Illumina sequencers. The Sequence Alignment/Map (SAM) format “is a generic alignment format for storing read alignments against reference sequences” and the Binary Alignment/Map (BAM) format is the binary representation of it [16]. Both formats are supported by the Burrows-Wheeler Aligner and the SAMtools.
V. PIPELINE ARCHITECTURE
This section describes our pipeline architecture which we restrict to the computational part; i.e., we ignore the data acquisition by the sequencers and only consider the process between the point at which the input data in the form of short reads are gathered on computers in the safe environment and the point at which the aligned reads are written back to the safe environment. We assume that the short reads are contained in FASTQ files and that the results are returned in BAM format. Hence, our solution consists of a pipeline that takes FASTQ files as input and outputs the aligned reads as BAM files.
The pipeline is shown in Figure 1 and can be divided into four phases. Each phase takes the output data of the previous one as input and creates a new output. For security and performance reasons, the input data are first anonymized and divided into work packages. In Phase II, each work package is processed in its own Hadoop job in an HPC system. Back in the secure area, the output is de-anonymized and reordered in the Decrypt & Unmix phase. Finally, the Merger merges all files of a patient or probe into one final output file.
A detailed explanation of the individual steps is given in the following subsections. The last subsection describes what the user has to do to trigger the pipeline process.
A. Phase I: Anonymizer
In the first phase, the Anonymizer reads the input data (single or paired-ended reads of many patients) from local storage in the secure area, adds encrypted identifiers, salts the input with fake data, and creates work packages (wp) of randomly mixed reads that are written back to local storage. In the next paragraphs we describe these tasks in more detail.
The input is given as files in FASTQ format which can be gzip-compressed. If compressed, the file names must end with .gz. The Anonymizer can handle single-ended as well as paired-ended reads. It looks for single-ended input files using the regular expression .+.fastq.* and for paired-ended input files using .+_[1|2].fastq.*. Two paired-ended input files are assumed to belong together if their prefixes before _1 and _2, respectively, are the same. The shared prefix is then used as the identifier or probe name. The probe name of an input file with single-ended reads is the file name without the .fastq ending.
For later processing stages, the patient / probe of each read must be retraceable because the Anonymizer mixes reads from several patients / probes into the same work package. If this information were not added, the later stages could not map the reads back to their respective patients / probes. For this reason, the Anonymizer augments each read (pair) with a patient / probe identifier, which is then piggybacked through the following stages. This is done before encryption. The Anonymizer adds the identifier to the first line of a read entry. More precisely, it prepends the patient / probe identifier to the existing read identifier, separating both by a #. If, for example, the first line reads @foo and if the patient ID is Pa, the concatenation becomes @Pa#foo.
The anonymization is achieved by combining the following techniques:
1) Salting: The actual data consisting of many different probes are salted by throwing dummy reads into the mix. These dummy reads contain, for example, rare SNPs so that individual patients cannot be identified based on special genome segments. An attacker simply cannot distinguish between real and fake data, which are processed like real data, but eventually filtered out in Phase III.
2) Mixing: Real and fake data are randomly assigned to the work packages which are sent to the cluster and which form the input of Phase II. This way, the attacker cannot infer the relationship of two reads from their positions in the work packages.
3) Encryption: The identifier is encrypted using the Advanced Encryption Standard (AES) as defined in the
U.S. Federal Information Processing Standards Publication 197 [1][4]. As we cannot maintain the encryption ordering in the later decryption step, the electronic code book (ECB) mode is used. To prevent equal plain text blocks from being encrypted into identical ciphertext blocks, we salt the identifiers by adding random bits to them at fixed positions. The Anonymizer supports AES-128, AES-192 and AES-256 encryption. The respective key of 16, 24 or 32 bytes can be provided in the form of an input file. If no (valid) keystore file is given, the Anonymizer reads 32 bytes from /dev/random and uses them as the key. Note that the encrypted identifier is not directly written to the output file as it may include characters which render it invalid for later processing. Therefore, the encrypted data are base64-encoded before they are written to the output file.
B. Phase II: Hadoop
The second phase computes the alignment of the reads with respect to a given reference genome. It is the only one executed in a shared computing center. When the data are sent to the computing center and back, OpenSSH is used for encryption. The computation is organized according to the MapReduce model and uses the Hadoop framework with YARN.
Each work package generated in Phase I is processed by a separate Hadoop job. As explained in Section IV, the Map function computes the alignments for all reads using the Burrows-Wheeler Aligner and sorts the output reads by their position in the reference genome using the SAMtools utilities. After that, the Hadoop partitioner partitions the reads for the Reduce function. Our partitioner partitions them by their chromosomes and sends them to the reducers where each reducer represents one chromosome. The actual Reduce function outputs the data without further processing. For each chromosome one output file is created.
Considered in more detail, a map task runs three instances of the Burrows-Wheeler Aligner bwa. Assuming paired-ended (single-ended) reads, it performs two bwa aln calls which align the reads and one bwa sampe (bwa samse) call which transforms the results to SAM format.
To further improve security, the YARN environment is configured to use only local RAM disks of the compute nodes. Hence, no read is stored on persistent storage during computation.
C. Phase III: Decrypt & Unmix
The Decrypt & Unmix stage processes the output (chromosome) files of the Hadoop jobs. Our tool generates one output file for every patient / probe and work package, loops through all reads in the input, decrypts their identifiers and sorts them into the right output files. Fake reads are discarded.
Input files are identified by the regular expression .+_chr.+, and all other files in the input directory are ignored. It is assumed that the input files are in SAM format, and missing header lines are tolerated. If the tool finds header lines in one of the input files, it copies them to an additional output file named “header”. The ordering of the header lines is not changed during this process. The output files are also in SAM format (but without header lines). The name of an output file specifies a patient / probe name, a work package name and a chromosome and follows the pattern __ (compare the file names such as P1_wp1_chr1 in Figure 1).
The decrypt & unmix tool iterates over the input files and, in each file, over all reads in the given order. For each read, it decrypts its identifier and simply appends the read to the output file of the respective patient / probe, work package and chromosome. If there is no such file, it is created.
D. Phase IV: Merger
The final step in the pipeline is the Merger which merges the output files of Phase III such that there is one single file for each patient / probe.
The input has one file for each combination of patient / probe, work package and chromosome. Each file is a list of reads sorted by their position within the chromosome. Hence, it remains to join the data for each patient / probe in one file using a merge operation similar to the one of the merge sort algorithm. The Merger uses the regular expression .+_wp.+ to identify input files. The prefix before the underscore is the identifier of the patient / probe.
The Merger must wait for Phase III to finish and can therefore not run in parallel with it. It iterates over the input files in chromosome order and, in each file, over all reads in the given order. For each patient, it first merges the data from all of the patient’s files belonging to chromosome 1, then from all files belonging to chromosome 2 and so on. In each step, it selects the read with the lowest position and adds it to the output file of the corresponding patient.
While the entries in a chromosome file are already sorted by their mapping position in the reference genome, the chromosome order must be established using the file “header”, provided that it was created in Phase III, or the reference genome’s annotation file (*.ann) or sequence dictionary (*.dict). If none of these files exist, the tool aborts.
The Merger uses heuristics and parameters to estimate the number of merges that can be performed in parallel and reserves CPU cores and memory accordingly.
E. Pipeline Usage
Before the pipeline process is started, the user has to define input and output directory and provide the data. Once started, the pipeline does not require any interventions from the user. So, the steps are:
1) Define workspace directory () and input directory (e.g., /input).
2) Copy the configuration file to /input and modify it if necessary.
3) Call the starter script: starter -i /input -w
4) Wait for the pipeline to finish.
5) Obtain results from /results.
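Since the pipeline runs without user intervention once it has been started, it can be useful to check beforehand that the files in the input directory follow the naming conventions of Phase I. The following Python sketch groups file names into probes under those conventions. It is only an illustration and not part of the pipeline: the function name group_fastq_files is hypothetical, and the two regular expressions are stricter variants of the ones quoted in Section V-A, assuming that the dot before "fastq" is meant literally and that the only allowed suffix is ".gz".

```python
import re
from collections import defaultdict

# Assumed, stricter forms of the patterns from Section V-A:
# paired-ended files end in _1/_2 before ".fastq", optionally followed by ".gz".
PAIRED = re.compile(r"^(?P<probe>.+)_[12]\.fastq(\.gz)?$")
SINGLE = re.compile(r"^(?P<probe>.+)\.fastq(\.gz)?$")


def group_fastq_files(file_names):
    """Map each probe name to the sorted list of its input files.

    Files that match neither pattern are ignored. The paired pattern is
    tried first, so the shared prefix before _1/_2 becomes the probe name.
    """
    probes = defaultdict(list)
    for name in file_names:
        match = PAIRED.match(name) or SINGLE.match(name)
        if match:
            probes[match.group("probe")].append(name)
    return {probe: sorted(files) for probe, files in probes.items()}


if __name__ == "__main__":
    listing = ["Pa_1.fastq.gz", "Pa_2.fastq.gz", "Pb.fastq", "notes.txt"]
    for probe, files in group_fastq_files(listing).items():
        kind = "paired" if len(files) == 2 else "single"
        print(probe, kind, files)
```

In this sketch a probe with two files is treated as paired-ended and a probe with one file as single-ended. A single-ended file whose name happens to end in _1 or _2 would be misclassified, so such names are best avoided.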
The configuration file covers settings that relate to the cluster side and can be determined by the provider: pipeline / cluster settings (size of work packages, #nodes, #threads, #pseudo samples, host name, user group on host, batch queue on host etc.), tool settings (Hadoop, BWA, SAMtools and Java) and the location of additional data (reference genome, pseudo genome data).
During the run, the pipeline creates a heavy IO workload in the workspace directory. Hence, it is advisable to put the workspace on a strong storage backend.
VI. EVALUATION
In this section we evaluate whether the system meets the five requirements of Section II. The first and the second one are fulfilled because we perform all costly operations on a public HPC cluster and we do this without storing any data on disk.
The third requirement is met because the data are encrypted during transmission to and from the cluster. On the cluster nodes themselves the sequences are readable, but for processing they have to be.
The identifiers of the patients / probes, however, remain encrypted even on the cluster nodes, which helps with the fourth requirement, which is trickier. While it is not possible to read the identifier without breaking the encryption, it might be possible to identify a person in the mix by reading sufficiently many sequences, mapping them to an old reference genome of that person (provided that the attacker has one) and computing the statistical probability for that person to be in the mix. Chen et al. [3] and Homer et al. [9] showed that this is possible to a certain degree utilizing the existence of rare SNPs, but that it becomes more and more difficult the more other reads are in the mix. Unfortunately, there is no precise analysis of how many and what data the mix has to be salted with to render this attack ineffective. In our approach we usually at least double the amount of data and pick reads with rare SNPs so that the patients’ peculiarities do not stand out.
Concerning the quality of the encryption, we pointed out before that we have to use AES in ECB mode, but that we avoid identical ciphertext blocks in the case of identical plain text blocks by adding random bits to the plain text. For this reason, the encryption techniques, AES and OpenSSH, can be assumed to be strong.
For analyzing the fifth requirement, we ran tests in an HPC cluster. The “secure environment” is a remote computer, but connected via a high-speed network link (10 GB/s). Figure 2 shows processing times of the pipeline for an input data set of 237M short reads (100 GB) and different cluster sizes, each node having 64 cores and 128 GB memory. The left plot indicates that the processing scales extremely well. The running time is roughly quartered when the number of nodes is quadrupled.
Fig. 2: Left: absolute runtimes (time in seconds) of Anonymizer, Merger, and the Hadoop processing in between, for 1, 4 and 16 nodes. Right: runtime for processing the work packages relative to other operations (upload, mapred, DU: decrypt & unmix, copy back), for 1, 4 and 16 nodes.
The right figure shows the relative processing times of the work packages. For a small number of nodes, the upload to the cluster dominates the total time because the cluster nodes can store only a limited number of work packages in their memory and the packages must remain in the secure area until space in the cluster nodes is freed. For a large number of nodes, however, almost all the time is spent on processing, and the overhead of using an outside cluster is negligible.
In both figures, one can see that the runtimes of Phase I (Anonymizer) and III (Decrypt & Unmix) are relatively short so that the overhead due to anonymizing / encryption and deanonymizing / decryption is acceptable. Note that since work package processing starts as soon as the first work package arrives at the cluster, the runtimes of the Anonymizer and the work package processing also overlap.
VII. CONCLUSION
In this work we have presented a system for processing sensitive genome data in a public environment without harming the privacy of patients. Our secure genome processing pipeline is already used for processing data of real cancer patients in a university’s HPC cluster which is shared with many other users. As shown in our evaluation, the system’s processing performance is very good and scales excellently.
ACKNOWLEDGMENT
This work was supported by the German Federal Ministry of Education and Research (BMBF) under grant 131A029D (Project “CI3”).
REFERENCES
[1] “Announcing the Advanced Encryption Standard (AES),” Federal Information Processing Standards Publication 197, Tech. Rep., 2001.
[2] J. M. Abuín, J. C. Pichel, T. F. Pena, and J. Amigo, “SparkBWA: Speeding up the alignment of high-throughput DNA sequencing data,” PLoS ONE, vol. 11, no. 5, 2016.
[3] Y. Chen, B. Peng, X. Wang, and H. Tang, “Large-scale privacy-preserving mapping of human genomic sequences on hybrid clouds,” in NDSS, 2012.
[4] J. Daemen and V. Rijmen, The Design of Rijndael. Secaucus, NJ, USA: Springer-Verlag New York, Inc., 2002.
[5] J. Dean and S. Ghemawat, “MapReduce: Simplified data processing on large clusters,” Commun. ACM, vol. 51, no. 1, pp. 107–113, Jan. 2008. [Online]. Available: http://doi.acm.org/10.1145/1327452.1327492
[6] B. J. Fikes. (2017) New Machines Can Sequence Human Genome in One Day. [Online]. Available: http://www.sci-tech-today.com/news/Genome-Sequencing-in-One-Day/story.xhtml?story_id=023001ATQM02
[7] G. S. Ginsburg and H. F. Willard, Eds., Genomic and Personalized Medicine, 2nd ed. Academic Press, 2013. [Online]. Available: http://www.sciencedirect.com/science/article/pii/B978012382227700121X
[8] J. M. Heather and B. Chain, “The sequence of sequencers: The history of sequencing DNA,” Genomics, vol. 107, no. 1, pp. 1–8, 2016. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0888754315300410
[9] N. Homer, S. Szelinger, M. Redman, D. Duggan, W. Tembe, J. Muehling, J. V. Pearson, D. A. Stephan, S. F. Nelson, and D. W. Craig, “Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays,” PLoS Genet, vol. 4, no. 8, pp. 1–9, Aug. 2008.
[10] S. Kreiter, M. Vormehr, N. van de Roemer, M. Diken, M. Löwer, J. Diekmann, S. Boegel, B. Schrörs, F. Vascotto, J. C. Castle, A. D. Tadmor, S. P. Schoenberger, C. Huber, Ö. Türeci, and U. Sahin, “Mutant MHC class II epitopes drive therapeutic immune responses to cancer,” Nature, vol. 520, pp. 692–696, Mar. 2015.
[11] S. T. Krishnan and J. U. Gonzalez, Building Your Next Big Thing with Google Cloud Platform: A Guide for Developers and Enterprise Architects, 1st ed. Berkeley, CA, USA: Apress, 2015.
[12] B. Langmead and S. L. Salzberg, “Fast gapped-read alignment with Bowtie 2,” Nature Methods, vol. 9, no. 4, pp. 357–359, Mar. 2012.
[13] B. Langmead, C. Trapnell, M. Pop, and S. L. Salzberg, “Ultrafast and memory-efficient alignment of short DNA sequences to the human genome,” Genome Biology, vol. 10, no. 3, p. R25, 2009.
[14] H. Li and R. Durbin, “Fast and accurate short read alignment with Burrows-Wheeler transform,” Bioinformatics, vol. 25, no. 14, pp. 1754–1760, Jul. 2009. [Online]. Available: http://dx.doi.org/10.1093/bioinformatics/btp324
[15] H. Li and R. Durbin, “Fast and accurate long-read alignment with Burrows–Wheeler transform,” Bioinformatics, vol. 26, no. 5, pp. 589–595, Mar. 2010.
[16] H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, and R. Durbin, “The Sequence Alignment/Map Format and SAMtools,” Bioinformatics, vol. 25, no. 16, pp. 2078–2079, Aug. 2009. [Online]. Available: http://dx.doi.org/10.1093/bioinformatics/btp352
[17] Y. Liu and B. Schmidt, “Long read alignment based on maximal exact match seeds,” Bioinformatics, vol. 28, no. 18, pp. i318–i324, 2012.
[18] Y. Liu, B. Schmidt, and D. L. Maskell, “CUSHAW: a CUDA compatible short read aligner to large genomes based on the Burrows–Wheeler transform,” Bioinformatics, vol. 28, no. 14, p. 1830, 2012.
[19] S. Marguerat and J. Bähler, “RNA-seq: from technology to biology,” Cell Mol Life Sci, vol. 67, no. 4, pp. 569–579, Feb. 2010. [Online]. Available: http://dx.doi.org/10.1007/s00018-009-0180-6
[20] S. Moorthie, C. J. Mattocks, and C. F. Wright, “Review of massively parallel DNA sequencing technologies,” Hugo Journal, vol. 5, no. 1-4, pp. 1–12, 2011.
[21] R. Nadipalli, HDInsight Essentials, 2nd ed. Packt Publishing, 2015.
[22] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, “The Hadoop distributed file system,” in Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), ser. MSST ’10. Washington, DC, USA: IEEE Computer Society, 2010, pp. 1–10. [Online]. Available: http://dx.doi.org/10.1109/MSST.2010.5496972
[23] A. Singh and V. Rayapati, Learning Big Data with Amazon Elastic MapReduce. Packt Publishing, 2014.
[24] M. Snir, S. W. Otto, D. W. Walker, J. Dongarra, and S. Huss-Lederman, MPI: The Complete Reference. Cambridge, MA, USA: MIT Press, 1995.
[25] V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe, H. Shah, S. Seth, B. Saha, C. Curino, O. O’Malley, S. Radia, B. Reed, and E. Baldeschwieler, “Apache Hadoop YARN: Yet another resource negotiator,” in Proceedings of the 4th Annual Symposium on Cloud Computing, ser. SOCC ’13. New York, NY, USA: ACM, 2013, pp. 5:1–5:16. [Online]. Available: http://doi.acm.org/10.1145/2523616.2523633
[26] T. White, Hadoop: The Definitive Guide. O’Reilly Media, Inc., 2012.
[27] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: Cluster computing with working sets,” in Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, ser. HotCloud’10. Berkeley, CA, USA: USENIX Association, 2010, pp. 10–10. [Online]. Available: http://dl.acm.org/citation.cfm?id=1863103.1863113