<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>IWSG</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Secure Genome Processing in Public Cloud and HPC Environments</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>André Brinkmann , Jürgen Kaiser</institution>
          ,
          <addr-line>Martin Löwer</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <volume>19</volume>
      <fpage>19</fpage>
      <lpage>21</lpage>
      <abstract>
        <p>-Aligning next generation sequencing data requires significant compute resources. HPC and cloud systems can provide sufficient compute capacity, but do not offer the required data security guarantees. HPC environments are typically designed for many groups of trusted users and often only include minimal security enforcement, while Cloud environments are mostly under the control of untrusted entities and companies. In this work we present a scalable pipeline approach that enables the use of public Cloud and HPC environments, while improving the patients' privacy. The applied techniques include adding noisy data, cryptography, and a MapReduce program for the parallel processing of data.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>I. INTRODUCTION</title>
      <sec id="sec-1-1">
        <title>High-throughput sequencing, also known as next-generation</title>
        <p>sequencing (NGS), is a revolutionary technology that enables
the sequencing of entire genomes in a matter of hours.
Common techniques like sequencing-by-synthesis require a lot
of computational power for post-processing the genome data,
that is, for aligning them to a reference sequence / genome
and assembling them according to this mapping.</p>
        <p>
          NGS techniques produce large amounts of genome data in
the form of short reads (about 50 to 300 bases) which usually
need to be aligned to a reference genome [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]. NGS has not
only revolutionized biological and medical research [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], but
also offers opportunities for the treatment of diseases. The
implementation of personalized medicine [
          <xref ref-type="bibr" rid="ref10 ref28 ref29">10</xref>
          ], for example,
requires the analysis of human genomes at a large scale
and involves sequencing genome data from thousands of
individuals per year at one facility [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. The alignment of such
an amount of data is only possible by utilizing the processing
power of large computing centers.
        </p>
        <p>Current solutions typically do not involve public cloud
computing or academic high-performance computing (HPC)
systems, but rely on expensive in-house facilities to ensure the
privacy of the data. The reason is that the patients’ data could
be hijacked in public computing environments. HPC systems
are shared by many groups of trusted users, and data security
has a low priority. Cloud environments, on the other hand,
may provide a slightly better data security, but they are mostly
under the control of untrusted entities and companies. Hence,
to exploit external computing clusters, it is necessary to make
the transfer of sensitive genome data and their processing
secure – without impacting performance too much.</p>
        <p>In this paper we present a solution to this problem which
outsources the computations to public multi-user HPC or cloud
facilities while ensuring data security and fast processing. The
techniques applied include adding noise, cryptography, and a
novel method for the parallel processing of genome data of
many patients. We argue that the system is not only
inexpensive, but also secure and show that the system’s performance
is only slightly degraded by our security measures.</p>
        <p>The remainder of the paper is structured as follows: In
Section II we describe the scenario in more detail and derive
five requirements. In Section III we discuss tools and solutions
from the literature. In Section IV we outline our approach
before we give a more detailed description of our pipeline
architecture in Section V. In the evaluation in Section VI
we check whether the requirements are fulfilled. Finally, we
conclude the paper in Section VII.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>II. SCENARIO &amp; REQUIREMENTS</title>
      <p>
        Every day in biological and medical research, large amounts
of genomes are processed to determine their nucleotide
sequences. Besides research, genome sequencing is vital for
the personalized medicine of the present and future [
        <xref ref-type="bibr" rid="ref10 ref28 ref29">10</xref>
        ], for
example, for individualized immunotherapies which were
considered in the CI3 project1. The techniques of high-throughput
sequencing or next-generation sequencing have accelerated the
process and made it possible to sequence the entire human
genome in one day for less than $1000 [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ][
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>An important part of the process is performed on
highperformance computers. Since sequencers produce their
(digitalized) output in the form of short reads (i.e., sequences),
the reads need to be arranged and interpreted. This is done
by aligning the reads to a reference genome which is a very
costly operation requiring a lot of computational power. At
the same time, the data are highly confidential and must
not be vulnerable to attackers. It is therefore not possible to
simply transmit the data to a public cloud or high-performance
computing environment. Yet, on the other hand, it is very / too
costly for many institutes to abandon such cheap solutions and
install a high-performance computing (HPC) facility in-house.
probe_1.fastq</p>
      <p>…
probe_n.fastq
wp1
wp2
wp3
…
wpm</p>
      <p>Cluster
MapReduce</p>
      <p>bwa
samtools
wp1_chr1
wp1_chr2</p>
      <p>…
wp1_chrx</p>
      <p>…
wpm_chrx</p>
      <p>Decrypt &amp;</p>
      <p>Unmix</p>
      <p>P1_wp1_chr1
P1_wp1_chr2</p>
      <p>…
P1_wp1_chrx</p>
      <p>…
Pn_wpm_chrx</p>
      <p>For this reason, we consider the scenario in which a public
HPC cluster is used, but under the condition that the data are
secure at any time in that an attacker might steal information,
but is not able to identify the person that these data belong to.
We assume that the sequencing has already been performed
and that the short reads are provided as files. We regard the
environment where the data originate (e.g., some institute) as
a secure environment with strict access rights enforcement.
Thus, we only consider the transmission to and the processing
at the HPC cluster as vulnerable. It can be accessed by
potentially malicious users, and no data being processed or
stored there can be deemed secure. So, malicious users may
read the data and try to identify patients, either by reading the
identifiers of the sequences or by aligning the data to a (not
necessarily complete) genome of the patient that they obtained
prior to the attack.</p>
      <p>Altogether we have the following requirements:
1) The computationally expensive sequence alignment and
assembly must be performed on a shared, potentially
vulnerable HPC cluster.
2) The data should not be stored on disk in the cluster
and must be completely deleted once the computation is
finished.
3) Wherever possible, the data must be secured using
strong encryption algorithms.
4) While it cannot be guaranteed that data are intercepted
during computation in a public environment, it must not
be possible to identify persons or probes based on the
data.
5) There must be no significant degradation of the
computational performance, for example, by using encryption
or other means to provide data security.</p>
      <p>In our use case we considered a particular system where
the data are generated by Illumina sequencers which produce
reads of roughly 50 to 300 nucleotides. The design and
implementation of the system were influenced by this use case,
but most of the descriptions and results are valid for other
types of sequencers as well.</p>
    </sec>
    <sec id="sec-3">
      <title>III. RELATED WORK</title>
      <p>
        For processing big data, Hadoop [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] is one of the best
known MapReduce frameworks. Embarrassingly parallel jobs
can be easily mapped onto this framework. In a MapReduce
framework tasks are mapped onto nodes for processing, and
the intermediate results are then reduced to final results [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
Hadoop consists of three main components: the MapReduce
framework itself, the resource manager YARN [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ], and the
distrubuted file system HDFS [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. The HDFS file system
is distributed over the complete Hadoop cluster and stores
the job-related data. The resource manager YARN allows to
execute applications distributed on a Hadoop cluster and to
reserve the required resources.
      </p>
      <p>
        There are many alternatives to Hadoop. In HPC systems,
one usually uses MPI for job parallelization [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. Big vendors
like Microsoft, Amazon or Google offer infrastructures for
processing big data in the cloud [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ][
        <xref ref-type="bibr" rid="ref23">23</xref>
        ][
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>
        In recent years many tools for short-read alignment have
been developed. The first ones focused on reads that consist of
at most 100 nucleotide bases. Due to the small read size it was
assumed that the quality / correctness of the sequences was
high. Examples for these first-generation tools are Bowtie [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ],
      </p>
      <sec id="sec-3-1">
        <title>Burrows-Wheeler Aligner [14] and CUSHAW [18]. In our envi</title>
        <p>
          ronment we use the Burrows-Wheeler Aligner in combination
with SAMtools [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ].
        </p>
        <p>
          New NGS sequencers are able to generate reads with longer
base sequences. One drawback of these reads is that they
contain more errors which the alignment tool must compensate.
Aligners that follow the seed-and-extend paradigm, can deal
with an increased number of incorrect bases. Examples are
BWA-SW [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], Bowtie2 [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] or CUSHAW2 [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ].
        </p>
        <p>
          There are some approaches that use big data techniques for
their short-read alignments. Similar to our approach, Abuin
et al. use the MapReduce framework Apache Spark [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ]
for parallelizing the Burrows-Wheeler Aligner [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. However,
they do not secure the processed data what disallows the
usage of unsecure shared environments. Chen et al. presented
another approach based on the seed-and-extend paradigm that
uses public and private computing clouds for their
computations [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. The public cloud is used to reduce the number of
potential alignment positions while the reads’ final positions
are determined in the private cloud. In contrast to our
approach, this method only aligns the reads of a single patient,
whereas our approach processes the data of many patients in
parallel.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>IV. APPROACH</title>
      <p>In this section we shortly describe the main ideas and
techniques of our approach before we give a more detailed
description of our architecture in the next section. As we
stated in the requirements, the new system must guarantee
data security and perform fast parallel alignment of sequencing
data.</p>
      <p>We use a pipeline architecture which we divide into four
phases. The pre- and postprocessing Phases I, III and IV in
the secure environment package and sort the data, but they
mostly serve data security in Phase II. Phase II is the critical
one in which the data are sent to and processed on the shared
HPC cluster. During this complete phase the patient / probe
identifiers of the reads are encrypted which is possible as
they are not needed for the alignment. The rest of the data
is only encrypted during transmission, but vulnerable when it
is processed.</p>
      <p>
        Since it is possible to identify patients based on parts of
their DNA even if the respective genome sequences are in a
larger pool of reads [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ][
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], we additionally obfuscate potential
attackers by mixing large amounts of reads, salting the mix
with additional fake ones and randomizing the order of all
reads. If the noisy sequences are chosen cleverly, it is much
more difficult for attackers to identify patients by applying
statistical methods. In our pipeline, fake data is randomly
picked from a large, fixed set of genome data.
      </p>
      <p>For data security, we finally ensure that data are never stored
in persistent memory in the cluster to reduce the number of
side channels that could be used by attackers.</p>
      <p>
        With respect to performance, we exploit the fact that
sequences can be aligned independent of each other and use
a parallel MapReduce program. The principle of MapReduce
fits very well to our scenario [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]: the “Map” step aligns the
sequences and sorts the aligned reads, the “Reduce” step
collects and outputs the data chromosome-wise. We chose
Hadoop (with YARN) as MapReduce framework. For the
alignment we use the Burrows-Wheeler Aligner (bwa) [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]
and for sorting the SAMtools tool suite [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
      <p>
        The file formats that we use are a result of the design /
software decisions. FASTQ is the format of the files
produced by Illumina sequencers. The Sequence Alignment/Map
(SAM) format “is a generic alignment format for storing
read alignments against reference sequences” and the Binary
Alignment/Map (BAM) format is the binary representation of
it [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Both formats are supported by the Burrows-Wheeler
Aligner and the SAMtools.
      </p>
    </sec>
    <sec id="sec-5">
      <title>V. PIPELINE ARCHITECTURE</title>
      <p>This section describes our pipeline architecture which we
restrict to the computational part; i.e., we ignore the data
acquisition by the sequencers and only consider the process
between the point at which the input data in the form of short
reads are gathered on computers in the safe environment and
the point at which the aligned reads are written back to the safe
environment. We assume that the short reads are contained in
FASTQ files and that the results are returned in BAM format.
Hence, our solution consists of a pipeline that takes FASTQ
files as input and outputs the aligned reads as BAM files.</p>
      <p>The pipeline is shown in Figure 1 and can be divided into
four phases. Each phase takes the output data of the previous
one as input and creates a new output. For security and
performance reasons, the input data are first anonymized and
divided into work packages. In Phase II, each work package
is processed in its own Hadoop job in an HPC system. Back
in the secure area, the output is de-anomymized and reordered
in the Decrypt &amp; Unmix phase. Finally, the Merger merges all
files of a patient or probe into one final output file.</p>
      <p>A detailed explanation of the individual steps is given in
the following subsections. The last subsection describes what
the user has to do to trigger the pipeline process.</p>
      <sec id="sec-5-1">
        <title>A. Phase I: Anonymizer</title>
        <p>In the first phase, the Anonymizer reads the input data –
consisting of single or paired-ended reads of many patients –
from local storage in the secure area, adds encrypted
identifiers, salts the input with fake data, and creates work packages
(wp) of randomly mixed reads that are written back to local
storage. In the next paragraphs we describe these tasks in more
detail.</p>
        <p>The input is given as files in FASTQ format which can
be gzip-compressed. If compressed, the file names must end
with .gz. The Anonymizer can handle single-ended as well as
paired-ended reads. It looks for single-ended input files using
the regular expression .+.fastq.* and and for paired-ended
input files using .+_[1|2].fastq.*. Two paired-ended
input files are assumed to belong together if their prefixes
before _1 and _2, repectively, are the same. The shared prefix
is then used as the identifier or probe name. The probe name of
an input file with single-ended reads is the file name without
the .fastq ending.</p>
        <p>For later processing stages, the patient / probe of each read
must be retraceable because the Anonymizer mixes reads from
several patients / probes into the same work package. If this
information was not added, the later stages could not map
the reads back to their respective patients / probes. For this
reason, the Anonymizer augments each read (pair) with a
patient / probe identifier, which is then piggybacked through
the following stages. This is done before encryption. The
Anonymizer adds the identifier to the first line of a read entry.
More precisely, it prepends the patient / probe identifier to
the existing read identifier separating both by a #. If, for
example, the first line reads @foo and if the patient ID is
Pa, the concatenation becomes @Pa#foo.</p>
        <p>
          The anonymization is achieved by combining the following
techniques:
1) Salting: The actual data consisting of many different
probes are salted by throwing dummy reads into the
mix. These dummy reads contain, for example, rare
SNPs so that individual patients cannot be identified
based on special genome segments. An attacker simply
cannot distinguish between real and fake data which are
processed like real data, but eventually filtered out in
Phase III.
2) Mixing: Real and fake data are randomly assigned to the
work packages which are sent to the cluster and which
form the input of Phase II. This way, the attacker cannot
gather the relationship of two reads from their positions
in the work packages.
3) Encryption: The identifier is encrypted using the
Advanced Encryption Standard (AES) as defined in the
U.S. Federal Information Processing Standards
Publication 197 [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ][
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. As we cannot maintain the encryption
ordering in the later decryption step, the electronic code
book (ECB) mode is used. To avoid that equal plain text
blocks are encrypted into identical ciphertext blocks, we
salt the identifiers by adding random bits to them at fixed
positions. The Anonymizer supports AES-128, AES-192
and AES-256 encryption. The respective key of 16, 24
or 32 bytes can be provided in form of an input file. If
no (valid) keystore file is given, the Anomizer reads 32
bytes from /dev/random and uses them as the key. Note
that the encrypted identifier is not directly written to the
output file as it may include characters which render
it invalid for later processing. Therefore, the encrypted
data are base64-encoded before they are written to the
output file.
        </p>
      </sec>
      <sec id="sec-5-2">
        <title>B. Phase II: Hadoop</title>
        <p>The second phase computes the alignment of the reads with
respect to a given reference genome. It is the only one executed
in a shared computing center. When the data are sent to the
computing center and back, OpenSSH is used for encryption.
The computation is organized subject to the MapReduce model
and uses the Hadoop framework with YARN.</p>
        <p>Each work package generated in Phase I is processed by
a separate Hadoop job. As explained in Section IV, the
Map function computes the alignments for all reads using
the Burrows-Wheeler Aligner and sorts the output reads by
their position in the reference genome using the SAMtools
utilities. After that, the Hadoop partitioner partitions the reads
for the Reduce function. Our partitioner partitions them by
their chromosomes and sends them to the reducers where
each reducer represents one chromosome. The actual Reduce
function outputs the data without further processing. For each
chromosome one output file is created.</p>
        <p>Considered in more detail, a map task runs three instances
of the Burrows-Wheeler Aligner bwa. Assuming paired-ended
(single-ended) reads, it performs two bwa aln calls which
align the reads and one bwa sampe (bwa samse) call
which transforms the results to SAM format.</p>
        <p>To further improve security, the YARN environment is
configured to use only local RAM disks of the compute
nodes. Hence, no read is stored on persistent storage during
computation.</p>
      </sec>
      <sec id="sec-5-3">
        <title>C. Phase III: Decrypt &amp; Unmix</title>
        <p>The Decrypt &amp; Unmix stage processes the output
(chromosome) files of the Hadoop jobs. Our tool generates one output
file for every patient / probe and work package, loops through
all reads in the input, decrypts their identifiers and sorts them
in the right output files. Fake reads are discarded.</p>
        <p>Input files are identified by the regular expression
.+_chr.+, and all other files in the input directory are
ignored. It is assumed that the input files are in SAM format,
and missing header lines are tolerated. If the tool finds header
lines in one of the input files, it copies them to an additional
output file named “header”. The ordering of the header lines
is not changed during this process. The output files are
also in SAM format (but without header lines). The name
of an output file specifies a patient / probe name, a work
package name and a chromosome and follows the pattern
&lt;probe&gt;_&lt;workpackage&gt;_&lt;chromosome&gt;.</p>
        <p>The decrypt &amp; unmix tool iterates over the input files and,
in each file, over all reads in the given order. For each read,
it decrypts its identifier and simply appends the read to the
output file of the respective patient / probe, work package and
chromosome. If there is no such file, it is created.</p>
      </sec>
      <sec id="sec-5-4">
        <title>D. Phase IV: Merger</title>
        <p>The final step in the pipeline is the Merger which merges
the output files of Phase III such that there is one single file
for each patient / probe.</p>
        <p>The input has one file for each combination of patient /
probe, work package and chromosome. Each file is a list of
reads sorted by their position within the chromosome. Hence,
it remains to join the data for each patient / probe in one file
using a merge operation similar to the one of the merge sort
algorithm. The Merger uses the regular expression .+_wp.+
to identify input files. The prefix before the underscore is the
identifier of the patient / probe.</p>
        <p>The Merger must wait for Phase III to finish and can
therefore not run parallel to it. It iterates over the input files
in chromosome order and, in each file, over all reads in the
given order. For each patient, it first merges the data from
all her files belonging to chromosome 1, then from all files
belonging to chromosome 2 and so on. In each step, it selects
the read with the lowest position and adds it the output file of
the according patient.</p>
        <p>While the entries in a chromosome file are already sorted
by their mapping position in the reference genome, the
chromosome order must be established using the file “header”,
provided that it was created in Phase III, or the reference
genome’s annotation file (*.ann) or sequence dictionary
(*.dict). If none of these files exist, the tool aborts.</p>
        <p>The Merger uses heuristics and parameters to estimate the
number of merges that can be performed in parallel and
reserves CPU cores and memory accordingly.</p>
      </sec>
      <sec id="sec-5-5">
        <title>E. Pipeline Usage</title>
        <p>Before the pipeline process is started, the user has to define
input and output directory and provide the data. Once started,
the pipeline does not require any interventions from the user.
So, the steps are:
1) Define workspace directory (&lt;workspace&gt;) and input
directory (e.g., &lt;workspace&gt;/input).
2) Copy the configuration file to &lt;workspace&gt;/input
and modify it if necessary.
3) Call the starter script: starter -i &lt;workspace&gt;/
input -w &lt;workspace&gt;
4) Wait for the pipeline to finish.
5) Obtain results from &lt;workspace&gt;/results.</p>
        <p>The configuration specified in the configuration file is
related to the cluster side and can be determined by the
provider: Pipeline / cluster settings (size of work packages,
#nodes, #threads, #pseudo samples, host name, user group
on host, batch queue on host etc.), tool settings (Hadoop,
BWA, SAMtools and Java) and the location of additional data
(reference genome, pseudo genome data).</p>
        <p>During the run, the pipeline creates a heavy IO workload
in the workspace directory. Hence, it is advisable to put the
workspace on a strong storage backend.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>VI. EVALUATION</title>
      <p>In this section we evaluate whether the system meets the five
requirements of Section II. The first and the second one are
obviously fulfilled because we perform all costly operations
on a public HPC cluster and we do this without storing any
data on disk.</p>
      <p>The third requirement is met because the data are during
transmission to and from the cluster. On the cluster nodes
themselves the sequences are readable, but for processing they
have to be.</p>
      <p>
        The identifiers of the patients / probes, however, remain
encrypted even on the cluster nodes which helps with the
fourth requirement which is a bit more tricky. While it is not
possible to read the identifier without breaking the encryption,
it might be possible to identify a person in the mix by reading
sufficiently many sequences, mapping them to an old reference
genome of that person (provided that the attacker has one) and
computing the statistical probability for that person to be in
the mix. Chen et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and Homer et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] showed that
this is possible to a certain degree utilizing the existence of
rare SNPs, but that it becomes more and more difficult the
more other reads are in the mix. Unfortunately, there is no
precise analysis of how many and what data the mix has to be
salted with to render this attack ineffective. In our approach
we usually at least double the amount of data and pick reads
with rare SNPs so that the patients’ peculiarities do not stand
out.
      </p>
      <p>Concerning the quality of the encryption, we pointed out
before that we have to use AES in ECB mode, but that we
avoid identical ciphertext blocks in the case of identical plain
text blocks by adding random bits to the plain text. For this
reason, the encryption techniques, AES and OpenSSH, can be
assumed to be strong.</p>
      <p>For analyzing the fifth requirement, we ran tests in an HPC
cluster. The “secure environment” is a remote computer, but
connected via a high-speed network link (10 GB/s). Figure 2
shows processing times of the pipeline for an input data set
of 237M short reads (100 GB) and different cluster sizes,
each node having 64 cores and 128 GB memory. The left
plot indicates that the processing scales extremely well. The
running time is roughly quartered when the number of nodes
is quadrupled.</p>
      <p>The right figure shows the relative processing times of the
work packages. For a small number of nodes, the upload to the
cluster dominates the total time because the cluster nodes can
store only a limited number of work packages in their memory
and the packages must remain in the secure area until space
in the cluster nodes is freed. For a large number of nodes,
however, almost all the time is spent on processing, and the
overhead of using an outside cluster is neglectable.</p>
      <p>In both figures, one can see that the runtimes of Phase I
(Anonymizer) and III (Decrypt &amp; Unmix) are relatively short
so that the overhead due to anonymizing / encryption and
deanonymizing / decryption is acceptable. Note that since
work package processing starts as soon as the first work
package arrives at the cluster, the runtimes of the Anonymizer
and the work package processing also overlap.</p>
    </sec>
    <sec id="sec-7">
      <title>VII. CONCLUSION</title>
      <p>In this work we have presented a system for processing
sensitive genome data in a public environment without harming
the privacy of patients. Our secure genome processing pipeline
is already used for processing data of real cancer patients in
a university’s HPC cluster which is shared with many other
users. As shown in our evaluation, the system’s processing
performance is very good and scales excellently.</p>
    </sec>
    <sec id="sec-8">
      <title>ACKNOWLEDGMENT</title>
      <p>This work was supported by the German Federal Ministry
of Education and Research (BMBF) under grant 131A029D
(Project “CI3”).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <article-title>[1] “Announcing the advanced encryption standard (aes</article-title>
          ),
          <source>” Federal Information Processing Standards Publication</source>
          <volume>197</volume>
          ,
          <string-name>
            <surname>Tech</surname>
          </string-name>
          . Rep.,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Abuín</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Pichel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. F.</given-names>
            <surname>Pena</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Amigo</surname>
          </string-name>
          , “Sparkbwa:
          <article-title>Speeding up the alignment of high-throughput dna sequencing data</article-title>
          ,”
          <source>PLoS ONE</source>
          , vol.
          <volume>11</volume>
          , no.
          <issue>5</issue>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Tang</surname>
          </string-name>
          , “
          <article-title>Large-scale privacypreserving mapping of human genomic sequences on hybrid clouds</article-title>
          ,” in NDSS,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Daemen</surname>
          </string-name>
          and
          <string-name>
            <given-names>V.</given-names>
            <surname>Rijmen</surname>
          </string-name>
          ,
          <source>The Design of Rijndael</source>
          . Secaucus, NJ, USA: Springer-Verlag New York, Inc.,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghemawat</surname>
          </string-name>
          , “Mapreduce:
          <article-title>Simplified data processing on large clusters,” Commun</article-title>
          . ACM, vol.
          <volume>51</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>107</fpage>
          -
          <lpage>113</lpage>
          , Jan.
          <year>2008</year>
          . [Online]. Available: http://doi.acm.
          <source>org/10</source>
          .1145/1327452.1327492
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>B. J. Fikes.</surname>
          </string-name>
          (
          <year>2017</year>
          )
          <article-title>New Machines Can Sequence Human Genome in One Day</article-title>
          . [Online]. Available: http://www.sci
          <article-title>-tech-today.com/news/Genome-Sequencing-in-One-Day/ story</article-title>
          .xhtml?story\_id=023001ATQM02
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>G. S.</given-names>
            <surname>Ginsburg</surname>
          </string-name>
          and
          <string-name>
            <given-names>H. F.</given-names>
            <surname>Willard</surname>
          </string-name>
          , Eds., Genomic and Personalized
          <string-name>
            <surname>Medicine (Second Edition</surname>
          </string-name>
          ), second edition ed. Academic Press,
          <year>2013</year>
          . [Online]. Available: http://www.sciencedirect.com/science/article/ pii/B978012382227700121X
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Heather</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Chain</surname>
          </string-name>
          , “
          <article-title>The sequence of sequencers: The history of sequencing {DNA}</article-title>
          ,
          <source>” Genomics</source>
          , vol.
          <volume>107</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          ,
          <year>2016</year>
          . [Online]. Available: http://www.sciencedirect.com/science/article/ pii/S0888754315300410
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>N.</given-names>
            <surname>Homer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Szelinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Redman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Duggan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Tembe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Muehling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. V.</given-names>
            <surname>Pearson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Stephan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. F.</given-names>
            <surname>Nelson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D. W.</given-names>
            <surname>Craig</surname>
          </string-name>
          , “
          <article-title>Resolving individuals contributing trace amounts of dna to highly complex mixtures using high-density snp genotyping microarrays</article-title>
          ,
          <source>” PLoS Genet</source>
          , vol.
          <volume>4</volume>
          , no.
          <issue>8</issue>
          , pp.
          <fpage>1</fpage>
          -
          <issue>9</issue>
          ,
          <year>08 2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kreiter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vormehr</surname>
          </string-name>
          , N. van de Roemer,
          <string-name>
            <given-names>M.</given-names>
            <surname>Diken</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Löwer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Diekmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Boegel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Schrörs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Vascotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Castle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. D.</given-names>
            <surname>Tadmor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. P.</given-names>
            <surname>Schoenberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Huber</surname>
          </string-name>
          , Özlem Türeci, and U. Sahin, “
          <article-title>Mutant mhc class ii epitopes drive therapeutic immune responses to cancer</article-title>
          ,”
          <source>Nature</source>
          , vol.
          <volume>520</volume>
          , pp.
          <fpage>692</fpage>
          -
          <lpage>696</lpage>
          , Mar.
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S. T.</given-names>
            <surname>Krishnan</surname>
          </string-name>
          and
          <string-name>
            <given-names>J. U.</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          ,
          <article-title>Building Your Next Big Thing with Google Cloud Platform: A Guide for Developers and Enterprise Architects</article-title>
          , 1st ed. Berkely, CA, USA: Apress,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>B.</given-names>
            <surname>Langmead</surname>
          </string-name>
          and
          <string-name>
            <given-names>S. L.</given-names>
            <surname>Salzberg</surname>
          </string-name>
          , “
          <article-title>"fast gapped-read alignment with bowtie 2.",” Nature methods</article-title>
          , vol.
          <volume>9</volume>
          , no.
          <issue>4</issue>
          , p.
          <fpage>357</fpage>
          -
          <lpage>359</lpage>
          ,
          <year>Mar 2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>B.</given-names>
            <surname>Langmead</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Trapnell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pop</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S. L.</given-names>
            <surname>Salzberg</surname>
          </string-name>
          , “
          <article-title>Ultrafast and memory-efficient alignment of short dna sequences to the human genome,” Genome Biology</article-title>
          , vol.
          <volume>10</volume>
          , no.
          <issue>3</issue>
          , p.
          <fpage>R25</fpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Durbin</surname>
          </string-name>
          , “
          <article-title>Fast and accurate short read alignment with Burrows-Wheeler transform</article-title>
          ,” Bioinformatics, vol.
          <volume>25</volume>
          , no.
          <issue>14</issue>
          , pp.
          <fpage>1754</fpage>
          -
          <lpage>1760</lpage>
          ,
          <year>Jul 2009</year>
          . [Online]. Available: http://dx.doi.org/10.1093/ bioinformatics/btp324
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15] --, “
          <article-title>"fast and accurate long-read alignment with burrows-wheeler transform.",”</article-title>
          <string-name>
            <surname>Bioinformatics</surname>
          </string-name>
          , vol.
          <volume>26</volume>
          , no.
          <issue>5</issue>
          , p.
          <fpage>589</fpage>
          -
          <lpage>595</lpage>
          ,
          <year>Mar 2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Handsaker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wysoker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Fennell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ruan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Homer</surname>
          </string-name>
          , G. Marth, G. Abecasis, and
          <string-name>
            <given-names>R.</given-names>
            <surname>Durbin</surname>
          </string-name>
          , “The Sequence Alignment/Map Format and SAMtools,” Bioinformatics, vol.
          <volume>25</volume>
          , no.
          <issue>16</issue>
          , pp.
          <fpage>2078</fpage>
          -
          <lpage>2079</lpage>
          , Aug.
          <year>2009</year>
          . [Online]. Available: http://dx.doi.org/10.1093/ bioinformatics/btp352
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <article-title>“Long read alignment based on maximal exact match seeds</article-title>
          ,
          <source>” Bioinformatics</source>
          , vol.
          <volume>28</volume>
          , no.
          <issue>18</issue>
          , pp.
          <fpage>i318</fpage>
          -
          <lpage>i324</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D. L.</given-names>
            <surname>Maskell</surname>
          </string-name>
          , “
          <article-title>Cushaw: a cuda compatible short read aligner to large genomes based on the burrows-wheeler transform</article-title>
          ,
          <source>” Bioinformatics</source>
          , vol.
          <volume>28</volume>
          , no.
          <issue>14</issue>
          , p.
          <year>1830</year>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>S.</given-names>
            <surname>Marguerat</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Bähler</surname>
          </string-name>
          , “
          <article-title>Rna-seq: from technology to biology</article-title>
          .”
          <source>Cell Mol Life Sci</source>
          , vol.
          <volume>67</volume>
          , no.
          <issue>4</issue>
          , pp.
          <fpage>569</fpage>
          -
          <lpage>579</lpage>
          ,
          <year>Feb 2010</year>
          . [Online]. Available: http://dx.doi.org/10.1007/s00018-009-0180-6
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>S.</given-names>
            <surname>Moorthie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Mattocks</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. F.</given-names>
            <surname>Wright</surname>
          </string-name>
          , “
          <article-title>Review of massively parallel dna sequencing technologies</article-title>
          ,”
          <source>Hugo Journal</source>
          , vol.
          <volume>5</volume>
          , no.
          <issue>1-4</issue>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>R.</given-names>
            <surname>Nadipalli</surname>
          </string-name>
          , HDInsight Essentials, 2nd ed.
          <source>Packt Publishing</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>K.</given-names>
            <surname>Shvachko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Radia</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Chansler</surname>
          </string-name>
          , “
          <article-title>The hadoop distributed file system</article-title>
          ,”
          <source>in Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST)</source>
          ,
          <article-title>ser</article-title>
          .
          <source>MSST '10</source>
          . Washington, DC, USA: IEEE Computer Society,
          <year>2010</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          . [Online]. Available: http://dx.doi.org/10.1109/MSST.
          <year>2010</year>
          . 5496972
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>A.</given-names>
            <surname>Singh</surname>
          </string-name>
          and
          <string-name>
            <given-names>V.</given-names>
            <surname>Rayapati</surname>
          </string-name>
          ,
          <article-title>Learning Big Data with Amazon Elastic MapReduce</article-title>
          . Packt Publishing,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>M.</given-names>
            <surname>Snir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. W.</given-names>
            <surname>Otto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. W.</given-names>
            <surname>Walker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dongarra</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Huss-Lederman</surname>
          </string-name>
          ,
          <article-title>MPI: The Complete Reference</article-title>
          . Cambridge, MA, USA: MIT Press,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>V. K.</given-names>
            <surname>Vavilapalli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Murthy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Douglas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Konar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Evans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Graves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lowe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Seth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Saha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Curino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. O</given-names>
            <surname>'Malley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Radia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Reed</surname>
          </string-name>
          , and E. Baldeschwieler, “
          <article-title>Apache hadoop yarn: Yet another resource negotiator,” in Proceedings of the 4th Annual Symposium on Cloud Computing, ser</article-title>
          .
          <source>SOCC '13</source>
          . New York, NY, USA: ACM,
          <year>2013</year>
          , pp.
          <volume>5</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          :
          <fpage>16</fpage>
          . [Online]. Available: http://doi.acm.
          <source>org/10</source>
          .1145/2523616.2523633
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>T.</given-names>
            <surname>White</surname>
          </string-name>
          ,
          <string-name>
            <surname>Hadoop: The Definitive Guide. O'Reilly Media</surname>
          </string-name>
          , Inc.,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zaharia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chowdhury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Franklin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shenker</surname>
          </string-name>
          ,
          <string-name>
            <surname>and I. Stoica</surname>
          </string-name>
          , “Spark:
          <article-title>Cluster computing with working sets,”</article-title>
          <source>in Proceedings of the 2Nd USENIX Conference on Hot Topics in Cloud Computing, ser.</source>
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <source>HotCloud'10</source>
          . Berkeley, CA, USA: USENIX Association,
          <year>2010</year>
          , pp.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          10-
          <fpage>10</fpage>
          . [Online]. Available: http://dl.acm.org/citation.cfm?id=
          <fpage>1863103</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>