<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Secure Genome Processing in Public Cloud and HPC Environments</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">André</forename><surname>Brinkmann</surname></persName>
							<email>brinkman@uni-mainz.de</email>
							<affiliation key="aff0">
								<orgName type="department">Zentrum für Datenverarbeitung</orgName>
								<orgName type="institution">JGU Mainz</orgName>
								<address>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Jürgen</forename><surname>Kaiser</surname></persName>
							<email>kaiserj@uni-mainz.de</email>
							<affiliation key="aff0">
								<orgName type="department">Zentrum für Datenverarbeitung</orgName>
								<orgName type="institution">JGU Mainz</orgName>
								<address>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Martin</forename><surname>Löwer</surname></persName>
							<email>martin.loewer@tron-mainz.de</email>
							<affiliation key="aff1">
								<orgName type="department">Translational Oncology</orgName>
								<orgName type="institution">JGU Mainz (TRON gGmbH)</orgName>
								<address>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Lars</forename><surname>Nagel</surname></persName>
							<email>nagell@uni-mainz.de</email>
							<affiliation key="aff0">
								<orgName type="department">Zentrum für Datenverarbeitung</orgName>
								<orgName type="institution">JGU Mainz</orgName>
								<address>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Ugur</forename><surname>Sahin</surname></persName>
							<email>ugur.sahin@tron-mainz.de</email>
							<affiliation key="aff1">
								<orgName type="department">Translational Oncology</orgName>
								<orgName type="institution">JGU Mainz (TRON gGmbH)</orgName>
								<address>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Tim</forename><surname>Süß</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Zentrum für Datenverarbeitung</orgName>
								<orgName type="institution">JGU Mainz</orgName>
								<address>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Secure Genome Processing in Public Cloud and HPC Environments</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">4FB93DB69321B018CFAFB1BDA3E22780</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T10:16+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>genome sequencing</term>
					<term>data security</term>
					<term>MapReduce</term>
					<term>Hadoop</term>
					<term>pipeline architecture</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Aligning next-generation sequencing data requires significant compute resources. HPC and cloud systems can provide sufficient compute capacity, but do not offer the required data security guarantees. HPC environments are typically designed for many groups of trusted users and often only include minimal security enforcement, while cloud environments are mostly under the control of untrusted entities and companies.</p><p>In this work we present a scalable pipeline approach that enables the use of public cloud and HPC environments, while protecting the patients' privacy. The applied techniques include adding noisy data, cryptography, and a MapReduce program for the parallel processing of data.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. INTRODUCTION</head><p>High-throughput sequencing, also known as next-generation sequencing (NGS), is a revolutionary technology that enables the sequencing of entire genomes in a matter of hours. Common techniques like sequencing-by-synthesis require a lot of computational power for post-processing the genome data, that is, for aligning them to a reference sequence / genome and assembling them according to this mapping.</p><p>NGS techniques produce large amounts of genome data in the form of short reads (about 50 to 300 bases) which usually need to be aligned to a reference genome <ref type="bibr" target="#b20">[20]</ref>. NGS has not only revolutionized biological and medical research <ref type="bibr" target="#b19">[19]</ref>, but also offers opportunities for the treatment of diseases. The implementation of personalized medicine <ref type="bibr" target="#b10">[10]</ref>, for example, requires the analysis of human genomes at a large scale and involves sequencing genome data from thousands of individuals per year at one facility <ref type="bibr" target="#b7">[7]</ref>. The alignment of such an amount of data is only possible by utilizing the processing power of large computing centers.</p><p>Current solutions typically do not involve public cloud computing or academic high-performance computing (HPC) systems, but rely on expensive in-house facilities to ensure the privacy of the data. The reason is that the patients' data could be hijacked in public computing environments. HPC systems are shared by many groups of trusted users, and data security has a low priority. Cloud environments, on the other hand, may provide slightly better data security, but they are mostly under the control of untrusted entities and companies. 
Hence, to exploit external computing clusters, it is necessary to make the transfer of sensitive genome data and their processing secure, without impacting performance too much.</p><p>In this paper we present a solution to this problem which outsources the computations to public multi-user HPC or cloud facilities while ensuring data security and fast processing. The techniques applied include adding noise, cryptography, and a novel method for the parallel processing of genome data of many patients. We argue that the system is not only inexpensive but also secure, and show that the system's performance is only slightly degraded by our security measures.</p><p>The remainder of the paper is structured as follows: In Section II we describe the scenario in more detail and derive five requirements. In Section III we discuss tools and solutions from the literature. In Section IV we outline our approach before we give a more detailed description of our pipeline architecture in Section V. In the evaluation in Section VI we check whether the requirements are fulfilled. Finally, we conclude the paper in Section VII.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. SCENARIO &amp; REQUIREMENTS</head><p>Every day in biological and medical research, large numbers of genomes are processed to determine their nucleotide sequences. Besides research, genome sequencing is vital for the personalized medicine of the present and future <ref type="bibr" target="#b10">[10]</ref>, for example, for individualized immunotherapies which were considered in the CI3 project <ref type="foot" target="#foot_0">1</ref> . The techniques of high-throughput sequencing or next-generation sequencing have accelerated the process and made it possible to sequence the entire human genome in one day for less than $1000 <ref type="bibr" target="#b8">[8]</ref> <ref type="bibr" target="#b6">[6]</ref>.</p><p>An important part of the process is performed on high-performance computers. Since sequencers produce their (digitized) output in the form of short reads (i.e., sequences), the reads need to be arranged and interpreted. This is done by aligning the reads to a reference genome, which is a very costly operation requiring a lot of computational power. At the same time, the data are highly confidential and must not be vulnerable to attackers. It is therefore not possible to simply transmit the data to a public cloud or high-performance computing environment. Yet, on the other hand, it is too costly for many institutes to forgo such cheap solutions and install a high-performance computing (HPC) facility in-house. For this reason, we consider the scenario in which a public HPC cluster is used, but under the condition that the data are secure at all times: an attacker might steal information, but must not be able to identify the person that these data belong to. We assume that the sequencing has already been performed and that the short reads are provided as files. We regard the environment where the data originate (e.g., some institute) as a secure environment with strict access rights enforcement. 
Thus, we only consider the transmission to and the processing at the HPC cluster as vulnerable. It can be accessed by potentially malicious users, and no data being processed or stored there can be deemed secure. So, malicious users may read the data and try to identify patients, either by reading the identifiers of the sequences or by aligning the data to a (not necessarily complete) genome of the patient that they obtained prior to the attack.</p><p>Altogether we have the following requirements:</p><p>1) The computationally expensive sequence alignment and assembly must be performed on a shared, potentially vulnerable HPC cluster.</p><p>2) The data should not be stored on disk in the cluster and must be completely deleted once the computation is finished. 3) Wherever possible, the data must be secured using strong encryption algorithms. 4) While it cannot be guaranteed that data are not intercepted during computation in a public environment, it must not be possible to identify persons or probes based on the data. 5) There must be no significant degradation of the computational performance, for example, by using encryption or other means to provide data security. In our use case we considered a particular system where the data are generated by Illumina sequencers which produce reads of roughly 50 to 300 nucleotides. The design and implementation of the system were influenced by this use case, but most of the descriptions and results are valid for other types of sequencers as well.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. RELATED WORK</head><p>For processing big data, Hadoop <ref type="bibr" target="#b26">[26]</ref> is one of the best-known MapReduce frameworks. Embarrassingly parallel jobs can be easily mapped onto this framework. In a MapReduce framework, tasks are mapped onto nodes for processing, and the intermediate results are then reduced to final results <ref type="bibr" target="#b4">[5]</ref>. Hadoop consists of three main components: the MapReduce framework itself, the resource manager YARN <ref type="bibr" target="#b25">[25]</ref>, and the distributed file system HDFS <ref type="bibr" target="#b22">[22]</ref>. The HDFS file system is distributed over the complete Hadoop cluster and stores the job-related data. The resource manager YARN allows applications to be executed in a distributed fashion on a Hadoop cluster and reserves the required resources.</p><p>There are many alternatives to Hadoop. In HPC systems, one usually uses MPI for job parallelization <ref type="bibr" target="#b24">[24]</ref>. Big vendors like Microsoft, Amazon or Google offer infrastructures for processing big data in the cloud <ref type="bibr" target="#b21">[21]</ref>[23] <ref type="bibr" target="#b11">[11]</ref>.</p><p>In recent years many tools for short-read alignment have been developed. The first ones focused on reads that consist of at most 100 nucleotide bases. Due to the small read size it was assumed that the quality / correctness of the sequences was high. Examples of these first-generation tools are Bowtie <ref type="bibr" target="#b13">[13]</ref>, Burrows-Wheeler Aligner <ref type="bibr" target="#b14">[14]</ref> and CUSHAW <ref type="bibr" target="#b18">[18]</ref>. In our environment we use the Burrows-Wheeler Aligner in combination with SAMtools <ref type="bibr" target="#b16">[16]</ref>.</p><p>New NGS sequencers are able to generate reads with longer base sequences. 
One drawback of these reads is that they contain more errors, which the alignment tool must compensate for. Aligners that follow the seed-and-extend paradigm can deal with an increased number of incorrect bases. Examples are BWA-SW <ref type="bibr" target="#b15">[15]</ref>, Bowtie2 <ref type="bibr" target="#b12">[12]</ref> or CUSHAW2 <ref type="bibr" target="#b17">[17]</ref>.</p><p>There are some approaches that use big data techniques for their short-read alignments. Similar to our approach, Abuín et al. use the MapReduce framework Apache Spark <ref type="bibr" target="#b27">[27]</ref> for parallelizing the Burrows-Wheeler Aligner <ref type="bibr" target="#b1">[2]</ref>. However, they do not secure the processed data, which precludes the use of insecure shared environments. Chen et al. presented another approach based on the seed-and-extend paradigm that uses public and private computing clouds for their computations <ref type="bibr" target="#b2">[3]</ref>. The public cloud is used to reduce the number of potential alignment positions while the reads' final positions are determined in the private cloud. In contrast to our approach, this method only aligns the reads of a single patient, whereas our approach processes the data of many patients in parallel.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IV. APPROACH</head><p>In this section we briefly describe the main ideas and techniques of our approach before we give a more detailed description of our architecture in the next section. As we stated in the requirements, the new system must guarantee data security and perform fast parallel alignment of sequencing data.</p><p>We use a pipeline architecture which we divide into four phases. The pre- and postprocessing Phases I, III and IV in the secure environment package and sort the data, but their main purpose is to secure the data in Phase II. Phase II is the critical one in which the data are sent to and processed on the shared HPC cluster. Throughout this phase the patient / probe identifiers of the reads are encrypted, which is possible as they are not needed for the alignment. The rest of the data is only encrypted during transmission, but vulnerable when it is processed.</p><p>Since it is possible to identify patients based on parts of their DNA even if the respective genome sequences are in a larger pool of reads <ref type="bibr" target="#b2">[3]</ref>[9], we additionally confuse potential attackers by mixing large amounts of reads, salting the mix with additional fake ones and randomizing the order of all reads. If the noisy sequences are chosen cleverly, it is much more difficult for attackers to identify patients by applying statistical methods. In our pipeline, fake data is randomly picked from a large, fixed set of genome data.</p><p>For data security, we finally ensure that data are never stored in persistent storage in the cluster to reduce the number of side channels that could be used by attackers.</p><p>With respect to performance, we exploit the fact that sequences can be aligned independently of each other and use a parallel MapReduce program. 
The principle of MapReduce fits our scenario very well <ref type="bibr" target="#b4">[5]</ref>: the "Map" step aligns the sequences and sorts the aligned reads, the "Reduce" step collects and outputs the data chromosome-wise. We chose Hadoop (with YARN) as the MapReduce framework. For the alignment we use the Burrows-Wheeler Aligner (bwa) <ref type="bibr" target="#b14">[14]</ref> and for sorting the SAMtools tool suite <ref type="bibr" target="#b16">[16]</ref>.</p><p>The file formats that we use are a result of the design / software decisions. FASTQ is the format of the files produced by Illumina sequencers. The Sequence Alignment/Map (SAM) format "is a generic alignment format for storing read alignments against reference sequences" and the Binary Alignment/Map (BAM) format is the binary representation of it <ref type="bibr" target="#b16">[16]</ref>. Both formats are supported by the Burrows-Wheeler Aligner and the SAMtools.</p></div>
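The map / partition / reduce split described above can be sketched in a few lines of plain Python. This is a toy stand-in only: the real pipeline runs bwa and SAMtools inside Hadoop map tasks, so the "alignment" below is faked by a lookup table, and all names are illustrative.

```python
def map_align(reads):
    """'Map' step: align each read (stubbed here as a (chrom, pos) lookup)
    and sort the aligned reads by position within the work package."""
    aligned = [(chrom, pos, rid) for rid, (chrom, pos) in reads.items()]
    return sorted(aligned, key=lambda r: (r[0], r[1]))

def partition_by_chromosome(aligned):
    """Partitioner: route each aligned read to the reducer for its chromosome."""
    parts = {}
    for chrom, pos, rid in aligned:
        parts.setdefault(chrom, []).append((pos, rid))
    return parts

def reduce_output(parts):
    """'Reduce' step: emit one output list per chromosome, without further
    processing (the real reducer writes one file per chromosome)."""
    return {chrom: reads for chrom, reads in parts.items()}

# A toy "alignment table" standing in for bwa's output.
reads = {"r1": ("chr2", 500), "r2": ("chr1", 100), "r3": ("chr1", 42)}
out = reduce_output(partition_by_chromosome(map_align(reads)))
```

Because alignment of one read never depends on another, the map tasks parallelize without coordination; only the chromosome-wise regrouping needs a shuffle.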
<div xmlns="http://www.tei-c.org/ns/1.0"><head>V. PIPELINE ARCHITECTURE</head><p>This section describes our pipeline architecture, which we restrict to the computational part; i.e., we ignore the data acquisition by the sequencers and only consider the process between the point at which the input data in the form of short reads are gathered on computers in the safe environment and the point at which the aligned reads are written back to the safe environment. We assume that the short reads are contained in FASTQ files and that the results are returned in BAM format. Hence, our solution consists of a pipeline that takes FASTQ files as input and outputs the aligned reads as BAM files.</p><p>The pipeline is shown in Figure <ref type="figure" target="#fig_0">1</ref> and can be divided into four phases. Each phase takes the output data of the previous one as input and creates a new output. For security and performance reasons, the input data are first anonymized and divided into work packages. In Phase II, each work package is processed in its own Hadoop job in an HPC system. Back in the secure area, the output is de-anonymized and reordered in the Decrypt &amp; Unmix phase. Finally, the Merger merges all files of a patient or probe into one final output file.</p><p>A detailed explanation of the individual steps is given in the following subsections. The last subsection describes what the user has to do to trigger the pipeline process.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Phase I: Anonymizer</head><p>In the first phase, the Anonymizer reads the input data, consisting of single- or paired-ended reads of many patients, from local storage in the secure area, adds encrypted identifiers, salts the input with fake data, and creates work packages (wp) of randomly mixed reads that are written back to local storage. In the next paragraphs we describe these tasks in more detail.</p><p>The input is given as files in FASTQ format which can be gzip-compressed. If compressed, the file names must end with .gz. The Anonymizer can handle single-ended as well as paired-ended reads. It looks for single-ended input files using the regular expression .+.fastq.* and for paired-ended input files using .+_[1|2].fastq.*. Two paired-ended input files are assumed to belong together if their prefixes before _1 and _2, respectively, are the same. The shared prefix is then used as the identifier or probe name. The probe name of an input file with single-ended reads is the file name without the .fastq ending.</p><p>For later processing stages, the patient / probe of each read must be retraceable because the Anonymizer mixes reads from several patients / probes into the same work package. If this information were not added, the later stages could not map the reads back to their respective patients / probes. For this reason, the Anonymizer augments each read (pair) with a patient / probe identifier, which is then piggybacked through the following stages. This is done before encryption. The Anonymizer adds the identifier to the first line of a read entry. More precisely, it prepends the patient / probe identifier to the existing read identifier, separating both by a #. 
If, for example, the first line reads @foo and if the patient ID is Pa, the concatenation becomes @Pa#foo.</p><p>The anonymization is achieved by combining the following techniques:</p><p>1) Salting: The actual data consisting of many different probes are salted by throwing dummy reads into the mix. These dummy reads contain, for example, rare SNPs so that individual patients cannot be identified based on special genome segments. An attacker simply cannot distinguish between real and fake data; the fake data are processed like real data, but eventually filtered out in Phase III. 2) Mixing: Real and fake data are randomly assigned to the work packages which are sent to the cluster and which form the input of Phase II. This way, the attacker cannot infer the relationship of two reads from their positions in the work packages. 3) Encryption: The identifier is encrypted using the Advanced Encryption Standard (AES) as defined in the U.S. Federal Information Processing Standards Publication 197 <ref type="bibr">[1][4]</ref>. As we cannot maintain the encryption ordering in the later decryption step, the electronic code book (ECB) mode is used. To prevent identical plaintext blocks from being encrypted into identical ciphertext blocks, we salt the identifiers by adding random bits to them at fixed positions. The Anonymizer supports AES-128, AES-192 and AES-256 encryption. The respective key of 16, 24 or 32 bytes can be provided in the form of an input file. If no (valid) keystore file is given, the Anonymizer reads 32 bytes from /dev/random and uses them as the key. Note that the encrypted identifier is not directly written to the output file as it may include characters which render it invalid for later processing. Therefore, the encrypted data are base64-encoded before they are written to the output file.</p></div>
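The Anonymizer's tagging, salting and mixing can be sketched in plain Python. The helper names (tag_read, make_work_packages) are hypothetical, and toy_encrypt merely mimics the salted AES-ECB step described above by prepending two random salt bytes; it is NOT real encryption.

```python
import base64
import os
import random

def tag_read(probe, read_id, encrypt):
    """Prepend the encrypted, base64-encoded probe identifier to the read
    identifier, separated by '#' (plaintext form: "@Pa#foo")."""
    enc = base64.b64encode(encrypt(probe)).decode("ascii")
    return "@%s#%s" % (enc, read_id.lstrip("@"))

def make_work_packages(real_reads, fake_read_ids, wp_size, encrypt):
    """Salt the real reads with fakes, shuffle everything (Mixing), and cut
    the result into work packages of at most wp_size reads."""
    mixed = [tag_read(probe, rid, encrypt) for probe, rid in real_reads]
    mixed += [tag_read("FAKE", rid, encrypt) for rid in fake_read_ids]
    random.shuffle(mixed)
    return [mixed[i:i + wp_size] for i in range(0, len(mixed), wp_size)]

def toy_encrypt(probe):
    # Stand-in for salted AES-ECB: two random salt bytes play the role of the
    # "random bits at fixed positions". Do not use as real encryption.
    return os.urandom(2) + probe.encode("utf-8")

wps = make_work_packages([("Pa", "@foo"), ("Pb", "@bar")], ["@dummy1"], 2, toy_encrypt)
```

Note that base64's alphabet contains no '#', so the separator between the encrypted probe identifier and the read identifier stays unambiguous.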
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Phase II: Hadoop</head><p>The second phase computes the alignment of the reads with respect to a given reference genome. It is the only one executed in a shared computing center. When the data are sent to the computing center and back, OpenSSH is used for encryption. The computation is organized according to the MapReduce model and uses the Hadoop framework with YARN.</p><p>Each work package generated in Phase I is processed by a separate Hadoop job. As explained in Section IV, the Map function computes the alignments for all reads using the Burrows-Wheeler Aligner and sorts the output reads by their position in the reference genome using the SAMtools utilities. After that, the Hadoop partitioner distributes the reads for the Reduce function: it partitions them by chromosome and sends them to the reducers, where each reducer represents one chromosome. The actual Reduce function outputs the data without further processing. For each chromosome one output file is created.</p><p>In more detail, for paired-ended reads a map task runs three instances of the Burrows-Wheeler Aligner bwa: two bwa aln calls which align the reads and one bwa sampe call which transforms the results to SAM format (for single-ended reads, one bwa aln and one bwa samse call).</p><p>To further improve security, the YARN environment is configured to use only local RAM disks of the compute nodes. Hence, no read is stored on persistent storage during computation.</p></div>
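The chromosome partitioner can be illustrated with a small sketch. Names are hypothetical (the real partitioner is a Hadoop class written against the Partitioner API), and the chromosome list is an assumed human reference layout: since each reducer represents one chromosome, the partition index is simply the chromosome's rank in a fixed ordering.

```python
# Assumed chromosome ordering of a human reference genome (illustrative).
CHROMS = ["chr%d" % i for i in range(1, 23)] + ["chrX", "chrY"]
REDUCER_OF = {c: i for i, c in enumerate(CHROMS)}

def partition(sam_line, num_reducers=len(CHROMS)):
    """Return the reducer index for one SAM record.

    RNAME, the reference (chromosome) name, is the third tab-separated
    field of a SAM alignment line."""
    rname = sam_line.split("\t")[2]
    return REDUCER_OF[rname] % num_reducers
```

With one reducer per chromosome, every reducer receives a position-sorted stream for exactly one chromosome and can emit its output file without extra sorting.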
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Phase III: Decrypt &amp; Unmix</head><p>The Decrypt &amp; Unmix stage processes the output (chromosome) files of the Hadoop jobs. Our tool generates one output file for every patient / probe and work package, loops through all reads in the input, decrypts their identifiers and sorts them into the right output files. Fake reads are discarded.</p><p>Input files are identified by the regular expression .+_chr.+, and all other files in the input directory are ignored. It is assumed that the input files are in SAM format, and missing header lines are tolerated. If the tool finds header lines in one of the input files, it copies them to an additional output file named "header". The ordering of the header lines is not changed during this process. The output files are also in SAM format (but without header lines). The name of an output file specifies a patient / probe name, a work package name and a chromosome and follows the pattern &lt;probe&gt;_&lt;workpackage&gt;_&lt;chromosome&gt;.</p><p>The decrypt &amp; unmix tool iterates over the input files and, in each file, over all reads in the given order. For each read, it decrypts its identifier and simply appends the read to the output file of the respective patient / probe, work package and chromosome. If there is no such file, it is created.</p></div>
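The routing logic of this phase can be sketched as follows. The names are illustrative: toy_decrypt stands in for the actual AES decryption of the base64-encoded, salted identifier, and an in-memory dict stands in for the per-(probe, work package, chromosome) output files.

```python
import base64

def toy_decrypt(enc_b64):
    # Inverse of a hypothetical salted encryption: base64-decode and drop
    # two salt bytes (the real pipeline decrypts AES here).
    return base64.b64decode(enc_b64)[2:].decode("utf-8")

def decrypt_and_unmix(sam_lines, workpackage):
    """Sort SAM records into per-(probe, work package, chromosome) outputs,
    decrypting the probe prefix of each read name and discarding fakes."""
    out = {}  # output file name -> list of SAM lines (stand-in for files)
    for line in sam_lines:
        qname, rest = line.split("\t", 1)
        enc_probe, read_id = qname.split("#", 1)
        probe = toy_decrypt(enc_probe)
        if probe == "FAKE":            # salted dummy reads are dropped here
            continue
        chrom = line.split("\t")[2]    # RNAME field of the SAM record
        name = "%s_%s_%s" % (probe, workpackage, chrom)
        # Restore the original read identifier before writing the record.
        out.setdefault(name, []).append(read_id + "\t" + rest)
    return out
```

Because the Hadoop output within each chromosome file is already position-sorted and this pass only appends in order, the per-probe output files stay sorted for the Merger.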
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Phase IV: Merger</head><p>The final step in the pipeline is the Merger which merges the output files of Phase III such that there is one single file for each patient / probe.</p><p>The input has one file for each combination of patient / probe, work package and chromosome. Each file is a list of reads sorted by their position within the chromosome. Hence, it remains to join the data for each patient / probe in one file using a merge operation similar to that of the merge sort algorithm. The Merger uses the regular expression .+_wp.+ to identify input files. The prefix before the underscore is the identifier of the patient / probe.</p><p>The Merger must wait for Phase III to finish and can therefore not run in parallel with it. It iterates over the input files in chromosome order and, in each file, over all reads in the given order. For each patient, it first merges the data from all their files belonging to chromosome 1, then from all files belonging to chromosome 2 and so on. In each step, it selects the read with the lowest position and adds it to the output file of the corresponding patient.</p><p>While the entries in a chromosome file are already sorted by their mapping position in the reference genome, the chromosome order must be established using the file "header", provided that it was created in Phase III, or the reference genome's annotation file (*.ann) or sequence dictionary (*.dict). If none of these files exist, the tool aborts.</p><p>The Merger uses heuristics and parameters to estimate the number of merges that can be performed in parallel and reserves CPU cores and memory accordingly.</p></div>
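Since each input file is already position-sorted, the per-probe merge is a classic k-way merge. A sketch with hypothetical names, using Python's heapq.merge as the merge primitive:

```python
import heapq

def merge_probe(files, chrom_order):
    """Merge the per-(work package, chromosome) read lists of one probe into
    a single position-sorted stream, chromosome by chromosome.

    `files` maps (workpackage, chrom) -> list of (pos, read) sorted by pos."""
    merged = []
    for chrom in chrom_order:
        runs = [reads for (wp, c), reads in files.items() if c == chrom]
        # Each run is already position-sorted, so a k-way merge suffices,
        # exactly as in the merge step of merge sort.
        merged.extend(heapq.merge(*runs))
    return merged

# Toy input: one probe, two work packages, two chromosomes.
files = {
    ("wp1", "chr1"): [(10, "a"), (30, "c")],
    ("wp2", "chr1"): [(20, "b")],
    ("wp1", "chr2"): [(5, "d")],
}
result = merge_probe(files, ["chr1", "chr2"])
```

The chromosome order passed in corresponds to what the real Merger derives from the "header" file, the *.ann annotation or the *.dict sequence dictionary.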
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E. Pipeline Usage</head><p>Before the pipeline process is started, the user has to define the input and output directories and provide the data. Once started, the pipeline does not require any intervention from the user. So, the steps are:</p><p>1) Define workspace directory (&lt;workspace&gt;) and input directory (e.g., &lt;workspace&gt;/input). 2) Copy the configuration file to &lt;workspace&gt;/input and modify it if necessary. 3) Call the starter script: starter -i &lt;workspace&gt;/input -w &lt;workspace&gt; 4) Wait for the pipeline to finish. 5) Obtain results from &lt;workspace&gt;/results. The configuration specified in the configuration file is related to the cluster side and can be determined by the provider: pipeline / cluster settings (size of work packages, #nodes, #threads, #pseudo samples, host name, user group on host, batch queue on host etc.), tool settings (Hadoop, BWA, SAMtools and Java) and the location of additional data (reference genome, pseudo genome data).</p><p>During the run, the pipeline creates a heavy I/O workload in the workspace directory. Hence, it is advisable to put the workspace on a fast storage backend.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VI. EVALUATION</head><p>In this section we evaluate whether the system meets the five requirements of Section II. The first and the second one are obviously fulfilled because we perform all costly operations on a public HPC cluster and we do this without storing any data on disk.</p><p>The third requirement is met because the data are encrypted during transmission to and from the cluster. On the cluster nodes themselves the sequences are readable, but they have to be for processing.</p><p>The identifiers of the patients / probes, however, remain encrypted even on the cluster nodes, which helps with the fourth requirement, which is a bit trickier. While it is not possible to read the identifier without breaking the encryption, it might be possible to identify a person in the mix by reading sufficiently many sequences, mapping them to an old reference genome of that person (provided that the attacker has one) and computing the statistical probability for that person to be in the mix. Chen et al. <ref type="bibr" target="#b2">[3]</ref> and Homer et al. <ref type="bibr" target="#b9">[9]</ref> showed that this is possible to a certain degree utilizing the existence of rare SNPs, but that it becomes more and more difficult the more other reads are in the mix. Unfortunately, there is no precise analysis of how many and what data the mix has to be salted with to render this attack ineffective. In our approach we usually at least double the amount of data and pick reads with rare SNPs so that the patients' peculiarities do not stand out.</p><p>Concerning the quality of the encryption, we pointed out before that we have to use AES in ECB mode, but that we avoid identical ciphertext blocks in the case of identical plaintext blocks by adding random bits to the plaintext. For this reason, the encryption techniques, AES and OpenSSH, can be assumed to be strong.</p><p>For analyzing the fifth requirement, we ran tests in an HPC cluster. 
The "secure environment" is a remote computer, but connected via a high-speed network link (10 GB/s). Figure <ref type="figure" target="#fig_1">2</ref> shows processing times of the pipeline for an input data set of 237M short reads (100 GB) and different cluster sizes, each node having 64 cores and 128 GB memory. The left plot indicates that the processing scales extremely well. The running time is roughly quartered when the number of nodes is quadrupled.</p><p>The right figure shows the relative processing times of the work packages. For a small number of nodes, the upload to the cluster dominates the total time because the cluster nodes can store only a limited number of work packages in their memory and the packages must remain in the secure area until space in the cluster nodes is freed. For a large number of nodes, however, almost all the time is spent on processing, and the overhead of using an outside cluster is negligible.</p><p>In both figures, one can see that the runtimes of Phase I (Anonymizer) and III (Decrypt &amp; Unmix) are relatively short so that the overhead due to anonymizing / encryption and de-anonymizing / decryption is acceptable. Note that since work package processing starts as soon as the first work package arrives at the cluster, the runtimes of the Anonymizer and the work package processing also overlap.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VII. CONCLUSION</head><p>In this work we have presented a system for processing sensitive genome data in a public environment without harming the privacy of patients. Our secure genome processing pipeline is already used for processing data of real cancer patients in a university's HPC cluster which is shared with many other users. As shown in our evaluation, the system's processing performance is very good and scales excellently.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 :</head><label>1</label><figDesc>Fig. 1: Overview of the pipeline.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Fig. 2:</head><label>2</label><figDesc>Fig. 2: Left: absolute runtimes of Anonymizer, Merger, and the Hadoop processing in between. Right: runtime for processing the work packages relative to other operations (DU: decrypt &amp; unmix).</figDesc></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">http://www.ci-3.de/en</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>ACKNOWLEDGMENT</head><p>This work was supported by the German Federal Ministry of Education and Research (BMBF) under grant 131A029D (Project "CI3").</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Announcing the Advanced Encryption Standard (AES)</title>
	</analytic>
	<monogr>
		<title level="j">Federal Information Processing Standards Publication</title>
		<imprint>
			<biblScope unit="volume">197</biblScope>
			<date type="published" when="2001">2001</date>
			<publisher>Tech. Rep</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">SparkBWA: Speeding up the alignment of high-throughput DNA sequencing data</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">M</forename><surname>Abuín</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">C</forename><surname>Pichel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">F</forename><surname>Pena</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Amigo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">PLoS ONE</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="issue">5</biblScope>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">Large-scale privacy-preserving mapping of human genomic sequences on hybrid clouds</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Peng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Tang</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2012">2012</date>
			<publisher>NDSS</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">The Design of Rijndael</title>
		<author>
			<persName><forename type="first">J</forename><surname>Daemen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Rijmen</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2002">2002</date>
			<publisher>Springer-Verlag New York, Inc</publisher>
			<pubPlace>Secaucus, NJ, USA</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">MapReduce: Simplified data processing on large clusters</title>
		<author>
			<persName><forename type="first">J</forename><surname>Dean</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ghemawat</surname></persName>
		</author>
		<idno type="DOI">10.1145/1327452.1327492</idno>
		<ptr target="http://doi.acm.org/10.1145/1327452.1327492" />
	</analytic>
	<monogr>
		<title level="j">Commun. ACM</title>
		<imprint>
			<biblScope unit="volume">51</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="107" to="113" />
			<date type="published" when="2008-01">Jan. 2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m">9th International Workshop on Science Gateways</title>
				<meeting><address><addrLine>IWSG</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017-06">June 2017</date>
			<biblScope unit="page" from="19" to="21" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">New Machines Can Sequence Human Genome in One Day</title>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">J</forename><surname>Fikes</surname></persName>
		</author>
		<ptr target="http://www.sci-tech-today.com/news/Genome-Sequencing-in-One-Day/story.xhtml?story_id=023001ATQM02" />
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<title level="m" type="main">Genomic and Personalized Medicine</title>
		<ptr target="http://www.sciencedirect.com/science/article/pii/B978012382227700121X" />
		<editor>G. S. Ginsburg and H. F. Willard</editor>
		<imprint>
			<date type="published" when="2013">2013</date>
			<publisher>Academic Press</publisher>
		</imprint>
	</monogr>
	<note>Second edition.</note>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">The sequence of sequencers: The history of sequencing DNA</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">M</forename><surname>Heather</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chain</surname></persName>
		</author>
		<ptr target="http://www.sciencedirect.com/science/article/pii/S0888754315300410" />
	</analytic>
	<monogr>
		<title level="j">Genomics</title>
		<imprint>
			<biblScope unit="volume">107</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="1" to="8" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays</title>
		<author>
			<persName><forename type="first">N</forename><surname>Homer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Szelinger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Redman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Duggan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Tembe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Muehling</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">V</forename><surname>Pearson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">A</forename><surname>Stephan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">F</forename><surname>Nelson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">W</forename><surname>Craig</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">PLoS Genet</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="issue">8</biblScope>
			<biblScope unit="page" from="1" to="9" />
			<date type="published" when="2008-08">Aug. 2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Mutant MHC class II epitopes drive therapeutic immune responses to cancer</title>
		<author>
			<persName><forename type="first">S</forename><surname>Kreiter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Vormehr</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Van De Roemer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Diken</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Löwer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Diekmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Boegel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Schrörs</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Vascotto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">C</forename><surname>Castle</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">D</forename><surname>Tadmor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">P</forename><surname>Schoenberger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Huber</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Özlem</forename><surname>Türeci</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Sahin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Nature</title>
		<imprint>
			<biblScope unit="volume">520</biblScope>
			<biblScope unit="page" from="692" to="696" />
			<date type="published" when="2015-03">Mar. 2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title level="m" type="main">Building Your Next Big Thing with Google Cloud Platform: A Guide for Developers and Enterprise Architects</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">T</forename><surname>Krishnan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">U</forename><surname>Gonzalez</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2015">2015</date>
			<publisher>Apress</publisher>
			<pubPlace>Berkeley, CA, USA</pubPlace>
		</imprint>
	</monogr>
	<note>1st ed</note>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Fast gapped-read alignment with Bowtie 2</title>
		<author>
			<persName><forename type="first">B</forename><surname>Langmead</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">L</forename><surname>Salzberg</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Nature methods</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="357" to="359" />
			<date type="published" when="2012-03">Mar 2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Ultrafast and memory-efficient alignment of short DNA sequences to the human genome</title>
		<author>
			<persName><forename type="first">B</forename><surname>Langmead</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Trapnell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Pop</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">L</forename><surname>Salzberg</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Genome Biology</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page">R25</biblScope>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Fast and accurate short read alignment with Burrows-Wheeler transform</title>
		<author>
			<persName><forename type="first">H</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Durbin</surname></persName>
		</author>
		<idno type="DOI">10.1093/bioinformatics/btp324</idno>
		<ptr target="http://dx.doi.org/10.1093/bioinformatics/btp324" />
	</analytic>
	<monogr>
		<title level="j">Bioinformatics</title>
		<imprint>
			<biblScope unit="volume">25</biblScope>
			<biblScope unit="issue">14</biblScope>
			<biblScope unit="page" from="1754" to="1760" />
			<date type="published" when="2009-07">Jul 2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Fast and accurate long-read alignment with Burrows-Wheeler transform</title>
		<author>
			<persName><forename type="first">H</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Durbin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Bioinformatics</title>
		<imprint>
			<biblScope unit="volume">26</biblScope>
			<biblScope unit="issue">5</biblScope>
			<biblScope unit="page" from="589" to="595" />
			<date type="published" when="2010-03">Mar 2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">The Sequence Alignment/Map Format and SAMtools</title>
		<author>
			<persName><forename type="first">H</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Handsaker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Wysoker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Fennell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ruan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Homer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Marth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Abecasis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Durbin</surname></persName>
		</author>
		<idno type="DOI">10.1093/bioinformatics/btp352</idno>
		<ptr target="http://dx.doi.org/10.1093/bioinformatics/btp352" />
	</analytic>
	<monogr>
		<title level="j">Bioinformatics</title>
		<imprint>
			<biblScope unit="volume">25</biblScope>
			<biblScope unit="issue">16</biblScope>
			<biblScope unit="page" from="2078" to="2079" />
			<date type="published" when="2009-08">Aug. 2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Long read alignment based on maximal exact match seeds</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Schmidt</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Bioinformatics</title>
		<imprint>
			<biblScope unit="volume">28</biblScope>
			<biblScope unit="issue">18</biblScope>
			<biblScope unit="page" from="i318" to="i324" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">CUSHAW: a CUDA compatible short read aligner to large genomes based on the Burrows-Wheeler transform</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Schmidt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">L</forename><surname>Maskell</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Bioinformatics</title>
		<imprint>
			<biblScope unit="volume">28</biblScope>
			<biblScope unit="issue">14</biblScope>
			<biblScope unit="page">1830</biblScope>
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">RNA-seq: from technology to biology</title>
		<author>
			<persName><forename type="first">S</forename><surname>Marguerat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Bähler</surname></persName>
		</author>
		<idno type="DOI">10.1007/s00018-009-0180-6</idno>
		<ptr target="http://dx.doi.org/10.1007/s00018-009-0180-6" />
	</analytic>
	<monogr>
		<title level="j">Cell Mol Life Sci</title>
		<imprint>
			<biblScope unit="volume">67</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="569" to="579" />
			<date type="published" when="2010-02">Feb 2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Review of massively parallel DNA sequencing technologies</title>
		<author>
			<persName><forename type="first">S</forename><surname>Moorthie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">J</forename><surname>Mattocks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">F</forename><surname>Wright</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">HUGO Journal</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="issue">1-4</biblScope>
			<biblScope unit="page" from="1" to="12" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<author>
			<persName><forename type="first">R</forename><surname>Nadipalli</surname></persName>
		</author>
		<title level="m">HDInsight Essentials</title>
				<imprint>
			<publisher>Packt Publishing</publisher>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
	<note>2nd ed</note>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">The Hadoop Distributed File System</title>
		<author>
			<persName><forename type="first">K</forename><surname>Shvachko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Kuang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Radia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Chansler</surname></persName>
		</author>
		<idno type="DOI">10.1109/MSST.2010.5496972</idno>
		<ptr target="http://dx.doi.org/10.1109/MSST.2010.5496972" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), ser. MSST &apos;10</title>
				<meeting>the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), ser. MSST &apos;10<address><addrLine>Washington, DC, USA</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE Computer Society</publisher>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="1" to="10" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<monogr>
		<title level="m" type="main">Learning Big Data with Amazon Elastic MapReduce</title>
		<author>
			<persName><forename type="first">A</forename><surname>Singh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Rayapati</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2014">2014</date>
			<publisher>Packt Publishing</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Snir</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">W</forename><surname>Otto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">W</forename><surname>Walker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dongarra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Huss-Lederman</surname></persName>
		</author>
		<title level="m">MPI: The Complete Reference</title>
				<meeting><address><addrLine>Cambridge, MA, USA</addrLine></address></meeting>
		<imprint>
			<publisher>MIT Press</publisher>
			<date type="published" when="1995">1995</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Apache Hadoop YARN: Yet another resource negotiator</title>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">K</forename><surname>Vavilapalli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">C</forename><surname>Murthy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Douglas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Konar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Evans</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Graves</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lowe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Shah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Seth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Saha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Curino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>O'Malley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Radia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Reed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Baldeschwieler</surname></persName>
		</author>
		<idno type="DOI">10.1145/2523616.2523633</idno>
		<ptr target="http://doi.acm.org/10.1145/2523616.2523633" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 4th Annual Symposium on Cloud Computing, ser. SOCC &apos;13</title>
				<meeting>the 4th Annual Symposium on Cloud Computing, ser. SOCC &apos;13<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page">16</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<monogr>
		<title level="m" type="main">Hadoop: The Definitive Guide</title>
		<author>
			<persName><forename type="first">T</forename><surname>White</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2012">2012</date>
			<publisher>O&apos;Reilly Media, Inc</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Spark: Cluster computing with working sets</title>
		<author>
			<persName><forename type="first">M</forename><surname>Zaharia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chowdhury</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J</forename><surname>Franklin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Shenker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Stoica</surname></persName>
		</author>
		<ptr target="http://dl.acm.org/citation.cfm?id=1863103.1863113" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, ser. HotCloud&apos;10</title>
				<meeting>the 2nd USENIX Conference on Hot Topics in Cloud Computing, ser. HotCloud&apos;10<address><addrLine>Berkeley, CA, USA</addrLine></address></meeting>
		<imprint>
			<publisher>USENIX Association</publisher>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="10" to="10" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
