<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>June</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Design and evaluation of a genomics variant analysis pipeline using GATK Spark tools</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nicholas Tucci</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jacek Cala</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jannetta Steyn</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Missier</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dipartimento di Ingegneria Elettronica, Universita Roma Tre</institution>
          ,
          <addr-line>Roma</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Computing, Newcastle University</institution>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <volume>2</volume>
      <fpage>4</fpage>
      <lpage>27</lpage>
      <abstract>
        <p>Scalable and efficient processing of genome sequence data, i.e. for variant discovery, is key to the mainstream adoption of High Throughput technology for disease prevention and for clinical use. Achieving scalability, however, requires a significant effort to enable the parallel execution of the analysis tools that make up the pipelines. This is facilitated by the new Spark versions of the well-known GATK toolkit, which offer a black-box approach by transparently exploiting the underlying MapReduce architecture. In this paper we report on our experience implementing a standard variant discovery pipeline using GATK 4.0 with Docker-based deployment over a cluster. We provide a preliminary performance analysis, comparing the processing times and cost to those of the new Microsoft Genomics Services.</p>
      </abstract>
      <kwd-group>
        <kwd>Next Generation Sequencing</kwd>
        <kwd>distributed processing</kwd>
        <kwd>Spark</kwd>
        <kwd>cluster computing</kwd>
        <kwd>genomics</kwd>
        <kwd>variant analysis</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The ability to efficiently analyse human genomes is a key component of the
emerging vision for preventive, predictive, and personalised medicine [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Genome
analysis aims to discover genetic variants that help diagnose genetic diseases in
clinical practice, or predict risk factors e.g. for certain types of cancer [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. A
single exome contains about 10-15GB of data (encoded as a compressed FastQ
file), while a whole genome totals up to 1TB. Depending on the specific kind
of analysis, state of the art variant discovery and interpretation processes may
take up to 10 hours to process a single exome. As whole-genome sequencing
at population scale becomes economically affordable, personalised medicine will
therefore increasingly require scalable variant analysis solutions.
      </p>
      <p>
        With some variations, variant discovery consists of a pipeline where data
flows through a number of well-understood steps, from the raw reads off the
sequencing machine, to a list of functionally annotated variants that can be
interpreted by a clinician. A number of algorithms, often implemented as open
source and publicly available programs, are normally employed to implement
each of the steps. A notable example is the GATK suite of programs from the
Broad Institute [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], which forms the basis for the study presented in this paper,
and is described more in detail below.
      </p>
      <p>
        The most promising approach for improving the efficiency of the pipeline is
to try and exploit the latent parallelism that may be available in some of the
data as well as in the algorithms. In particular, there is increasing evidence that
Hadoop-based implementations of deep genomic pipelines deployed on a
cloud-based cluster can outperform equivalent pipelines that require HPC resources [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
In our own work [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] we have shown that a workflow-based implementation that
runs on a public cloud infrastructure (Azure) scales better than a script-based
HPC version, while providing better cost control. The prevalent approach to
achieve parallelism at the level of the single program (see Sec. 1.2) involves
partitioning the input to the program in such a way that multiple instances can
be executed in parallel, one on each partition, with a merge step at the end.
Clearly, this split-and-merge pattern only works when the data chunks can be
processed independently of one another. In such a case, existing tools can be
wrapped as part of the pattern, without modification. Recently, however, a new
generation of GATK programs has been released (4.0, in beta version at the time
of writing), which re-implement a number of the algorithms as Spark programs.
In this approach, the task of achieving parallelism is essentially delegated to the
Spark infrastructure in combination with HDFS for dataset partitioning.
      </p>
      <p>In this paper we present an initial analysis of the new GATK facilities. We
have implemented the reference GATK pipeline in Spark, using the new 4.0
programs when possible, and by wrapping the programs that have not been
ported to Spark. In the rest of the paper we describe this hybrid approach, report
on the effort involved in deploying the pipeline both on a single-node Spark
configuration and on a cluster, and present an initial performance evaluation on
the Azure cloud for a variety of Spark settings, VM configurations, and cluster
sizes.</p>
      <p>
        When variant discovery pipelines are used for research purposes, transparency
and control over pipeline composition are important factors to consider,
especially in view of the rapid advances in the tools. An example of open-source
platform is the Genome Variant Investigation Platform (GenomeVIP) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which
employs GATK in addition to a number of other third party tools. On the other
end of the spectrum, "black box" variant discovery services are now being
offered, notably the new Microsoft Azure Genomics Services. Thanks to a grant
from Microsoft, we were able to compare the GATK Spark approach with the
new Microsoft Azure Genomics Services. We conclude that the Genomics
Services are currently both faster and more cost-effective, when the Spark pipeline
is deployed on the Azure cloud and the Spark processing times are translated
into commercial rates. These results are preliminary, however, as GATK Spark
tools are still in beta at the time of writing.
      </p>
      <sec id="sec-1-1">
        <title>The Variant analysis pipeline</title>
        <p>
          We begin by describing the target pipeline in some detail. The pipeline is roughly
aligned with the GATK Best Practices guidelines (https://software.broadinstitute.org/gatk/best-practices/) and incorporates the latest
GATK 4.0 Spark tools. Broadly speaking, it consists of three main phases, as
indicated in Fig. 1, namely Pre-processing, Variant Discovery, and Call Set
Refinement. The pre-processing phase takes the input raw exome dataset, in the
FASTQ format, aligns its content (unmapped reads of gene base pairs) against
a reference genome such as hg19 or hg38, using the well-known BWA aligner [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], and it
marks any duplicates, i.e., it flags multiple paired reads that are mapped
to the same start and end positions. These reads often originate erroneously from
DNA preparation methods; they introduce biases that skew variant calling and
should therefore be removed before downstream analysis. The
BQSR (Base Quality Score Recalibration) step then assigns confidence values to
each of the aligned reads, taking into account possible sequencing errors. Finally,
Variant Calling, performed using the GATK Haplotype Caller, identifies both
single-nucleotide polymorphisms (SNPs) as well as insertion/deletion mutations
(Indels).
      </p>
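        <p>
          To make these steps concrete, the sketch below lists the per-sample commands they correspond to when run with the conventional (non-Spark) BWA and GATK 4 tools; the sample name, file paths and the known-sites resource are illustrative placeholders, and Sec. 2 describes the Spark variants actually used in our pipeline.
        </p>
        <preformat>
#!/bin/bash
# Illustrative per-sample pre-processing with conventional (non-Spark) tools.
# Paths and the known-sites resource are placeholders; the reference is assumed
# to be indexed already (bwa index, .fai, .dict); intermediate BAM indexing is
# omitted for brevity.
REF=ref/hg38.fasta
S=PFC_0028

# 1. Align against the reference with BWA and coordinate-sort the output.
bwa mem -t 8 "$REF" fastq/${S}_R1.fastq.gz fastq/${S}_R2.fastq.gz \
  | samtools sort -o ${S}.sorted.bam -

# 2. Mark duplicate read pairs introduced by DNA preparation.
gatk MarkDuplicates -I ${S}.sorted.bam -O ${S}.dedup.bam -M ${S}.dup_metrics.txt

# 3. Base Quality Score Recalibration: build the model, then apply it.
gatk BaseRecalibrator -I ${S}.dedup.bam -R "$REF" \
  --known-sites ref/known_sites.vcf.gz -O ${S}.recal.table
gatk ApplyBQSR -I ${S}.dedup.bam -R "$REF" \
  --bqsr-recal-file ${S}.recal.table -O ${S}.recal.bam

# 4. Variant calling with the Haplotype Caller, emitting a per-sample gVCF.
gatk HaplotypeCaller -I ${S}.recal.bam -R "$REF" -O ${S}.g.vcf.gz -ERC GVCF
        </preformat>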
      <p>
        Multiple variant files (gVCF), one for each sample, are then bundled together
for the next phase, Variant Discovery. The specific steps include producing raw
SNP and Indel VCF files, building recalibration models for those SNPs and Indels,
and refining the genotypes, that is, filtering out genotypes with low estimated
accuracy. The final phase, Variant Annotation, is not part of the Best Practices
and thus may be implemented using a variety of third party tools. We used
Annovar, a well-known tool for functionally annotating genetic variants detected
from diverse genomes [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. As mentioned later, pre-processing time dominates
the entire processing time and thus our performance analysis ignores phases two
and three. However, in the following we highlight some of the implementation
challenges for these steps.
      </p>
      </sec>
      <sec id="sec-1-2">
        <title>Related work</title>
        <p>
          SparkSeq [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] is a general-purpose library for genomic cloud computing built on
top of Spark. Its strengths are its generality and extensibility, as it can be used
to build customised analysis pipelines (in Scala). It appears that the library is
built from the ground up, i.e., without leveraging existing implementations such
as GATK.
      </p>
      <p>
        In contrast, a general big data platform for genome data analysis, called
Gesall, that uses a wrapper approach to reuse existing tools without change is
presented in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Gesall leverages the potential parallelism that is available from
some of the existing tools, for instance BWA, by partitioning its input (SAM and
BAM files) and then managing the parallel execution of multiple BWA instances.
Making this work, however, requires a heavy stack of new MapReduce-based
software to be injected between the data layer (HDFS) and the native tools.
      </p>
      <p>
        A similar approach, namely to segment input data sets and then feed them
to multiple instances of the tools, is presented in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The distinctive element
of the resulting framework is to perform load balancing by dividing
chromosomal regions according to the number of reads mapped to each chromosome,
as opposed to natural chromosome boundaries. This equalizes the size of each
data chunk and, in addition to in-memory data management, achieves
substantial speedup over a functionally equivalent but naively implemented Hadoop
MapReduce-based solution. The advantages of in-memory processing for efficient
genome analysis have also been demonstrated recently in other ad hoc
frameworks [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Yet another parallel version of a genomics pipeline that operates
by partitioning the input data files is described in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. In this instance, however,
some of the tools have been re-implemented (as opposed to simply wrapped) to
explicitly leverage the embarrassingly parallel steps of the pipeline.
      </p>
      <p>In contrast to these efforts, in our experiments we aim to show the potential
of the tool re-implementation approach offered by the GATK 4.x tool suite,
whose tools are being incrementally ported to the Spark architecture.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Spark hybrid pipeline implementation</title>
      <p>As mentioned, the main motivation for undertaking this work has been to
experiment with a Spark implementation of the GATK Best Practices pipeline,
based on the recent release of GATK 4.0. Not only are these tools natively built
for Spark but, compared to the previous version (GATK 3.8), they are also
better integrated with each other, for instance avoiding writing intermediate files
to disk in order to increase efficiency.</p>
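      <p>
        As an illustration, the sketch below submits one of these tools to a Spark cluster, assuming the GATK 4.0 command-line conventions (the launcher is called gatk-launch in some beta releases); host names, HDFS paths and the exact tool arguments are indicative only.
      </p>
      <preformat>
# Hedged sketch: running a GATK 4.0 Spark tool against a standalone Spark master.
# Host names, HDFS paths and tool arguments are illustrative.
gatk BwaAndMarkDuplicatesPipelineSpark \
  -I hdfs://namenode:8020/samples/PFC_0028.unmapped.bam \
  -R hdfs://namenode:8020/ref/hg38.fasta \
  -O hdfs://namenode:8020/samples/PFC_0028.dedup.bam \
  -- \
  --spark-runner SPARK \
  --spark-master spark://spark-master:7077 \
  --executor-memory 16g --executor-cores 4
      </preformat>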
      <p>At the time of writing, however, these new versions of the tools are
limited to the pre-processing phase: BwaAndMarkDuplicatesPipelineSpark,
BQSRPipelineSpark and HaplotypeCallerSpark (Fig. 1). Thus, the
implementation necessarily required a hybrid approach, whereby pre-processing used the
new Spark tools, while for the rest of the pipeline we used a wrapper method.
For this, Spark offers a transformation called Pipe, which "pipes each partition
of the RDD through a shell command, e.g. a Perl or bash script. RDD elements
are written to the process's stdin and lines output to its stdout are returned
as an RDD of strings". Thus, Pipe allows Bash scripts to execute from within
Spark, but not efficiently, as pipelining across the steps requires the content of
intermediate RDDs to be written out to files and then be read back in.
Looking at Fig. 1, it should be clear that the variant discovery phase is a potential
bottleneck, as it must process the entire batch of samples, with no parallelism
available. However, as it turns out its processing time is negligible compared to
that of pre-processing.
      </p>
      <p>[Fig. 1. The variant analysis pipeline. For each of the N samples, pre-processing (FastqToSam, BwaAndMarkDuplicatesPipelineSpark, BQSRPipelineSpark, HaplotypeCallerSpark) uses the GATK 4.0 Spark tools; the batch-level variant discovery steps (Genotype VCFs, Recalibration, Genotype Refinement, Select Variants) use wrapped GATK 3.8 tools, followed by annotation (ANNOVAR, IGM Anno, Exonic Filter).]</p>
      <p>
The hybrid native Spark/wrapper approach works well for a single-node
deployment, as the entire pipeline can be launched using a single bash script that
encapsulates the communication with the Spark driver. For a batch of N
samples, the spark-submit command spawns one iteration per sample for the
pre-processing (BwaAndMarkDuplicatesPipelineSpark, BQSRPipelineSpark, and
HaplotypeCallerSpark), followed by a single VariantDiscovery for the entire
batch and N calls of CallsetRefinement. The results produced by the
execution have been validated against those obtained from our more established,
workflow-based pipeline as described in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
In theory, Spark is designed to facilitate the seamless scaling out of
applications over a cluster, with virtually no change to the code. The pre-processing
phase of our pipeline would benefit the most from distribution, as it consists
of native Spark applications as explained earlier. In reality, the deployment of a
complex multi-tool pipeline like the one described requires substantial additional
effort, mainly due to the requirement for Spark tools to read input and reference
datasets from an HDFS data layer.
      </p>
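      <p>
        A skeleton of such a driver script is sketched below; the per-stage helper scripts it calls (fastq_to_sam.sh, preprocess_spark.sh, variant_discovery.sh, refine_callset.sh) are hypothetical wrappers around the corresponding GATK invocations.
      </p>
      <preformat>
#!/bin/bash
# Skeleton of the single-node driver script; the helper scripts are hypothetical
# wrappers around the GATK tools named in the text.
set -e
SAMPLES=$(ls fastq/*_R1.fastq.gz | xargs -n1 basename | sed 's/_R1.fastq.gz//')

# Per sample: FastqToSam (non-Spark), then the three GATK 4.0 Spark tools.
for S in $SAMPLES; do
  ./fastq_to_sam.sh "$S"        # Picard FastqToSam (wrapped)
  ./preprocess_spark.sh "$S"    # BwaAndMarkDuplicates, BQSR, HaplotypeCaller (Spark)
done

# Whole batch: joint variant discovery with the wrapped GATK 3.8 tools.
./variant_discovery.sh $SAMPLES

# Per sample: call set refinement and annotation.
for S in $SAMPLES; do
  ./refine_callset.sh "$S"
done
      </preformat>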
      <p>Commercial solutions such as Microsoft Azure HDInsight provide a
preconfigured environment ready to execute Spark in cluster mode. This comes at a
substantial cost, however (about twice the cost of an unconfigured set of VMs).
We therefore undertook the challenge of a manual Spark cluster configuration.
In this section we report on our experience realising a distributed version of the
pipeline using a virtualisation approach, based on Docker Swarm technology
(https://docs.docker.com/engine/swarm/).
Our conclusion is that while Swarm greatly simplifies deployment, manual effort
is still required especially to satisfy the data access requirements of the
various components, and limitations are incurred for the fragments of the pipeline
that are implemented using the wrapper method as explained earlier. Also, a
communication overhead associated with a distributed execution, as we show in Sec. 3.</p>
      <p>Swarm extends Docker by providing seamless and automated distribution of
Docker containers over a cluster of VMs. A swarm is a group of machines (nodes)
that run Docker containers and are joined into a cluster. The usual Docker
commands are executed on a cluster by a Swarm Manager.</p>
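      <p>
        For reference, creating such a swarm takes only a few Docker commands; the IP address and the token variable below are placeholders.
      </p>
      <preformat>
# On the VM chosen as the Swarm Manager (IP address is a placeholder):
docker swarm init --advertise-addr 10.0.0.10

# 'swarm init' prints a worker join token; on each of the other VMs:
docker swarm join --token "$WORKER_TOKEN" 10.0.0.10:2377

# Back on the manager, verify that all nodes have joined:
docker node ls
      </preformat>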
      <p>Swarm managers may employ several strategies to run containers, such as
"emptiest node", which fills the least utilized machines with containers, or
"global", which ensures that each machine gets exactly one instance of the
specified container. Swarm managers are the only machines in a swarm that can
execute user commands, or authorize other machines to join the swarm as
workers. Workers only provide capacity and do not have the authority to tell any
other machine what it can or cannot do. In this context, a service is an image
for an application that resides in a container and that is deployed over a swarm.</p>
      <p>We have used Docker Swarm to deploy both Spark and HDFS over a cluster
of nodes, using Docker Hub and Docker images provided by Big Data Europe
(https://www.big-data-europe.eu/), as follows. The first step is to create a Swarm, which in our test cluster consists
of three nodes: a Swarm Manager and two Swarm Workers as shown in Fig. 2.
As both Spark and HDFS adopt a Master-Slave architecture, the masters (Spark
Master and HDFS Namenode) are deployed on the Swarm Manager. The Slaves
(Spark workers and HDFS Data nodes) are deployed globally, that is, one replica
is allocated to each node in the Swarm, including the Swarm Manager node. The
Docker containers that host these images are connected through a dedicated
overlay network.</p>
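      <p>
        A sketch of the corresponding service definitions is given below, using the Spark and Hadoop images published by Big Data Europe; service names are ours, and the environment variables these images require (master URLs, cluster name, and so on) are omitted for brevity.
      </p>
      <preformat>
# Dedicated overlay network connecting all containers of the deployment.
docker network create --driver overlay spark-net

# Masters (Spark Master, HDFS Namenode) constrained to the Swarm Manager node.
docker service create --name spark-master --network spark-net \
  --constraint node.role==manager bde2020/spark-master
docker service create --name hdfs-namenode --network spark-net \
  --constraint node.role==manager bde2020/hadoop-namenode

# Slaves deployed globally: one Spark worker and one HDFS datanode per node.
docker service create --name spark-worker --network spark-net \
  --mode global bde2020/spark-worker
docker service create --name hdfs-datanode --network spark-net \
  --mode global bde2020/hadoop-datanode
      </preformat>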
      <p>Shared data, including all input samples, reference databases, GATK
libraries, etc., reside on HDFS and are therefore naturally distributed and
replicated over the Data nodes across the cluster. For the most part, this achieves
location transparency as tools need only access the data through Spark HDFS
drivers (readers and writers). There are two exceptions, however. Firstly,
non-Spark tools expect data to be accessible on a local file system. This is achieved
by mounting HDFS Data nodes as virtual Docker volumes so they are accessible
from within a Docker container. Secondly, the reference genome had to be
replicated to each local Worker file system (see reference image in Fig. 2). This is
achieved by encapsulating the dataset itself as a Docker image, which
is then automatically deployed by Swarm using the "global" Swarm mode, as
indicated above. One advantage of this encapsulation approach is that it makes it
easier to upgrade the reference genome, e.g. from hg19 to hg38.p1, the most recent.
A key observation, already made earlier, is that none of the non-Spark programs
that make up the pipeline can be distributed. This is the case for the initial
step, FastqToSam, as well as for all the steps after pre-processing, which are
necessarily executed on the Spark Master container. As the processing time is
linear in the number of samples, this justifies allocating a larger VM to the Spark
Master.</p>
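      <p>
        One hypothetical realisation of the reference-genome image described above is sketched here: the image bundles the reference under a fixed directory and, when started as a global service, copies it onto a bind-mounted host directory on every node. The Dockerfile, registry name and host path are illustrative only.
      </p>
      <preformat>
# Hypothetical Dockerfile.ref (contents shown as comments); image and paths illustrative:
#   FROM alpine:3.7
#   COPY hg38/ /ref/
#   CMD sh -c 'cp -r /ref/. /local-ref/; sleep infinity'
docker build -f Dockerfile.ref -t registry.local/reference-genome:hg38 .
docker push registry.local/reference-genome:hg38

# Deploy globally so that every Swarm node receives a local copy of the reference,
# written to a bind-mounted host directory that the pipeline containers can also mount.
docker service create --name reference-genome --mode global \
  --mount type=bind,source=/data/reference,target=/local-ref \
  registry.local/reference-genome:hg38
      </preformat>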
      <p>With this in mind, execution on a cluster consists of four main steps,
controlled by a master bash script. These are summarised in Fig. 3. The first step,
FastqToSam, is non-Spark and produces local uBAM files, which then need to be
distributed across the HDFS nodes (step 2) to be made available to the Spark
pre-processing tools (step 3). As explained, these tools communicate through
HDFS files and at the time of writing are not easy to integrate more deeply,
i.e., by sharing intermediate datasets using Spark process memory. Finally, step
4 consists of the execution of non-Spark tools, again on the Spark Master. This
requires that outputs that reside on HDFS be moved back to the local file system.</p>
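      <p>
        The HDFS staging in steps 2 and 4 amounts to standard hdfs dfs transfers, for example (paths and sample name are illustrative):
      </p>
      <preformat>
# Step 2: distribute the locally produced uBAM files to HDFS.
hdfs dfs -mkdir -p /samples
hdfs dfs -put -f local/PFC_0028.unmapped.bam /samples/

# Step 3 then runs the Spark pre-processing tools against hdfs:// paths.

# Step 4: copy the pre-processing outputs back to the local file system for the
# wrapped, non-Spark tools running on the Spark Master.
hdfs dfs -get /samples/PFC_0028.g.vcf local/results/
      </preformat>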
      <p>In summary, the deployment may benefit from a partial porting of GATK
tools to Spark; however, non-GATK tools that escape this porting effort represent
bottlenecks, firstly because they run in centralised mode, and secondly because
of the different file infrastructure they require. Also, Spark tools appear to be
designed in isolation, without attempting to eliminate intermediate data passing
through HDFS reads and writes.</p>
    </sec>
    <sec id="sec-3">
      <title>Experimental evaluation</title>
      <p>In this section we report on preliminary results on the performance of the
pipeline. For these experiments we used 6 exomes, from anonymised patients
obtained from the Institute of Genetic Medicine at Newcastle University. These
samples come with naturally slightly different sizes. Our sample sizes are in the
range 10.8GB-15.9GB, with an average of 13.5GB (compressed). Using these samples,
we analysed the runtime of the pipeline implementation described in Sec. 2,
comparing the deployment modes described in the previous section, namely a
single-node Spark model, known as "pseudo-cluster" mode, with a cluster mode
configuration with up to four nodes. In both cases, all nodes are identical virtual
machines on the Azure cloud with 8 cores and 55GB RAM. Our experiments aim
to compare the effect of various Spark settings for each of these configurations.</p>
      <p>We focused exclusively on the pre-processing phase, where the bulk of the
processing occurs. Specifically, BWA alignment and duplicate marking (denoted
BWA/MD in the following) accounts for 38% of the processing time, Base
Quality Score Recalibration Processing (BQSRP) for 11%, and variant calling using
the Haplotype Caller (HC) 39%. The rest of the pipeline, which only accounts
for 12% of the processing, was not considered further in these experiments.</p>
      <p>Four settings were used to tune the Spark configuration, indicated in the
charts as X/Y/W/Z, where X is the driver process memory, Y the number of
executors, W the number of cores allocated to each executor, and Z the memory
allocated to each executor.</p>
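      <p>
        For example, the 20/2/4/16 configuration maps onto the standard Spark submission options shown below (with the GATK Spark runner the same options are appended after the "--" separator); the application jar is a placeholder.
      </p>
      <preformat>
# X/Y/W/Z = 20/2/4/16: 20GB driver memory, 2 executors, 4 cores and 16GB RAM per executor.
# On YARN the executor count is set with --num-executors; on a standalone master it
# follows from the total core budget divided by the cores per executor.
spark-submit \
  --master spark://spark-master:7077 \
  --driver-memory 20g \
  --executor-memory 16g \
  --executor-cores 4 \
  --total-executor-cores 8 \
  pipeline-step.jar   # placeholder application jar
      </preformat>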
      <p>Charts 4(a) and 4(b) show the processing for two configurations: 20/2/4/16
and 20/4/2/8 respectively, for each of the six samples (ordered by size) and with
a breakdown for each pre-processing tool. Both charts show a slight increase
in processing time as the sample size increases (with an unexplained anomaly
on the 13GB sample in both cases). These times are not significantly affected
by the differences in configuration. Indeed, if we normalise the processing time
by the input size, we observe very similar figures across the two configurations
and for each tool, as shown in Fig. 5(a). Specifically, for the two configurations
BWA/MD, BQSRP, and HC report an average of 19.3 vs 18.4 minutes/GB, 5.6
vs 5.3 minutes/GB, and 20.2 vs 19.14 minutes/GB, respectively.</p>
      <p>For a deeper analysis of the effect of Spark settings, we then ran the pipeline
on a single representative sample (PFC 0028, 14.2GB) with two additional
settings, 10/4/2/8 and 10/8/1/6. Fig. 5(b) shows the results, with processing times
normalised by sample size for ease of comparison with the previous chart. Again,
there is no indication that these four settings are critical in affecting the
processing times.</p>
      <p>More significant is the difference in processing time achieved by adding
resources to the VMs. Fig. 6(a) shows a nearly ideal speedup as we double the
number of cores (with a constant 55GB RAM per 8 cores, i.e. 110GB for 16
cores). It seems, however, that the Spark tools do not benefit from a larger VM
beyond 16 cores. Note that the chart in Fig. 6(a) does not include the processing
time for HC, as this took an unusually long time to run on a 16-core
configuration. This was due to an issue with a low-level library in the HC implementation,
which was not resolved at the time of writing.</p>
      <p>[Fig. 6. Pre-processing times (BWA/MD + BQSRP): (a) on a single node, as the number of cores grows from 8 to 16 to 32 (55GB RAM per 8 cores); (b) in cluster mode, on 1 to 4 nodes with 8 cores/55GB RAM each.]</p>
      <p>As expected, running Spark in the cluster mode shows a speedup as we
increase the number of nodes, as shown in Fig. 6(b). However, we also note that
scaling out, that is, adding nodes, may incur an overhead that makes it less
efficient than scaling up (i.e. adding cores and memory to a single-node
configuration). For instance, 2 nodes with 8 cores each take 229 minutes, while
a single node with 16 cores takes 165 minutes. This overhead is less noticeable
when using 32 cores, which as we noted earlier does not improve processing time
on a single host (175 minutes, Fig. 6(a)), while a 4-node cluster with 8 cores per node takes 137
minutes, a further improvement over the other configurations.</p>
      <sec id="sec-3-1">
        <title>Comparing with Microsoft Genomics Services</title>
        <p>
our patient samples using the new Microsoft Genomics Services. These services
execute precisely the pre-processing steps of the pipeline, making it easier to
compare with our results. The processing time for our reference PFC 0028
sample is an impressive 77 minutes (compare with the best time of 446 minutes on
a single node, obtained from the gures in Fig. 6(a)) to which the average HC
processing time has been added). However, at the time of writing these services
were only o ered as a black box that runs on a single, high-end virtual machine
of undisclosed speci cations. In terms of pricing, the current charges for using
Genomics Services are $0.217 / GB, which translates to about $18.61 for
processing our six samples. For comparison, the cost of processing the same samples
using our pipeline with a 8 cores, 55GB con guration is estimated at $28.
4</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>We have presented an experimental evaluation of the design effort involved in
implementing a genomics variant discovery pipeline using the recently released
GATK Spark tools from the Broad Institute, and a performance analysis based
on a single node and a small cluster configuration. Our analysis is preliminary, as
the GATK 4.x tools are still very recent. Non-GATK tools, or those that have not
yet been ported, represent bottlenecks: firstly, because they run in centralised
mode, and secondly because of the different file infrastructure they require. Also,
Spark tools appear to be designed in isolation, without attempting to eliminate
intermediate data passing through HDFS reads and writes.</p>
      <p>Compared with the processing times reported for the Microsoft Azure
Genomics Services, it appears that using Spark with the recent beta version of
GATK tools is currently not economically competitive and thus is not
recommended for operational use in clinical settings. This may change, however, as the
GATK Spark tools mature. On the plus side, our implementation offers complete
control over the evolution of the pipeline over time, a key requirement especially
in a genetic research setting.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>The authors are grateful to Microsoft for the Azure for Research grant that made
it possible to experiment with Azure Genomics Services.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Cala</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marei</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Takeda</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Missier</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Scalable and Efficient Whole-exome Data Processing Using Workflows on the Cloud</article-title>
          .
          <source>Future Generation Computer Systems</source>
          ,
          <volume>65</volume>
          (Special Issue: Big Data in the Cloud) (
          <year>2016</year>
          ), http://dx.doi.org/10.1016/j.future.2016.01.001
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Hood</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Friend</surname>
            ,
            <given-names>S.H.</given-names>
          </string-name>
          :
          <article-title>Predictive, personalized, preventive, participatory (P4) cancer medicine</article-title>
          .
          <source>Nature reviews Clinical oncology 8(3)</source>
          ,
          <volume>184</volume>
          (
          <year>2011</year>
          ), http://dx. doi.org/10.1038/nrclinonc.
          <year>2010</year>
          .227
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Durbin</surname>
          </string-name>
          , R.:
          <article-title>Fast and accurate long-read alignment with Burrows-Wheeler transform</article-title>
          .
          <source>Bioinformatics</source>
          (Oxford, England)
          <volume>26</volume>
          (
          <issue>5</issue>
          ),
          <fpage>589</fpage>
          -
          <lpage>595</lpage>
          (Mar
          <year>2010</year>
          ). https://doi.org/10.1093/bioinformatics/btp698
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tan</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>High-performance genomic analysis framework with in-memory computing</article-title>
          .
          <source>SIGPLAN Not</source>
          .
          <volume>53</volume>
          (
          <issue>1</issue>
          ),
          <fpage>317</fpage>
          -
          <lpage>328</lpage>
          (Feb
          <year>2018</year>
          ). https://doi.org/10.1145/3200691.3178511
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Mashl</surname>
            ,
            <given-names>R.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scott</surname>
            ,
            <given-names>A.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>K.l.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wyczalkowski</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yoon</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Niu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>DeNardo</surname>
          </string-name>
          , E.,
          <string-name>
            <surname>Yellapantula</surname>
            ,
            <given-names>V.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Handsaker</surname>
            ,
            <given-names>R.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koboldt</surname>
            ,
            <given-names>D.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ye</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fenyö</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raphael</surname>
            ,
            <given-names>B.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wendl</surname>
            ,
            <given-names>M.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ding</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Genomevip: a cloud platform for genomic variant discovery and interpretation</article-title>
          .
          <source>Genome Research</source>
          <volume>27</volume>
          (
          <issue>8</issue>
          ),
          <fpage>1450</fpage>
          -
          <lpage>1459</lpage>
          (
          <year>2017</year>
          ). https://doi.org/10.1101/gr.211656.116, http://genome.cshlp.org/content/27/8/1450.abstract
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Mushtaq</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Al-Ars</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>Cluster-based Apache Spark implementation of the GATK DNA analysis pipeline</article-title>
          .
          <source>In: Bioinformatics and Biomedicine (BIBM)</source>
          ,
          <year>2015</year>
          IEEE International Conference on. pp.
          <fpage>1471</fpage>
          -
          <lpage>1477</lpage>
          .
          <string-name>
            <surname>IEEE</surname>
          </string-name>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Roy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Diao</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Evani</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abhyankar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Howarth</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le Priol</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bloom</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Massively parallel processing of whole genome sequence data: an in-depth performance study</article-title>
          .
          <source>In: Proceedings of the 2017 ACM International Conference on Management of Data</source>
          . pp.
          <fpage>187</fpage>
          -
          <lpage>202</lpage>
          .
          <string-name>
            <surname>ACM</surname>
          </string-name>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Siretskiy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sundqvist</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Voznesenskiy</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Spjuth</surname>
            ,
            <given-names>O.:</given-names>
          </string-name>
          <article-title>A quantitative assessment of the Hadoop framework for analyzing massively parallel DNA sequencing data</article-title>
          .
          <source>GigaScience 4</source>
          ,
          <issue>26</issue>
          (
          <year>2015</year>
          ). https://doi.org/10.1186/s13742-015-0058-5
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Tian</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Price</surname>
            ,
            <given-names>N.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hood</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Systems cancer medicine: towards realization of predictive, preventive, personalized and participatory (P4) medicine</article-title>
          .
          <source>Journal of Internal Medicine (271)</source>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Van der Auwera</surname>
            ,
            <given-names>G.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carneiro</surname>
            ,
            <given-names>M.O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hartl</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Poplin</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Del Angel</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Levy-Moonshine</surname>
          </string-name>
          , A.,
          <string-name>
            <surname>Jordan</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shakir</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roazen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thibault</surname>
          </string-name>
          , J.:
          <article-title>From FastQ data to high-confidence variant calls: the genome analysis toolkit best practices pipeline</article-title>
          . Current protocols in bioinformatics pp.
          <volume>10</volume>
          -
          <issue>11</issue>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hakonarson</surname>
          </string-name>
          , H.:
          <article-title>Annovar: functional annotation of genetic variants from high-throughput sequencing data</article-title>
          .
          <source>Nucleic Acids Research</source>
          <volume>38</volume>
          (
          <issue>16</issue>
          ),
          <year>e164</year>
          (
          <year>2010</year>
          ). https://doi.org/10.1093/nar/gkq603, http://dx.doi.org/10.1093/ nar/gkq603
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Wiewiorka</surname>
            ,
            <given-names>M.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Messina</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pacholewska</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , Maffioletti, S.,
          <string-name>
            <surname>Gawrysiak</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Okoniewski</surname>
            ,
            <given-names>M.J.:</given-names>
          </string-name>
          <article-title>SparkSeq: fast, scalable and cloud-ready tool for the interactive genomic data analysis with nucleotide precision</article-title>
          .
          <source>Bioinformatics</source>
          (Oxford, England)
          <volume>30</volume>
          (
          <issue>18</issue>
          ),
          <fpage>2652</fpage>
          -
          <lpage>2653</lpage>
          (Sep
          <year>2014</year>
          ). https://doi.org/10.1093/bioinformatics/btu343
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>