<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Scalability Potential of BWA DNA Mapping Algorithm on Apache Spark</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Zaid Al-Ars</string-name>
          <email>z.al-ars@tudelft.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hamid Mushtaq</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Engineering Lab, Delft University of Technology 2628 CD Delft</institution>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <fpage>85</fpage>
      <lpage>88</lpage>
      <abstract>
        <p>This paper analyzes the scalability potential of embarrassingly parallel genomics applications using the Apache Spark big data framework and compares their performance with native implementations as well as with Apache Hadoop scalability. The paper uses the BWA DNA mapping algorithm as an example due to its good scalability characteristics and due to the large data files it uses as input. Results show that simultaneous multithreading improves the performance of BWA for all systems, increasing performance by up to 87% for Spark on Power7 with 80 threads as compared to 16 threads (# of physical cores). In addition, Hadoop has slightly better performance of up to 17% for low system utilization, while Spark has up to 27% better performance for high system utilization. Furthermore, Spark is able to sustain high performance when the system is over-utilized, while the performance decreases for Hadoop as well as the native implementation.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>With the fast increase in the sizes of genomics datasets
and the growing throughput of DNA sequencing
machines, there is an urgent need to develop scalable,
high-performance computational solutions to address
these challenges. A number of different approaches
are being investigated as possible solutions, ranging
from highly connected, customized server-based
solutions (Kelly15) to Hadoop-based big data
infrastructures (Decap15). Predominantly, however, classical
computer cluster-based solutions are the most widely
used computational approach, either used locally or in
the cloud (Stein10).</p>
      <p>Each of these solutions has advantages and
disadvantages as it relates to the scalability potential on
large computer infrastructures. Customized solutions
are expensive to design, but have the advantage of
being highly optimized for the specific analysis pipeline
being performed. Classical cluster-based solutions are
more generic, making them less costly, but also less
effective. Genomics analysis problems, however, offer
a huge amount of parallelism, since the large input
datasets used in the analysis can be segmented. This
creates the opportunity of using recent big
data solutions to address these analysis pipelines.</p>
      <p>In this paper, we investigate the scalability potential
of applying Apache Spark to genomics problems and
compare its performance to both the native scalable
implementation developed for classical clusters, as well as
to Hadoop-based big data solutions (Zaharia12). This
analysis is performed using a popular DNA mapping
algorithm called the Burrows-Wheeler Aligner
(BWA-MEM), which is known for its speed and high
scalability potential on classical clusters (Li13).</p>
      <p>This paper is organized as follows. Section 2
discusses typical genomics analysis pipelines and the
different stages of the analysis. Section 3 presents the
BWA-MEM algorithm, evaluates its scalability
potential and discusses some of its computational
limitations. Section 4 presents and compares the
scalability potential of native implementations, Hadoop and
Spark on the used test computer systems. Section 5
evaluates the effectiveness of these different solutions
for genomics analysis by evaluating the scalability of
BWA-MEM. Section 6 ends with the conclusions.
</p>
    </sec>
    <sec id="sec-2">
      <title>Genomics pipelines</title>
      <p>In this section, we discuss the basic steps of a so-called
reference-based DNA analysis pipeline, used for
analyzing the mutations in a DNA dataset using a known
reference genome.</p>
      <sec id="sec-2-1">
        <title>DNA sequencing</title>
        <p>The first step in any genome analysis pipeline starts by
acquiring the DNA data using sequencing machines.
This is done in a massively parallel fashion with
millions of short DNA pieces (called short reads) being
sequenced at the same time. These reads are stored in
large files of sizes ranging from tens to hundreds of
gigabytes. One standard file format used today to store
these reads is called the FASTQ file format (Jones12).</p>
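        <p>As a minimal sketch of this layout, the following Python snippet parses the 4-line FASTQ records described above (the read names and sequences are hypothetical examples, not data from this paper):</p>

```python
# Sketch of the 4-line FASTQ record layout: @id / sequence / "+" / quality.
def parse_fastq(lines):
    """Yield (read_id, sequence, quality) tuples from FASTQ lines."""
    for i in range(0, len(lines), 4):
        read_id = lines[i].lstrip("@").strip()
        sequence = lines[i + 1].strip()
        quality = lines[i + 3].strip()  # line i+2 is the "+" separator
        yield read_id, sequence, quality

records = list(parse_fastq([
    "@read1", "ACGTACGT", "+", "IIIIIIII",
    "@read2", "TTGCAACG", "+", "FFFFFFFF",
]))
```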
      </sec>
      <sec id="sec-2-2">
        <title>Read mapping</title>
        <p>The second step is used to assemble the short reads into
a full genome, by mapping the short reads to a
reference genome. This is one of the first computational
steps in any genomics analysis pipeline, needed to
reconstruct the genome. At the same time, it is one of the
most computationally intensive steps, requiring a lot of
CPU time. The output represents a mapping of the
possible locations of a specific read in the FASTQ file to
a specific reference genome. BWA-MEM is one of the
most widely used DNA mapping programs (Li13).</p>
      </sec>
      <sec id="sec-2-3">
        <title>Variant calling</title>
        <p>The third step is called variant calling, which uses
algorithms to identify the variations of a mutated DNA
as compared to a reference DNA. This analysis is
becoming standard in the field of genomics diagnostics,
where DNA analysis is used to advise clinical decision.
The Genome Analysis Toolkit (GATK) and SAMtools
are two widely used software tools for variant
calling (Pabinger13).
</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>BWA mapping algorithm</title>
      <p>BWA-MEM is one of the most widely used DNA
mapping algorithms, ensuring both high throughput and
scalability when processing the large datasets used in genomics. This
section discusses the BWA-MEM algorithm which we
use as an example for parallel algorithms used in big
data application domains.</p>
      <p>BWA-MEM, as well as many other DNA mapping
algorithms, is based on the observation that two DNA
sequences of the same organism are likely to contain
short highly matched subsequences. Therefore, they
can follow a strategy which consists of two steps: 1)
seeding and 2) extension. The seeding step is to first
locate the regions within the reference genome where a
subsequence of the short read is highly matched. This
subsequence is known as a seed, which is an exact
match to a subsequence in the reference genome. After
seeding, the remaining read is aligned to the reference
genome around the seed in the extension step using the
Smith-Waterman algorithm (Houtgast15).</p>
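      <p>The extension step can be illustrated with a minimal Smith-Waterman scoring sketch in Python; the scoring parameters below are toy values for illustration, not BWA-MEM's actual defaults:</p>

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local-alignment score between sequences a and b
    using the classic Smith-Waterman dynamic programming recurrence."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # local alignment: scores are clamped at zero
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best
```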
      <p>In BWA-MEM, before starting the read alignment,
an index of the reference genome is created. This is a
one-time step and hence not on the critical execution
path. In our discussion, we assume that an index is
already present. The different execution stages of the
BWA-MEM read alignment algorithm are described below.
The first two stages belong to seeding.</p>
      <sec id="sec-3-1">
        <title>SMEM generation</title>
        <p>BWA-MEM first computes the so-called
super-maximal exact matches (SMEMs). An SMEM is a
subsequence of the read that exactly matches the
reference DNA and cannot be further extended in either
direction. Moreover, it must not be contained in
another match.</p>
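        <p>The two defining conditions (no extension in either direction, no containment in another match) can be sketched with a brute-force Python search; real BWA-MEM finds SMEMs via an FM-index, so this is only an illustration of the definition:</p>

```python
def maximal_exact_matches(read, ref, min_len=2):
    """Brute-force sketch: substrings of `read` that occur in `ref` and
    cannot be extended left or right, with contained matches dropped."""
    matches = []
    n = len(read)
    for start in range(n):
        end = start
        # grow the match as far right as possible (right-maximality)
        while end < n and read[start:end + 1] in ref:
            end += 1
        length = end - start
        if length >= min_len:
            # left-maximality: extending one char to the left must fail
            if start == 0 or read[start - 1:end] not in ref:
                matches.append((start, read[start:end]))
    # drop matches contained in a longer one (the SMEM condition)
    return [(s, m) for s, m in matches
            if not any(s2 <= s and s2 + len(m2) >= s + len(m) and m != m2
                       for s2, m2 in matches)]
```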
      </sec>
      <sec id="sec-3-2">
        <title>Suffix array lookup</title>
        <p>The suffix array lookup stage is responsible for locating
the actual starting position of the SMEM in the
reference genome. An SMEM with its known starting
position(s) in the reference genome forms seed(s) in the
reference.
Seeds are subsequences of the read that are exactly
matching in the reference genome. To align the whole
read against the reference genome, these seeds are
extended in both directions. This extension is
performed using a dynamic programming algorithm based
on Smith-Waterman.</p>
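        <p>The lookup itself can be sketched with a plain sorted suffix array and binary search; production aligners use compressed FM-index structures instead, so this Python version only illustrates the principle (the reference string here is a toy example):</p>

```python
import bisect

def build_suffix_array(ref):
    """Sort all suffix start positions of `ref` lexicographically."""
    return sorted(range(len(ref)), key=lambda i: ref[i:])

def locate(seed, ref, sa):
    """Return all positions in `ref` where `seed` occurs, via binary
    search over the sorted suffixes (assumes seed chars sort below \xff)."""
    suffixes = [ref[i:] for i in sa]
    lo = bisect.bisect_left(suffixes, seed)
    hi = bisect.bisect_right(suffixes, seed + "\xff")
    return sorted(sa[lo:hi])

ref = "ACGTACG"
sa = build_suffix_array(ref)
positions = locate("ACG", ref, sa)  # "ACG" occurs at positions 0 and 4
```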
      </sec>
      <sec id="sec-3-3">
        <title>Output</title>
        <p>The read alignment information is written to a file in
the SAM (sequence alignment/map) format (Li09).
</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Baseline performance analysis</title>
      <p>The increasing sizes of big data files have called for
a continued effort to develop systems capable of
managing such data sizes and enabling the needed
scalability to process them efficiently. Hadoop MapReduce
is the current industry system of choice for big data
systems, which uses the in-disk Hadoop distributed file
system (HDFS). Apache Spark is emerging as a strong
competitor in the field due to its in-memory resilient
distributed datasets. This section compares these two
systems using the WordCount benchmark to identify a
baseline for the BWA-MEM implementation.</p>
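      <p>The semantics of the WordCount benchmark can be sketched as a toy map/shuffle/reduce pipeline in plain Python; this mirrors what the Hadoop and Spark versions compute, not their actual distributed APIs:</p>

```python
from collections import defaultdict

def wordcount(lines):
    """Toy single-process map/shuffle/reduce pipeline for WordCount."""
    # map: emit a (word, 1) pair for every word in every line
    pairs = [(word, 1) for line in lines for word in line.split()]
    # shuffle: group the emitted counts by key
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    # reduce: sum the counts per key
    return {word: sum(counts) for word, counts in groups.items()}

counts = wordcount(["spark hadoop spark", "hadoop"])
```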
      <p>We tested the WordCount application on two
different kinds of machines. The first one is an IBM
PowerLinux 7R2 with two Power7 CPUs and 8 physical
cores each. The Power7 cores are capable of executing
4 simultaneous threads. The second machine is an
Intel Linux server with two Xeon CPUs and 10 physical
cores each. The Xeon cores are capable of executing 2
simultaneous threads.</p>
      <p>The results for the WordCount application are shown
for the IBM Power7 and Intel Xeon in Figures 1 and 2,
respectively. For the Hadoop version, the input files
are read from the HDFS file system, while for Spark,
the files are read from the local file system. We can
see that on both machines, the Spark version is faster
than the Hadoop version.</p>
      <p>[Figures 1 and 2: WordCount execution time versus thread count for the native, Hadoop and Spark versions on the Power7 and Xeon, respectively.]</p>
      <p>Moreover, there is an
approximately constant increase in the execution time of
Hadoop as compared to that of Spark for the different
thread counts on both machines. This indicates a
constant added overhead for Hadoop-based programs over
Spark for different numbers of threads. This overhead is
partly incurred by Hadoop having to access files from
the HDFS system instead of the local file system.</p>
      <p>The figures also show that the highest performance
gains are achieved using scalability of the physical
cores (up to 16 threads for the Power7 and 20 threads
for the Xeon). In addition, high performance gains
are achieved by running 2 threads per core, with the
Power7 achieving better performance gains as
compared to the Xeon. Further increase in the thread count
on the Power7 only achieves marginal gains. One
interesting remark is that on both machines, over-saturating
the CPUs by issuing more threads than the machine
is capable of causes Hadoop to lose performance
slightly, while Spark is still capable of a (marginal)
increase in performance.</p>
    </sec>
    <sec id="sec-5">
      <title>Experimental results</title>
      <p>In this section, we investigate the scalability
potential of BWA-MEM on big data systems as an example
of genomics data processing pipelines. We compare
its native implementation, developed for classical
clusters, to the performance of an Apache Spark
implementation as well as a Hadoop-based big data implementation. The
Hadoop version uses the Halvade scalable system with
a MapReduce implementation (Decap15). In this
evaluation, we use the same IBM PowerLinux 7R2 and
Intel Xeon servers we used for the WordCount example
above.</p>
      <p>The results for the BWA mapping are shown for
IBM Power7 and Intel Xeon in Figure 3 and 4,
respectively. The results are shown for 3 different versions of
BWA: 1. native, 2. Hadoop, and 3. Spark. Both the
Hadoop and Spark versions divide the input dataset of
short reads into a number of smaller files referred to as
chunks. For example, for these experiments, we had 32
input chunks. Halvade and Spark were run with 4
instances, which means that 4 simultaneous BWA tasks
were run in parallel on different data chunks. The
number of threads used in each experiment was varied by
controlling the number of threads issued by each BWA
instance (using the -t command line argument). For
example, to use 12 threads, we used 4 instances with
each instance having 3 threads, while to use 32 threads,
we used 4 instances with 8 threads each.</p>
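      <p>The thread-budget arithmetic above can be made explicit with a small helper; the function name is ours, but the division matches the experiments (4 instances, each passed its share via BWA's -t argument):</p>

```python
def threads_per_instance(total_threads, instances=4):
    """Split a total thread budget evenly over parallel BWA instances,
    i.e. the value passed to each instance's -t argument."""
    assert total_threads % instances == 0, "budget must divide evenly"
    return total_threads // instances

# e.g. 12 total threads -> -t 3 per instance; 32 total -> -t 8
```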
      <p>It also has to be mentioned here that there are a few
differences in how files are read in the Hadoop and
Spark versions. In the Hadoop version, the input zipped
files are extracted on the fly by the Hadoop
MapReduce framework and delivered to the mappers line by
line, where each mapper is executing one instance of
BWA. Each instance then processes the input data line
by line. That data is read right away by the BWA
instances using the stdin buffer. On the other hand, each
instance in the Spark version reads a zipped input file
and then decompresses it first. Afterwards, the
decompressed file is forwarded as an input to a BWA instance.
However, in both cases, the output is written to the
local file system. It is also important to mention here that
in the Power7 case, we wrote the output into a RAM
disk for all three BWA versions. This is because we
wanted to know how well BWA scales with
simultaneous threads without letting file I/O overshadow the
execution time.</p>
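      <p>The difference between the two input paths can be illustrated with Python's gzip module; the sample bytes below are a hypothetical stand-in for a zipped FASTQ chunk, and the snippet only mimics the two strategies rather than the frameworks' actual I/O code:</p>

```python
import gzip
import io

# hypothetical data standing in for one zipped FASTQ input chunk
raw = b"@read1\nACGT\n+\nIIII\n"
chunk = gzip.compress(raw)

# Hadoop-style: decompress on the fly, delivering the chunk line by line
streamed = []
with gzip.open(io.BytesIO(chunk), "rt") as f:
    for line in f:
        streamed.append(line.rstrip("\n"))

# Spark-version style: fully decompress the chunk first, then hand the
# decompressed text to the worker in one piece
decompressed = gzip.decompress(chunk).decode()
```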
      <p>On both the Power7 and Xeon machines, we can see
that simultaneous multithreading improves the
performance of BWA. This is because the BWA algorithm
usually suffers from a large number of memory stalls,
as a result of random memory accesses that render
the cache ineffective, causing a high cache-miss
rate. These cache misses can be masked by
simultaneous multithreading, since some threads can run while
others are stalled. The Power7 system is able to achieve
significant performance gains using its capability to
issue 4 simultaneous threads. Spark increases in
performance by up to 87% with 80 threads as compared
to 16 threads (# of physical cores).</p>
      <p>One interesting result in the figures is that while the
Hadoop version is faster by up to 17% at lower
thread counts, the Spark version becomes faster by
up to 27% at higher thread counts. This
behavior can be explained by the way the input chunk files
are handled. As mentioned before, the Hadoop
version uses on-the-fly decompression of the zipped input
chunk files, while the Spark version first decompresses
a zipped input chunk file before using it. This causes
an increased overhead that makes Spark run slower for
lower number of threads. As the number of threads
increases, Spark improves in performance and overtakes
Hadoop in execution time.</p>
      <p>Finally, the figures show that over-saturating the
thread capacity of the cores (i.e., issuing more threads
than the number of virtual cores available) causes the
performance of the native version and the Hadoop version
to degrade, while Spark is able to continue to improve
in performance. This behavior is similar to the one
observed in the WordCount example. Therefore, this
could be caused by the internal implementation of the
Spark and Hadoop systems themselves.</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusions</title>
      <p>This paper analyzed the scalability potential of the
widely used BWA-MEM DNA mapping algorithm.
The algorithm is embarrassingly parallel and can be
used as an example to identify the scalability potential
of embarrassingly parallel genomics applications using
the Spark and Hadoop big data frameworks. The paper
compared the performance of 3 BWA implementations:
1. a native cluster-based version, 2. a Hadoop version,
and 3. a Spark version. Results show that
simultaneous multithreading improves the performance of BWA
for all systems, increasing performance by up to 87%
for Spark on Power7 with 80 threads as compared to 16
threads (# of physical cores). The results also show that
while the Hadoop version is faster by up to 17% using
4 threads, the Spark version becomes faster by up to 27%
at higher thread counts. Finally, the results also
indicate that the Spark system is more capable of
handling higher numbers of threads, as it is able to continue
to reduce its run time when over-saturating the thread
capacity of the cores.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>D.</given-names>
            <surname>Decap</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Reumers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Herzeel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Costanza</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Fostier</surname>
          </string-name>
          , "
          <article-title>Halvade: scalable sequence analysis with MapReduce</article-title>
          ",
          <source>Bioinformatics</source>
          ,
          <fpage>btv179</fpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>E.J.</given-names>
            <surname>Houtgast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.-M.</given-names>
            <surname>Sima</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bertels</surname>
          </string-name>
          and
          <string-name>
            <given-names>Z.</given-names>
            <surname>Al-Ars</surname>
          </string-name>
          , "
          <article-title>An FPGA-Based Systolic Array to Accelerate the BWA-MEM Genomic Mapping Algorithm</article-title>
          "
          ,
          <source>International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>D.C.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.L.</given-names>
            <surname>Ruzzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Peng</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.G.</given-names>
            <surname>Katze</surname>
          </string-name>
          , "
          <article-title>Compression of next-generation sequencing reads aided by highly efficient de novo assembly</article-title>
          "
          ,
          <source>Nucleic Acids Research</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>B.J.</given-names>
            <surname>Kelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.R.</given-names>
            <surname>Fitch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.J.</given-names>
            <surname>Corsmeier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.N.</given-names>
            <surname>Wetzel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.D.</given-names>
            <surname>Nordquist</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.L.</given-names>
            <surname>Newsom</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>White</surname>
          </string-name>
          , "
          <article-title>Churchill: an ultra-fast, deterministic, highly scalable and balanced parallelization strategy for the discovery of human genetic variation in clinical and population-scale genomics</article-title>
          "
          ,
          <source>Genome Biology</source>
          , vol.
          <volume>16</volume>
          , no.
          <issue>6</issue>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          , "
          <article-title>Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM</article-title>
          ",
          <source>arXiv:1303.3997 [q-bio.GN]</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Handsaker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wysoker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Fennell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ruan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Homer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Marth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Abecasis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Durbin</surname>
          </string-name>
          , "
          <article-title>The Sequence Alignment/Map format and SAMtools</article-title>
          ",
          <source>Bioinformatics</source>
          , vol.
          <volume>25</volume>
          , no.
          <issue>16</issue>
          , pp.
          <fpage>2078</fpage>
          -
          <lpage>2079</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>S.</given-names>
            <surname>Pabinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dander</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fischer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Snajder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sperk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Efremova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Krabichler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.R.</given-names>
            <surname>Speicher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zschocke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Trajanoski</surname>
          </string-name>
          , "
          <article-title>A survey of tools for variant analysis of next-generation genome sequencing data</article-title>
          ",
          <source>Briefings in Bioinformatics</source>
          ,
          <fpage>bbs086</fpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>L.D.</given-names>
            <surname>Stein</surname>
          </string-name>
          , ”
          <article-title>The case for cloud computing in genome informatics”</article-title>
          ,
          <source>Genome Biology</source>
          , vol.
          <volume>11</volume>
          , no.
          <issue>207</issue>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Zaharia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chowdhury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>McCauley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.J.</given-names>
            <surname>Franklin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shenker</surname>
          </string-name>
          and
          <string-name>
            <given-names>I.</given-names>
            <surname>Stoica</surname>
          </string-name>
          , "
          <article-title>Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing</article-title>
          ",
          <source>NSDI</source>
          ,
          <year>April 2012</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>