<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Memory Efficient Processing of DNA Sequences in Relational Main-Memory Database Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sebastian Dorok</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Bayer Pharma AG</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Otto-von-Guericke-University Magdeburg, Institute for Technical and Business Information Systems</institution>
          ,
          <addr-line>Magdeburg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <fpage>39</fpage>
      <lpage>43</lpage>
      <abstract>
        <p>Pipeline-breaking operators such as aggregations or joins require database systems to materialize intermediate results. If large intermediate results exceed main-memory capacity, main-memory database systems experience a massive performance degradation due to paging or even abort queries. In our current research on efficiently analyzing DNA sequencing data using main-memory database systems, we often face the problem that main memory becomes scarce due to large intermediate results during hash-join and sort-based aggregation processing. Therefore, in this paper, we discuss alternative join and aggregation techniques suited for our use case and compare their memory requirements during processing. Moreover, we evaluate different combinations of these techniques with regard to overall execution runtime and scalability to increasing amounts of data. We show that a combination of invisible join and array-based aggregation increases memory efficiency, enabling us to query genome ranges that are one order of magnitude larger than using a hash join in combination with sort-based aggregation.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Recent research on efficiently querying DNA sequencing data using relational database systems focuses on using main-memory technology [
        <xref ref-type="bibr" rid="ref11 ref5 ref9">5, 9, 11</xref>
        ]. The goal is to benefit from
the increased analysis performance and to enable research
departments and scientists to perform analyses of DNA
sequencing data efficiently and in a scalable way in a relational
database management system. Such an approach would
eliminate several data management issues within genome
analysis such as result traceability and repeatability as data
can be analyzed from end-to-end in a database system [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>The size of a single genome data set varies between several hundred and several thousand megabytes. Besides varying data set sizes, the analysis target also varies between querying small genome regions such as genes and bulk processing complete genomes. With increasing data sizes, larger numbers of tuples have to be processed, limiting the applicability of main-memory database systems, as pipeline-breaking operators such as aggregations and joins require the database system to materialize intermediate results. Such result materialization incurs materialization costs and increases main-memory consumption within main-memory database systems. During analysis of DNA sequencing data, large numbers of tuples have to be processed due to low selectivity. Thus, it is quite common that intermediate result materialization exceeds available main memory, leading to a massive performance degradation of the system.</p>
      <p>
        Current work in the field of genome analysis using main-memory database systems does not address this scalability issue explicitly. Fähnrich et al. show that main-memory technology outperforms highly tuned analysis tools [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The authors examine the scalability of their approach with regard to multiple threads but not with regard to data size. Cijvat et al. use the main-memory database system MonetDB to process DNA sequencing data but do not provide insights into the scalability of their approach with regard to data size [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>In this work, we focus on a base-centric representation of DNA sequences and investigate different join and aggregation techniques with regard to their overall runtime and their scalability to increasingly large genome ranges.</p>
      <p>The remainder of the paper is structured as follows. In
Section 2, we provide basic information about genome
analysis, our used database schema and the executed query type.
In Section 3, we discuss different join and aggregation
processing techniques from the perspective of our use case. In
Section 4, we compare the different combinations of the
discussed processing techniques and evaluate their overall
runtime and scalability with regard to different data sizes to
process. In Section 5, we discuss related work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. BACKGROUND</title>
      <p>In this section, we explain the basic steps of genome analysis. Then, we explain how we can perform variant calling, a typical genome analysis task, using SQL.</p>
      <p>A basic genome analysis pipeline consists of two main steps: (1) DNA sequencing &amp; read mapping and (2) analysis of DNA sequencing data. We depict the two main steps in Figure 1 (DNA Sequencing, Read Mapping, Variant Calling). In the following, we briefly introduce both steps.</p>
      <sec id="sec-2-1">
        <title>2.1.1 DNA sequencing &amp; read mapping</title>
        <p>
          DNA sequencing makes genetic information stored in DNA
molecules digitally readable. To this end, DNA sequencers
convert DNA molecules into sequences of the characters A,
C, G, and T. Every character encodes one of the nucleobases that make up DNA molecules. As DNA sequencing is error-prone, every base is associated with an error probability that indicates how likely it is that the base is wrong [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. These character sequences are called reads; each read is associated with a so-called base call quality string encoding the error probabilities of all its bases in ASCII. DNA sequencing techniques are only able to generate small reads from a given DNA molecule [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. Thus, these small reads must be assembled
to reconstruct the original DNA. Common tools to
assemble reads are read mappers [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. Read mapping tools utilize known reference sequences of organisms and try to find the best matching position for every read. Read mappers have to deal with several difficulties such as deletions, insertions, and mismatches within reads. Therefore, read mappings are also associated with an error probability.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.1.2 Variant calling</title>
        <p>
          Usually, scientists are not interested in mapped reads but
in variations within certain ranges of the genome such as
genes or chromosomes. Genome positions that differ from a
given reference are called variants. Variants of single bases
are a special class of variants called single nucleotide
polymorphisms (SNPs). Detecting SNPs is of high interest
as these are known to trigger diseases such as cancer [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
The task of variant calling is to first decide on a genotype
that is determined by all bases covering a genome site. Then,
the genotype is compared against the reference base. Two
general approaches exist to compute the genotype [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. One idea is to count all bases appearing at a genome site and decide, based on their frequencies, which genotype is present. More sophisticated approaches incorporate the available quality information to compute the probabilities of possible genotypes. Overall, the computation of genotypes is an aggregation of the bases that are mapped to the same genome site.
        </p>
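        <p>As an illustration of the frequency-based approach, the following sketch (our own toy example; the threshold, function name, and heterozygous-call rule are illustrative and not taken from any specific variant caller) decides a genotype from the bases mapped to one genome site:</p>

```python
from collections import Counter

def call_genotype(bases, min_fraction=0.8):
    """Frequency-based genotype calling for one genome site.

    If one base dominates (at least min_fraction of all mapped bases),
    we call a homozygous genotype; otherwise the two most frequent
    bases form a heterozygous genotype. Threshold is illustrative.
    """
    counts = Counter(bases)
    total = sum(counts.values())
    (top, n), *rest = counts.most_common(2)
    if n / total >= min_fraction or not rest:
        return top + top                      # homozygous, e.g. "AA"
    second = rest[0][0]
    return "".join(sorted(top + second))      # heterozygous, e.g. "AG"

# Bases mapped to one genome site by several overlapping reads:
print(call_genotype(["A", "A", "A", "A", "G"]))  # AA
print(call_genotype(["A", "G", "A", "G", "G"]))  # AG
```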
      </sec>
    </sec>
    <sec id="sec-3">
      <title>2.2 Variant calling in RDBMS</title>
      <p>In this section, we briefly introduce the base-centric storage database schema to store DNA sequencing data in the form of mapped reads. Moreover, we explain how we can integrate variant calling, in particular SNP calling, into a relational database system via SQL to reveal the critical parts of the query.</p>
      <p>Figure 2: The base-centric database schema with tables Reference (RG_ID OID, RG_NAME VARCHAR), Contig (C_ID OID, C_NAME VARCHAR, C_RG_ID OID), Reference_Base (RB_ID OID, RB_BASE_VALUE CHAR, RB_POSITION INT, RB_C_ID OID), Sample (SG_ID OID, SG_NAME VARCHAR), Read (R_ID OID, R_QNAME VARCHAR, R_MAPQ INT, R_SG_ID OID), and Sample_Base (SB_ID OID, SB_BASE_VALUE CHAR, SB_BASE_QUALITY CHAR, SB_INSERT_OFFSET INT, SB_RB_ID OID, SB_READ_ID OID). The annotations 1 and 2 mark the primary and secondary sort order: Reference_Base is sorted by RB_C_ID, then RB_POSITION; Sample_Base is sorted by SB_READ_ID, then SB_RB_ID.</p>
      <sec id="sec-3-1">
        <title>2.2.1 Database schema</title>
        <p>
          To represent mapped reads, we proposed the base-centric
database schema [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. It explicitly encodes the connection between single bases. Thus, it allows for direct access via SQL to all data that is required for base-centric analyses such as SNP calling. The schema consists of six tables. We depict
it in Figure 2. Every Reference consists of a set of regions
called Contigs. A contig can be a chromosome that consists
of single Reference_Bases. For example, chromosome 1 of a
human genome consists of nearly 250 million bases. A
Sample consists of several mapped Reads which consist of single
Sample_Bases. For example, the low-coverage genome of sample HG00096, provided by the 1000 Genomes Project, consists of nearly 14 billion bases1. Every Sample_Base is mapped to one Reference_Base, which is encoded using a foreign-key relationship.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>2.2.2 SNP calling via SQL</title>
        <p>The base-centric database schema enables direct access to every single base within a genome. We can use the SQL query in Listing 1 to compute the genotype using a user-defined aggregation function and finally call variants by comparing the aggregation result with the corresponding reference base. The query processing can be split into two separate phases:
1. Filter &amp; join. In the first phase, we select bases of interest that are not inserted, have high base call quality, and belong to reads with high mapping quality (cf. Lines 10–12) by joining the required tables (cf. Lines 6–8).
2. Aggregate &amp; filter. Finally, the genotype aggregation is performed using a user-defined aggregation function. Afterwards, the aggregation result is compared to the reference base to check whether a SNP is present (cf. Line 16).</p>
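        <p>To make the two phases tangible, the following self-contained sketch registers a toy genotype aggregate (a simple majority vote, unlike the quality-aware function used in the paper) in SQLite via Python and runs a heavily reduced version of Listing 1 on two tables of the base-centric schema. Table contents and the SQLite backend are our own illustration, not the CoGaDB setup:</p>

```python
import sqlite3
from collections import Counter

class Genotype:
    """Toy user-defined aggregate: the most frequent base at a site."""
    def __init__(self):
        self.bases = []
    def step(self, base):
        self.bases.append(base)
    def finalize(self):
        return Counter(self.bases).most_common(1)[0][0]

con = sqlite3.connect(":memory:")
con.create_aggregate("genotype", 1, Genotype)
con.executescript("""
  CREATE TABLE Reference_Base(RB_ID INT, RB_BASE_VALUE TEXT);
  CREATE TABLE Sample_Base(SB_RB_ID INT, SB_BASE_VALUE TEXT);
  INSERT INTO Reference_Base VALUES (0, 'A'), (1, 'C');
  INSERT INTO Sample_Base VALUES (0, 'A'), (0, 'A'), (1, 'T'), (1, 'T'), (1, 'C');
""")
# Phase 1 joins sample bases to reference bases; phase 2 aggregates
# per site and keeps sites whose genotype differs from the reference.
rows = con.execute("""
  SELECT RB_ID, RB_BASE_VALUE, genotype(SB_BASE_VALUE) AS gt
  FROM Reference_Base JOIN Sample_Base ON RB_ID = SB_RB_ID
  GROUP BY RB_ID, RB_BASE_VALUE
  HAVING RB_BASE_VALUE != genotype(SB_BASE_VALUE)
""").fetchall()
print(rows)  # site 1: bases T,T,C give genotype T, which differs from reference C
```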
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. MEMORY EFFICIENT SNP CALLING</title>
      <p>In Section 2, we explained that variant calling within a relational database system consists of two main phases: filter &amp; join and aggregate &amp; filter. In this section, we discuss different implementations to execute these phases of the query.
1 ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00096/</p>
      <p>1  select
2    C_NAME, RB_POSITION, RB_BASE_VALUE,
3    genotype(SB_BASE_VALUE) as genotype,
4    count(SB_BASE_VALUE) as coverage
5  from
6    Contig join Reference_Base on C_ID = RB_C_ID
7    join Sample_Base on RB_ID = SB_RB_ID
8    join Read on R_ID = SB_READ_ID
9  where
10   C_NAME = '?' and SB_INSERT_OFFSET = 0
11   and SB_BASE_CALL_QUALITY &gt; ?
12   and R_MAPQ &gt; ?
13 group by
14   C_NAME, RB_POSITION, RB_BASE_VALUE
15 having
16   RB_BASE_VALUE &lt;&gt; genotype and
17   coverage &gt; ? and
18   RB_BASE_VALUE &lt;&gt; 'N' and genotype &lt;&gt; 'X'
19 order by
20   C_NAME, RB_POSITION;
Listing 1: SNP calling via SQL. A custom aggregation function is used to compute the domain-specific aggregate genotype.</p>
      <p>In particular, we concentrate on the implementation of the join and the aggregation. Moreover, we assess the alternative strategies according to their memory consumption, which we want to reduce in order to improve the memory efficiency and, inherently, the scalability of queries to larger genome ranges.</p>
    </sec>
    <sec id="sec-5">
      <title>3.1 Filter &amp; join</title>
      <p>Within relational database systems, joins are one of the
most expensive operations. In this work, we use a hash join
as baseline and compare it with the invisible join technique.</p>
      <p>Hash join. The hash join approach consists of a build and a probe phase. The idea is to index the tuples of one relation using a hash table. Afterwards, the join pairs are computed by probing the tuples of the second relation. One source of high memory consumption is the hash tables created during join processing, especially if the joined tables are large. To reduce the table size, the query optimizer pushes down selections before joining tables; the filtered tables are then joined. Thus, before computing the join, all join columns are materialized. Consequently, predicates with low selectivity lead to high memory consumption.</p>
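      <p>A minimal sketch of the build and probe phases (our own illustration; relation and column names follow the paper's schema, but the dictionary-based rows and the quality threshold are purely for exposition):</p>

```python
def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Classic hash join: build a hash table on one input, probe with the other.

    The hash table holds every qualifying build-side tuple, which is the
    main source of memory consumption under low-selectivity predicates.
    """
    table = {}
    for row in build_rows:                       # build phase
        table.setdefault(row[build_key], []).append(row)
    result = []
    for row in probe_rows:                       # probe phase
        for match in table.get(row[probe_key], []):
            result.append({**match, **row})
    return result

reads = [{"R_ID": 0, "R_MAPQ": 60}, {"R_ID": 1, "R_MAPQ": 10}]
sample_bases = [{"SB_READ_ID": 0, "SB_BASE_VALUE": "A"},
                {"SB_READ_ID": 1, "SB_BASE_VALUE": "G"}]
# Selection push-down: filter Read before building the hash table.
high_quality = [r for r in reads if r["R_MAPQ"] >= 30]
print(hash_join(high_quality, sample_bases, "R_ID", "SB_READ_ID"))
```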
      <p>
        Invisible join. An alternative join technique is the
invisible join [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The basic idea is to use foreign-key relationships and positional lookups to compute matching join pairs. The technique was proposed in the context of processing data warehouse schemata such as the star or snowflake schema. The base-centric database schema for storing DNA sequencing data is a star schema. Every primary-key column starts at 0, and subsequent primary keys are incremented by one. Hence, a foreign key on such a primary-key column can be used as a row index into any column of the primary-key table, which can be implemented efficiently in column stores. Knowing the required tuples of the fact table, i.e., Sample_Base, we can simply gather all required data for downstream query processing using the foreign-key columns SB_READ_ID and SB_RB_ID.
      </p>
      <p>In order to make this join processing strategy efficient, we have to be able to apply between-predicate rewriting to express predicates on dimension tables as between predicates on the fact table. We can rewrite the predicate on C_NAME as a between predicate on SB_RB_ID, as we can guarantee that Reference_Bases are sorted by RB_C_ID and RB_POSITION. After applying the rewritten between predicate, we reduce the number of selected Sample_Base tuples further in a semi-join fashion. Finally, we compute the join starting at the fact table Sample_Base, gathering all corresponding tuples from the tables Read and Reference_Base. Thus, the invisible join keeps the memory footprint low, as only the tuple identifiers of table Sample_Base are materialized and pruned.</p>
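      <p>The positional-lookup idea can be sketched on hand-written toy columns; the concrete values, and the assumption that contig 0 spans reference rows 0 through 2, are ours:</p>

```python
# Columns stored as plain arrays; since primary keys start at 0 and are
# dense, a foreign-key value doubles as a row index into the referenced
# table. All values below are made up for illustration.
RB_C_ID  = [0, 0, 0, 1, 1]           # Reference_Base sorted by contig, position
RB_BASE  = ["A", "C", "G", "T", "A"]
SB_RB_ID = [0, 2, 2, 3, 4, 1]        # Sample_Base foreign keys into Reference_Base
SB_BASE  = ["A", "G", "T", "T", "C", "C"]

# Between-predicate rewriting: the predicate on C_NAME becomes a range
# predicate on SB_RB_ID, because Reference_Base is sorted by RB_C_ID.
# Here, contig 0 occupies reference rows 0 through 2.
lo, hi = 0, 2
selected = [i for i, rb in enumerate(SB_RB_ID) if rb in range(lo, hi + 1)]

# Positional lookup instead of hashing: gather reference values through
# the foreign key; only Sample_Base tuple identifiers are materialized.
pairs = [(SB_BASE[i], RB_BASE[SB_RB_ID[i]]) for i in selected]
print(pairs)  # [('A', 'A'), ('G', 'G'), ('T', 'G'), ('C', 'C')]
```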
      <p>Based on this qualitative discussion, we expect that the invisible join has less memory overhead than a hash join. Moreover, the invisible join avoids building and probing hash tables, which should lead to further performance improvements.</p>
    </sec>
    <sec id="sec-5-2">
      <title>3.2 Aggregate &amp; filter</title>
      <p>Two general approaches for computing aggregates are sort-based and hash-based techniques. In this section, we compare a sort-based approach with a special kind of hash-based aggregation called array-based aggregation.</p>
      <p>Sort-based aggregation. Considering the query in Listing 1, the grouping clause is a superset of the order-by clause. Thus, it seems beneficial to perform a sort-based aggregation, as it directly produces the required sorting of the result. Analyzing functional dependencies, we can reduce the set of group-by columns to the single column RB_ID. Thus, we can replace the group-by columns C_NAME, RB_POSITION, and RB_BASE_VALUE in the SNP calling query with RB_ID. For that reason, we only have to sort a single column. Nevertheless, we have to reorder all columns that are aggregated. This leads to materialization overhead and additional memory consumption, as the sorted and unsorted columns must be kept in memory.</p>
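      <p>A sketch of sort-based grouping on the single key column (our own simplification; a real operator would reorder every aggregated column by the same sort permutation, which is exactly the memory overhead discussed above):</p>

```python
def sort_based_aggregate(keys, values):
    """Sort-based grouping on a single key column.

    Both the key column and every aggregated column must be reordered
    by the sort permutation, so sorted and unsorted copies coexist in
    memory while the operator runs.
    """
    order = sorted(range(len(keys)), key=keys.__getitem__)  # sort permutation
    groups, current, bucket = [], None, []
    for i in order:                       # scan in sorted order, cut on key change
        if keys[i] != current and bucket:
            groups.append((current, bucket))
            bucket = []
        current = keys[i]
        bucket.append(values[i])
    if bucket:
        groups.append((current, bucket))
    return groups

rb_ids = [1, 0, 1, 0]
bases  = ["C", "A", "T", "A"]
print(sort_based_aggregate(rb_ids, bases))  # [(0, ['A', 'A']), (1, ['C', 'T'])]
```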
      <p>
        Array-based aggregation. Knowing that the grouping
is done via column RB_ID, we can apply array-based
aggregation based on the idea by Krikellas et al. using mapping
directories for grouping attributes to compute offsets within
a grouping array [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. As RB_ID is a primary key, we can use it directly as the array index for our aggregation, i.e., we have a perfect hash function. Thus, we do not need mapping directories at all. Furthermore, we can guarantee that the RB_ID column reflects the ordering given in the order-by clause; thus, we inherently sort the data as required while we aggregate it. In case the selected RB_IDs do not start at 0, we can determine the smallest RB_ID and use it as an offset to fill the array starting at the first index. Thus, the additional memory consumption of this approach is determined by the size of the array keeping the intermediate aggregates.
      </p>
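      <p>The offset trick and the implicit ordering can be sketched as follows (our own toy data; a real operator would keep running aggregates per slot instead of full base lists):</p>

```python
def array_based_aggregate(rb_ids, bases):
    """Array-based aggregation with RB_ID as a perfect hash.

    The smallest selected RB_ID is used as an offset so the array starts
    at index 0; each slot collects the bases of one genome site, already
    in RB_ID order, so no separate sort is needed.
    """
    offset = min(rb_ids)
    size = max(rb_ids) - offset + 1      # memory cost: one slot per genome site
    slots = [[] for _ in range(size)]
    for rb, base in zip(rb_ids, bases):
        slots[rb - offset].append(base)  # perfect hash: direct array indexing
    return [(offset + i, s) for i, s in enumerate(slots) if s]

rb_ids = [1001, 1000, 1001, 1000]
bases  = ["C", "A", "T", "A"]
print(array_based_aggregate(rb_ids, bases))  # [(1000, ['A', 'A']), (1001, ['C', 'T'])]
```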
      <p>Based on this qualitative discussion, we expect that the array-based aggregation has less memory overhead than a sort-based aggregation. Moreover, we expect that avoiding the sort saves additional runtime.</p>
    </sec>
    <sec id="sec-6">
      <title>3.3 Summary</title>
      <p>For each of the two phases, we presented two different techniques and discussed their properties regarding main-memory consumption. Combining them yields four different implementation stacks that execute the same query. Based on our discussion, we expect that the invisible join in combination with array-based aggregation will require less memory than all other combinations.</p>
    </sec>
    <sec id="sec-7">
      <title>4. EVALUATION</title>
      <p>In this section, we evaluate the resulting four combinations of join and aggregation strategies regarding their main-memory efficiency, which we want to improve by reducing memory consumption. We expect that reducing the memory consumption, and therefore the materialization overhead, will also improve runtime performance. To this end, we measure the runtime when calling SNPs on a complete genome.</p>
    </sec>
    <sec id="sec-8">
      <title>4.1 Experimental setup</title>
      <p>
        As evaluation platform, we use a machine with two Intel Xeon E5-2609 v2 CPUs (four cores each at 2.5 GHz) and 256 GB of main memory. On the software side, we use Ubuntu 14.04.3 (64 bit) as the operating system. Before starting the experiments, we pre-load the database into main memory. As evaluation database system, we use CoGaDB [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], a column-oriented main-memory database system2.
      </p>
      <p>
        For our experiments, we use human genome data sets of sample HG00096 from the 1000 Genomes Project [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ].
      </p>
    </sec>
    <sec id="sec-9">
      <title>4.2 SNP calling runtime</title>
      <p>In our first experiment, we call SNPs on a complete human genome that consists of ca. 14 billion sample bases. We perform the SNP calling on single chromosomes in order not to run out of memory, as we are interested in the overall runtime of the approaches. We report the runtimes in Figure 3.</p>
      <p>We observe that the array-based aggregation approach always reduces the runtime, independent of the join approach. Using array-based aggregation, we avoid expensive sorting and reordering of columns, as we can compute the required order on-the-fly (cf. Section 3.2). In combination with the invisible join, we can reduce the overall runtime of the join and aggregation phases by a factor of three compared to the slowest implementation using a hash join and sort-based aggregation. The runtime reduction can be explained by less overhead when computing the join as well as less materialization of intermediate results. When using the hash join implementation, all join columns are materialized before the join. Additionally, the sort-based aggregation requires sorting of the grouping column and reordering of the aggregation columns.
2 http://cogadb.cs.tu-dortmund.de/wordpress/</p>
      <p>Figure 4: Runtime of the four implementation combinations (Invisible &amp; Array, Hash &amp; Array, Invisible &amp; Sort, Hash &amp; Sort) for genome range sizes from 10^6 to 10^8 genome sites.</p>
    </sec>
    <sec id="sec-9-2">
      <title>4.3 Memory efficiency</title>
      <p>In our second experiment, we investigate the memory efficiency of all four implementation combinations by calling SNPs on a high-coverage chromosome 1. This data set
consists only of chromosome 1 and contains ca. 11 billion sample
bases. Thus, per genome site ca. 50 bases have to be
aggregated. Using this data set, we can scale our query range from
1,000,000 genome sites up to 100,000,000. In case memory
becomes scarce, the system interrupts the query execution.
Thus, we can indirectly assess the memory efficiency of
every implementation combination. We report the results of
this experiment in Figure 4.</p>
      <p>In accordance with the first experiment, the invisible join implementations are faster than the hash join implementations. On smaller genome ranges, the runtime difference is up to one order of magnitude. The invisible join benefits more from the higher selectivity given when querying smaller genome regions. Between-predicate rewriting allows us to prune the fact table Sample_Base drastically, leading to less effort for gathering the join pairs.</p>
      <p>Considering the scalability of the four approaches, we observe that using the invisible join, we can query genome ranges that are one order of magnitude larger than using a hash join implementation at nearly the same speed. This confirms our expectations from Section 3 regarding the memory efficiency of the invisible join. Additionally, the array-based aggregation allows us to query slightly larger genome ranges than the sort-based aggregation before we run out of memory. Overall, the impact of the aggregation technique on memory efficiency is smaller than that of the join technique: if a hash join is used, it always dominates memory consumption.</p>
    </sec>
    <sec id="sec-10">
      <title>4.4 Summary</title>
      <p>Our evaluation shows that, for our use case, the invisible join technique in combination with array-based aggregation provides the best memory efficiency and runtime performance. Moreover, we observe a correlation between scalability to larger genome ranges and processing runtime.</p>
    </sec>
    <sec id="sec-11">
      <title>5. RELATED WORK</title>
      <p>
        Within our work, we focus on using main-memory database
systems for genome analysis use cases. In previous work,
we mainly concentrated on expressing analysis tasks [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and
showing the potential of such an approach [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ]. Within this work, we analyze the problem of memory scarcity during processing, which limits the scalability of our approach.
      </p>
      <p>
        We are aware of two other approaches that explicitly use main-memory database techniques to speed up genome analysis tasks. Fähnrich et al. present an approach to integrate SNP calling using a map-reduce-like processing scheme [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
In contrast, our approach only uses relational database operators. Moreover, within their evaluation, Fähnrich et al. use two computing nodes, each with 1 TB of main memory, to process a data set with ca. 17 billion sample bases. Thus, the problem of memory scarcity during processing is unlikely to appear. Cijvat et al. use MonetDB to analyze Ebola data sets. Within their evaluation, they evaluate the runtime of different analysis queries [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Our work complements these
approaches, in particular the MonetDB approach.
      </p>
      <p>
        State-of-the-art approaches to perform SNP calling use
specialized analysis tools that operate on flat files such as
GATK [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and samtools [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. These tools are designed to operate efficiently in disk-based systems on machines with small main memory. Therefore, both tools rely on input data that is sorted by genomic region, which is the grouping predicate for the aggregation. Thus, when reading data from disk, the tools can decide early when all data for a genomic site has been read, compute the aggregate, and write the result back to disk. Such an approach is not applicable in a main-memory-only setting, as data and results reside in memory. Moreover, in our approach, we have to perform the join computation due to schema normalization before we can aggregate the data.
      </p>
    </sec>
    <sec id="sec-12">
      <title>6. CONCLUSION &amp; FUTURE WORK</title>
      <p>In this paper, we discuss different techniques for join and aggregation processing in the context of analyzing DNA sequencing data using main-memory database systems. Choosing the optimal implementation combination increases main-memory efficiency significantly and hence allows for querying larger genome ranges or data sets.</p>
      <p>
        Future work. We are aware of improvements regarding
memory efficiency of in-memory hash joins [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ]. Therefore,
in future work, it might be interesting to use such advanced
in-memory hash join techniques. Moreover, we have not
investigated the behaviour of sort-merge joins within this
work. A second direction to increase the scalability of genome analysis tasks in main-memory database systems is to reduce the size of the primary database using lightweight compression, which frees main memory for processing. Finally, partitioning strategies have to be investigated.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Abadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Madden</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Hachem</surname>
          </string-name>
          .
          <article-title>Column-stores vs. row-stores: How different are they really?</article-title>
          . In SIGMOD, pages
          <fpage>967</fpage>
          -
          <lpage>980</lpage>
          . ACM,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Barber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lohman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Pandis</surname>
          </string-name>
          , et al.
          <article-title>Memory-efficient hash joins</article-title>
          .
          <source>Proc. VLDB Endow</source>
          .,
          <volume>8</volume>
          (
          <issue>4</issue>
          ):
          <fpage>353</fpage>
          -
          <lpage>364</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Begley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>He</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.-P. P.</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <article-title>MCJoin: A memory-constrained join for column-store main-memory databases</article-title>
          .
          <source>In SIGMOD</source>
          , pages
          <fpage>121</fpage>
          -
          <lpage>132</lpage>
          , New York, NY, USA,
          <year>2012</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Breß</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Funke</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Teubner</surname>
          </string-name>
          .
          <article-title>Robust query processing in co-processor-accelerated databases</article-title>
          .
          <source>In SIGMOD</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Cijvat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Manegold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kersten</surname>
          </string-name>
          , et al.
          <article-title>Genome sequence analysis with MonetDB</article-title>
          .
          <source>Datenbank-Spektrum</source>
          ,
          <volume>6</volume>
          (
          <issue>17</issue>
          ),
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M. A.</given-names>
            <surname>DePristo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Banks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Poplin</surname>
          </string-name>
          , et al.
          <article-title>A framework for variation discovery and genotyping using next-generation DNA sequencing data</article-title>
          .
          <source>Nat Genet</source>
          ,
          <volume>43</volume>
          (
          <issue>5</issue>
          ):
          <fpage>491</fpage>
          -
          <lpage>498</lpage>
          , May
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Dorok</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Breß</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Läpple</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Saake</surname>
          </string-name>
          .
          <article-title>Toward efficient and reliable genome analysis using main-memory database systems</article-title>
          .
          <source>In SSDBM</source>
          , pages
          <fpage>34:1</fpage>
          -
          <lpage>34:4</lpage>
          . ACM,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Dorok</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Breß</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Saake</surname>
          </string-name>
          .
          <article-title>Toward efficient variant calling inside main-memory database systems</article-title>
          .
          <source>In BIOKDD-DEXA</source>
          . IEEE,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Dorok</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Breß</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Teubner</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Saake</surname>
          </string-name>
          .
          <article-title>Flexible analysis of plant genomes in a database management system</article-title>
          .
          <source>In EDBT</source>
          , pages
          <fpage>509</fpage>
          -
          <lpage>512</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ewing</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Green</surname>
          </string-name>
          .
          <article-title>Base-calling of automated sequencer traces using phred. II. Error probabilities</article-title>
          .
          <source>Genome Research</source>
          ,
          <volume>8</volume>
          (
          <issue>3</issue>
          ):
          <fpage>186</fpage>
          -
          <lpage>194</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>C.</given-names>
            <surname>Fähnrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schapranow</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Plattner</surname>
          </string-name>
          .
          <article-title>Facing the genome data deluge: efficiently identifying genetic variants with in-memory database technology</article-title>
          .
          <source>In SAC</source>
          , pages
          <fpage>18</fpage>
          -
          <lpage>25</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>K.</given-names>
            <surname>Krikellas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Viglas</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Cintra</surname>
          </string-name>
          .
          <article-title>Generating code for holistic query evaluation</article-title>
          .
          <source>In ICDE</source>
          , pages
          <fpage>613</fpage>
          -
          <lpage>624</lpage>
          . IEEE,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Handsaker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wysoker</surname>
          </string-name>
          , et al.
          <article-title>The Sequence Alignment/Map format and SAMtools</article-title>
          .
          <source>Bioinformatics</source>
          ,
          <volume>25</volume>
          (
          <issue>16</issue>
          ):
          <fpage>2078</fpage>
          -
          <lpage>2079</lpage>
          , Aug.
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          and
          <string-name>
            <given-names>N.</given-names>
            <surname>Homer</surname>
          </string-name>
          .
          <article-title>A survey of sequence alignment algorithms for next-generation sequencing</article-title>
          .
          <source>Briefings in Bioinformatics</source>
          ,
          <volume>11</volume>
          (
          <issue>5</issue>
          ):
          <fpage>473</fpage>
          -
          <lpage>483</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>N.</given-names>
            <surname>Mavaddat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Peock</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Frost</surname>
          </string-name>
          , et al.
          <article-title>Cancer risks for BRCA1 and BRCA2 mutation carriers: Results from prospective analysis of EMBRACE</article-title>
          .
          <source>Journal of the National Cancer Institute</source>
          , pages
          <fpage>djt095</fpage>
          +, Apr.
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>R.</given-names>
            <surname>Nielsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Paul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Albrechtsen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y. S.</given-names>
            <surname>Song</surname>
          </string-name>
          .
          <article-title>Genotype and SNP calling from next-generation sequencing data</article-title>
          .
          <source>Nat. Rev. Genet.</source>
          ,
          <volume>12</volume>
          (
          <issue>6</issue>
          ):
          <fpage>443</fpage>
          -
          <lpage>51</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>M.</given-names>
            <surname>Quail</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Coupland</surname>
          </string-name>
          , et al.
          <article-title>A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers</article-title>
          .
          <source>BMC Genomics</source>
          ,
          <volume>13</volume>
          (
          <issue>1</issue>
          ):
          <fpage>341</fpage>
          +,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>The 1000 Genomes Project Consortium</string-name>
          .
          <article-title>A global reference for human genetic variation</article-title>
          .
          <source>Nature</source>
          ,
          <volume>526</volume>
          (
          <issue>7571</issue>
          ):
          <fpage>68</fpage>
          -
          <lpage>74</lpage>
          , Sept.
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>