=Paper=
{{Paper
|id=Vol-1594/paper8
|storemode=property
|title=Memory Efficient Processing of DNA Sequences in Relational Main-Memory Database Systems
|pdfUrl=https://ceur-ws.org/Vol-1594/paper8.pdf
|volume=Vol-1594
|authors=Sebastian Dorok
|dblpUrl=https://dblp.org/rec/conf/gvd/Dorok16
}}
==Memory Efficient Processing of DNA Sequences in Relational Main-Memory Database Systems==
<pdf width="1500px">https://ceur-ws.org/Vol-1594/paper8.pdf</pdf>
<pre>
         Memory Efficient Processing of DNA Sequences in
           Relational Main-Memory Database Systems

                                                        Sebastian Dorok
                                                       Bayer Pharma AG
                                           Otto-von-Guericke-University Magdeburg
                                  Institute for Technical and Business Information Systems
                                                     Magdeburg, Germany
                                                 sebastian.dorok@ovgu.de

ABSTRACT                                                             database management system. Such an approach would
Pipeline breaking operators such as aggregations or joins re-        eliminate several data management issues within genome
quire database systems to materialize intermediate results.          analysis such as result traceability and repeatability as data
In case that the database system exceeds main memory                 can be analyzed from end-to-end in a database system [7].
capacities due to large intermediate results, main-memory               The size of single genome data sets varies between sev-
database systems experience a massive performance degra-             eral hundred to thousand megabytes. Besides varying data
dation due to paging or even abort queries.                          set size, also the analysis target varies between querying
   In our current research on efficiently analyzing DNA se-          small genome regions such as genes and bulk processing
quencing data using main-memory database systems, we of-             complete genomes. With increasing data sizes to analyze,
ten face the problem that main memory becomes scarce due             larger amounts of tuples have to be processed limiting the
to large intermediate results during hash join and sort-based        applicability of main-memory database systems as pipeline-
aggregation processing. Therefore, in this paper, we discuss         breaking operators such as aggregations and joins require the
alternative join and aggregation techniques suited for our           database system to materialize intermediate results. Such
use case and compare their characteristics regarding mem-            result materialization accompanies materialization costs and
ory requirements during processing. Moreover, we evalu-              increases main memory consumption within main-memory
ate different combinations of these techniques with regard           database systems. During analysis of DNA sequencing data,
to overall execution runtime and scalability to increasing           large amounts of tuples have to be processed due to low se-
amounts of data to process. We show that a combination of            lectivity. Thus, it is quite common that the intermediate
invisible join and array-based aggregation increases memory          result materialization exceeds available main memory lead-
efficiency enabling to query genome ranges that are one or-          ing to a massive performance degradation of the system.
der of magnitude larger than using a hash join counterpart              Current work in the field of genome analysis using main-
in combination with sort-based aggregation.                          memory database systems does not address this scalability
                                                                     issue explicitly. Fähnrich et al. show that main-memory
                                                                     technology outperforms highly tuned analysis tools [11]. The
General Terms                                                        authors examine the scalability of their approach with re-
Database, Genome Analysis                                            gard to multiple threads but not regarding data size. Civjat
                                                                     et al. use the main-memory database system MonetDB to
Keywords                                                             process DNA sequencing data but do not provide insights
                                                                     into the scalability of their approach regarding data size [5].
Variant Calling, Invisible Join, Array-based Aggregation                In this work, we focus on a base-centric representation
                                                                     of DNA sequences and investigate different join and aggre-
1.     INTRODUCTION                                                  gation techniques with regard to their overall runtime and
  Recent research on efficiently querying DNA sequencing             scalability to increasing genome ranges to query.
data using relational database systems focuses on using main-           The remainder of the paper is structured as follows. In
memory technology [5, 9, 11]. The goal is to benefit from            Section 2, we provide basic information about genome anal-
the increased analysis performance and to enable research            ysis, our used database schema and the executed query type.
departments and scientists to perform analyses of DNA se-            In Section 3, we discuss different join and aggregation pro-
quencing data efficiently and in a scalable way in a relational      cessing techniques from the perspective of our use case. In
                                                                     Section 4, we compare the different combinations of the dis-
                                                                     cussed processing techniques and evaluate their overall run-
                                                                     time and scalability with regard to different data sizes to
                                                                     process. In Section 5, we discuss related work.


                                                                     2.   BACKGROUND
  th                                                                   In this section, we explain the basic steps of genome anal-
28 GI-Workshop on Foundations of Databases (Grundlagen von Daten-
banken), 24.05.2016 - 27.05.2016, Nörten-Hardenberg, Germany.        ysis. Then, we explain how we can perform variant calling,
Copyright is held by the author/owner(s).                            a typical genome analysis task, using SQL.


                                                                39
         Memory efficient processing of DNA sequences in relational main-memory database systems


                          Reference               Reference
                                                                            Reference             Contig             Reference_Base
                                                                      RG_ID   OID           C_ID        OID       RB_ID           OID
                                                                      RG_NAME VARCHAR       C_NAME      VARCHAR   RB_BASE_VALUE   CHAR
                                                                                            C_RG_ID     OID       RB_POSITION 2   INT
                                                                                                                  RB_C_ID     1   OID

                                                                       sort order                                       Sample_Base
DNA Sequencing         Read Mapping            Variant Calling         primary 1                                  SB_ID               OID
                                                                       secondary 2                    Read        SB_BASE_VALUE       CHAR
                                                                                            R_ID        OID       SB_BASE_QUALITY     CHAR
Figure 1: The genome analysis process in brief.                              Sample         R_QNAME     VARCHAR   SB_INSERT_OFFSET    INT
                                                                      SG_ID   OID           R_MAPQ      INT       SB_RB_ID      2     OID
DNA sequencers generate reads that are mapped                         SG_NAME VARCHAR       R_SG_ID     OID       SB_READ_ID    1     OID
against a reference. Afterwards, scientists analyze
DNA sequencing data, e.g. they search for variants.
                                                                      Figure 2: The base-centric database schema explic-
                                                                      itly encodes every single bases of a read and genome
2.1     Genome analysis                                               region (contig) allowing for direct access via SQL.
  A basic genome analysis pipeline consists of two main
steps: (1) DNA sequencing & read mapping and (2) anal-
ysis of DNA sequencing data. In Figure 1, we depict the               form of mapped reads. Moreover, we explain how we can
two main steps. In the following, we briefly introduce both           integrate variant calling, in particular SNP calling, into a
analysis steps.                                                       relational database system via SQL to reveal the critical
                                                                      parts of the query.
2.1.1    DNA sequencing & read mapping                                2.2.1         Database schema
  DNA sequencing makes genetic information stored in DNA
                                                                         To represent mapped reads, we proposed the base-centric
molecules digitally readable. To this end, DNA sequencers
                                                                      database schema [9]. It explicitly encodes the connection
convert DNA molecules into sequences of the characters A,
                                                                      between single bases. Thus, it allows for direct access to all
C, G, and T. Every single character encodes a nucleobase
                                                                      data via SQL that is required for base-centric analyses such
making up DNA molecules. As DNA sequencing is error-
                                                                      as SNP calling. The schema consists of 6 tables. We depict
prone, every base is associated with an error probability that
                                                                      it in Figure 2. Every Reference consists of a set of regions
indicates the probability that the base is wrong [10]. The
                                                                      called Contigs. A contig can be a chromosome that consists
sequences of characters are called reads each associated with
                                                                      of single Reference_Bases. For example, chromosome 1 of a
a so called base call quality string encoding the error prob-
                                                                      human genome consists of nearly 250 million bases. A Sam-
abilities of all bases in ASCII. DNA sequencing techniques
                                                                      ple consists of several mapped Reads which consist of single
are only able to generate small reads from a given DNA
                                                                      Sample_Bases. For example, the low coverage genome of
molecule [17]. Thus, these small reads must be assembled
                                                                      the sample HG00096 provided by the 1000 genomes project
to reconstruct the original DNA. Common tools to assem-
                                                                      consists of nearly 14 billion bases1 . Every Sample_Base is
ble reads are read mappers [14]. Read mapping tools utilize
                                                                      mapped to one Reference_Base which is encoded using a
known reference sequences of organisms and try to find the
                                                                      foreign-key relationship.
best matching position for every read. Read mappers have
to deal with several difficulties such as deletions, insertions       2.2.2         SNP calling via SQL
and mismatches within reads. Therefore, read mappings are
                                                                        The base-centric database schema enables direct access
also associated with an error probability.
                                                                      to every single base within a genome. We can use the
2.1.2    Variant calling                                              SQL query in Listing 1 to compute the genotype using a
                                                                      user-defined-aggregation function and finally call variants by
   Usually, scientists are not interested in mapped reads but         comparing the result of the aggregation with the correspond-
in variations within certain ranges of the genome such as             ing reference base. The query processing can be split into
genes or chromosomes. Genome positions that differ from a             two separate phases:
given reference are called variants. Variants of single bases
are a special class of variants and called Single nucleotide          1. Filter & Join. In the first phase, we select bases of in-
polymorphisms (SNPs). Detecting SNPs is of high interest              terest that are not inserted, have high base call quality and
as these are known to trigger diseases such as cancer [15].           belong to reads with high mapping quality (cf. Lines 10–12)
The task of variant calling is to first decide on a genotype          by joining the required tables (cf. Lines 6–8).
that is determined by all bases covering a genome site. Then,         2. Aggregate & filter. Finally, the genotype aggregation
the genotype is compared against the reference base. Two              is performed using a user-defined aggregation function. Af-
general approaches exist to compute the genotype [16]. One            terwards, the aggregation result is compared to the reference
idea is to count all appearing bases at a genome site and de-         base to check whether a SNP is present or not (cf. Line 16).
cide based on the frequency which genotype is present. More
sophisticated approaches incorporate the available quality
information to compute the probabilities of possible geno-
                                                                      3.     MEMORY EFFICIENT SNP CALLING
types. Overall, the computation of genotypes is an aggrega-              In Section 2, we explained that variant calling within a re-
tion of bases that are mapped to the same genome site.                lational database system consists of two main phases: filter
                                                                      & join and aggregate & filter. In this section, we discuss dif-
2.2     Variant calling in RDBMS                                      ferent implementations to execute these phases of the query.
  In this section, we briefly introduce the base-centric stor-        1
                                                                        ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/
age database schema to store DNA sequencing data in the               data/HG00096/


                                                                 40
           Memory efficient processing of DNA sequences in relational main-memory database systems


 1   select                                                                    cates on the fact table. We can apply between-predicate-
 2          C NAME, RB POSITION , RB BASE VALUE ,                              rewriting to rewrite the predicate on C_NAME as between
 3           g e n o t y p e (SB BASE VALUE) a s g en o t y p e ,              predicate on SB_RB_ID as we can guarantee that Refer-
 4           c o u n t (SB BASE VALUE) a s c o v e r a g e                     ence_Bases are sorted by RB_C_ID and RB_POSITION. Af-
 5   from
 6           C o n t i g j o i n R e f e r e n c e B a s e on C ID = RB C ID
                                                                               ter applying the rewritten between predicate, we reduce the
 7           j o i n Sample Base on RB ID = SB RB ID                           number of selected Sample_Base tuples further in a semi-
 8           j o i n Read on R ID = SB READ ID                                 join fashion. Finally, we compute the join starting at the
 9   where                                                                     fact table Sample_Base gathering all corresponding tuples
10          C NAME = ’ ? ’ and SB INSERT OFFSET = 0                            from tables Read and Reference_Base. Thus, the invisible
11           and SB BASE CALL QUALITY > ?
12           and R MAPQ > ?
                                                                               join keeps the memory footprint low as only the tuple iden-
13   group by                                                                  tifiers of table Sample_Base are materialized and pruned.
14          C NAME, RB POSITION , RB BASE VALUE                                   Based on this qualitative discussion, we expect that the
15   having                                                                    invisible join has less memory overhead than using a hash
16          RB BASE VALUE <> g e n o t y p e and                               join. Moreover, the invisible join avoids building and prob-
17           c o v e r a g e > ? and
                                                                               ing hash tables, which should lead to further performance
18          RB BASE VALUE <> ’N ’ and g e n o t y p e <> ’X ’
19   o r d e r by                                                              improvements.
20          C NAME, RB POSITION ;
                                                                               3.2   Aggregate & filter
Listing 1: SNP calling via SQL. A custom aggrega-                                 Two general approaches for computing aggregates are sort-
tion function is used to compute the domain-specific                           based and hash-based techniques. In this section, we com-
aggregate genotype.                                                            pare a sort-based approach with a special kind of hash-based
                                                                               aggregation called array-based aggregation.
                                                                                  Sort-based aggregation. Considering the query in List-
                                                                               ing 1, the grouping clause is a superset of the order by
In particular, we concentrate on the implementation of the                     clause. Thus, it seems to be beneficial to perform a sort-
join and aggregation. Moreover, we assess the alternative                      based aggregation as it directly leads to the required sorting
strategies according to their memory consumption that we                       of the result. Analyzing functional dependencies, we can
want to reduce to improve the memory efficiency and inher-                     reduce the set of group-by columns to the single column
ently the scalability of queries to larger genome ranges.                      RB_ID. Thus, we can replace the group-by clause C_NAME,
                                                                               RB_POSITION and RB_BASE_VALUE in the SNP calling query
3.1     Filter & join                                                          with RB_ID. For that reason, we only have to sort a single
   Within relational database systems, joins are one of the                    column. Nevertheless, we have to reorder all columns that
most expensive operations. In this work, we use a hash join                    are aggregated. This leads to materialization overhead and
as baseline and compare it with the invisible join technique.                  additional memory consumption as the sorted and unsorted
   Hash join. The hash join approach consists of a build                       columns must be kept in memory.
and probe phase. The idea is to index the tuples of one                           Array-based aggregation. Knowing that the grouping
relation using a hash table. Afterwards, the join pairs are                    is done via column RB_ID, we can apply array-based aggre-
computed by probing the tuples of the second relation. One                     gation based on the idea by Krikellas et al. using mapping
source of high memory consumption are the hash tables cre-                     directories for grouping attributes to compute offsets within
ated during join processing, especially if the joined tables                   a grouping array [12]. As RB_ID is a primary key, we can
are large. To reduce the table size, the query optimizer                       also use it as array index for our aggregation, i.e. having a
pushes down selections before joining tables. Then, the fil-                   perfect hash function. Thus, we do not need mapping direc-
tered tables will be joined. Thus, before computing the join,                  tories at all. Furthermore, we can guarantee that the RB_ID
all join columns are materialized. Consequently, predicates                    column reflects the ordering given in the order-by clause,
with low selectivity lead to high memory consumption.                          thus, we inherently sort the data as required while we ag-
   Invisible join. An alternative join technique is the in-                    gregate it. In case, the selected RB_IDs do not start at 0, we
visible join [1]. The basic idea is to use foreign-key rela-                   can determine the smallest RB_ID and use it as offset to fill
tionships and positional lookups to compute matching join                      the array starting at the first index. Thus, the additional
pairs. The technique was proposed in the context of process-                   memory consumption of this approach is determined by the
ing data warehouse schemata such as the star- or snowflake                     size of the array for keeping the intermediate aggregates.
schema. The base-centric database schema for storing DNA                          Based on this qualitative discussion, we expect that the
sequencing data is a star schema. Every primary key col-                       array-based aggregation has less memory overhead than us-
umn starts at 0 and following primary keys are incremented.                    ing a sort-based aggregation. Moreover, we expect that
Hence, this setup enables using foreign keys on such primary                   avoiding to sort saves additional runtime.
key columns as row index to any column of the primary
key table which can be efficiently implemented in column                       3.3   Summary
stores. Knowing the required tuples of the fact table, i.e.                       For each of the two phases, we presented two different
Sample_Base, we can simply gather all required data for                        techniques and discussed their properties regarding main-
downstream query processing using the foreign key columns                      memory consumption. We can combine these four approaches
SB_READ_ID and SB_RB_ID.                                                       leading to four different implementation stacks to execute
   In order to make this join processing strategy efficient,                   the same query. Based on our discussion, we expect that
we have to be able to apply between-predicate rewriting to                     the invisible join in combination with array-based aggrega-
express predicates on dimension tables as between predi-                       tion will require less memory than all other combinations.


                                                                          41
                  Memory efficient processing of DNA sequences in relational main-memory database systems


                 60               Aggregate & filter       Filter & join                            Invisible & Array         Invisible & Sort
                                                                                              103
Runtime (min)


                                                                                                       Hash & Array              Hash & Sort


                                                                                Runtime (s)
                 40
                                                                                              102

                 20
                                                                                              101
                 0
                      Hash &     Hash &      Invisible &     Invisible &                            106                 107                      108
                       Sort      Array          Sort           Array
                                                                                                                Genome range size

Figure 3: Breakdown of runtime performance of
SNP calling over complete low coverage human                                    Figure 4: Scalability of different implementation
genome data set. Invisible join in combination with                             combinations when calling SNPs on chromosome 1
array-based aggregation outperforms the baseline by                             of a high coverage human genome data set. Using
factor 3.                                                                       array-based aggregation and invisible join, we can
                                                                                query larger genome ranges.

4.              EVALUATION
   In this section, we evaluate the resulting four different
                                                                                4.3           Memory efficiency
combinations of join and aggregation strategies regarding                          In our second experiment, we investigate the memory ef-
their main-memory efficiency which we want to improve by                        ficiency of all four implementation combinations by calling
reducing the memory consumption. We expect that reduc-                          SNPs on a high coverage chromosome 1. This dataset con-
ing the memory consumption and, therefore, the material-                        sists only of chromosome 1 and contains ca. 11 billion sample
ization overhead, the runtime performance will increase. To                     bases. Thus, per genome site ca. 50 bases have to be aggre-
this end, we measure the runtime when calling SNPs on a                         gated. Using this data set, we can scale our query range from
complete genome.                                                                1,000,000 genome sites up to 100,000,000. In case memory
                                                                                becomes scarce, the system interrupts the query execution.
4.1             Experimental setup                                              Thus, we can indirectly assess the memory efficiency of ev-
                                                                                ery implementation combination. We report the results of
   As evaluation platform, we use a machine with two In-
                                                                                this experiment in Figure 4.
tel Xeon CPU E5-2609 v2 with four cores @2.5 GHz and
                                                                                   In accordance to the first experiment, the invisible join
256 GB main memory. On the software side, we use Ubuntu
                                                                                implementations are faster than the hash join implementa-
14.04.3 (64 Bit) as operating system. Before starting the ex-
                                                                                tions. On smaller genome ranges the runtime difference is
periments, we pre-load the database into main memory. As
                                                                                up to one order of magnitude. The invisible join benefits
evaluation database system, we use CoGaDB [4], a column-
                                                                                more from higher selectivity given when querying smaller
oriented main-memory database system2 .
                                                                                genome regions. The between-predicate-rewriting allows to
   For our experiments, we use human genome data sets of
                                                                                prune the fact table Sample_Base drastically leading to less
sample HG00096 from the 1000 genome project [18].
                                                                                effort for gathering the join pairs.
                                                                                   Considering the scalability of the four approaches, we ob-
4.2             SNP calling runtime                                             serve that using the invisible join, we can query genome
   In our first experiment, we call SNPs on a complete hu-                      ranges that are one order magnitude larger than using a
man genome that consists of ca. 14 billion sample bases.                        hash join implementation at nearly the same speed. This
We perform the SNP calling on single chromosomes in or-                         confirms our expectations from Section 3 regarding memory
der to not run out of memory as we are interested in the                        efficiency of the invisible join. Additionally, the array-based
overall runtime of the approaches. We report the runtimes                       aggregation allows us to query slightly larger genome ranges
in Figure 3.                                                                    than using a sort-based aggregation until we run out of mem-
   We observe that the array-based aggregation approach al-                     ory. Overall, the impact of the used aggregation technique
ways reduces the runtime independent of the join approach.                      on memory efficiency is smaller than those of the used join
Using array-based aggregation, we avoid expensive sorting                       technique. The hash join implementation always dominates
and reordering of columns as we can compute the required                        memory efficiency.
order on-the-fly (cf. Section 3.2). In combination with the
invisible join, we can reduce the overall runtime of the join                   4.4           Summary
and aggregation phases by factor three compared to the                             Our evaluation shows that for our use case, the invisible
slowest implementation using a hash join and sort-based ag-                     join technique in combination with array-based aggregation
gregation. The runtime reduction can be explained by less                       provides best memory efficiency and runtime performance.
overhead when computing the join as well as less materi-                        Moreover, we observe a coherence between scalability to
alization of intermediate results. When using the hash join                     larger genome ranges and processing runtime.
implementation, all join columns are materialized before the
join. Additionally, the sort-based aggregation requires sort-
ing of the grouping column and reordering of aggregation                        5.            RELATED WORK
columns.                                                                          Within our work, we focus on using main-memory database
                                                                                systems for genome analysis use cases. In previous work,
2
    http://cogadb.cs.tu-dortmund.de/wordpress/                                  we mainly concentrated on expressing analysis tasks [9] and


                                                                           42
         Memory efficient processing of DNA sequences in relational main-memory database systems


showing the potential of such an approach [7, 8]. Within           [4] S. Breß, H. Funke, and J. Teubner. Robust query
this work, we analyze the problem of memory scarcity dur-              processing in co-processor-accelerated databases. In
ing processing limiting the scalability of our approach.               SIGMOD, 2016.
   We are aware of two other approaches that explicitly use        [5] R. Cijvat, S. Manegold, M. Kersten, et al. Genome
main-memory database techniques to speed up genome anal-               sequence analysis with monetdb. Datenbank-Spektrum,
ysis tasks. Fähnrich et al. present an approach to integrate          6(17), 2015.
SNP calling using a map-reduce like processing scheme [11].        [6] M. A. DePristo, E. Banks, R. Poplin, et al. A
In contrast, our approach only uses relational database oper-          framework for variation discovery and genotyping
ators. Moreover, within their evaluation, Fähnrich et al. use         using next-generation DNA sequencing data. Nat
two computing nodes each having 1 TB of main memory for                Genet, 43(5):491–498, May 2011.
processing a data set with ca. 17 billion sample bases. Thus,      [7] S. Dorok, S. Breß, H. Läpple, and G. Saake. Toward
the problem of memory scarcity during processing is unlikely           efficient and reliable genome analysis using
to appear. Cijvat et al. use MonetDB to analyze Ebola data             main-memory database systems. In SSDBM, pages
sets. Within their evaluation, they evaluate the runtime of            34:1–34:4. ACM, 2014.
different analysis queries [5]. Our work complements these         [8] S. Dorok, S. Breß, and G. Saake. Toward efficient
approaches, in particular the MonetDB approach.                        variant calling inside main-memory database systems.
   State-of-the-art approaches to perform SNP calling use              In BIOKDD-DEXA. IEEE, 2014.
specialized analysis tools that operate on flat files such as
                                                                   [9] S. Dorok, S. Breß, J. Teubner, and G. Saake. Flexible
GATK [6] and samtools [13]. These tools are designed to ef-
                                                                       analysis of plant genomes in a database management
ficiently operate in disk-based systems using machines with
                                                                       system. In EDBT, pages 509–512, 2015.
small main memory. Therefore, both tools rely on input
data that is sorted by genomic region which is the group-         [10] B. Ewing and P. Green. Base-calling of automated
ing predicate for the aggregation. Thus, when reading data             sequencer traces using phred. II. Error probabilities.
from disk, the tools can decide early when all data for a ge-          Genome Research, 8(3):186–194, 1998.
nomic site is read to compute the aggregate and write the         [11] C. Fähnrich, M. Schapranow, and H. Plattner. Facing
result back to disk. Such an approach is not applicable in             the genome data deluge: efficiently identifying genetic
a main-memory only approach as data and results reside in              variants with in-memory database technology. In SAC,
memory. Moreover, in our approach, we have to perform                  pages 18–25, 2015.
the join computation due to schema normalization before           [12] K. Krikellas, S. Viglas, and M. Cintra. Generating
we can aggregate the data.                                             code for holistic query evaluation. In ICDE, pages
                                                                       613–624. IEEE, 2010.
                                                                  [13] H. Li, B. Handsaker, A. Wysoker, et al. The sequence
6.   CONCLUSION & FUTURE WORK                                          alignment/map format and samtools. Bioinformatics,
   In this paper, we discuss different techniques for join and         25(16):2078–2079, Aug. 2009.
aggregation processing in the context of analyzing DNA se-        [14] H. Li and N. Homer. A survey of sequence alignment
quencing data using main-memory database systems. Choos-               algorithms for next-generation sequencing. Briefings
ing the optimal implementation combination increases main-             in Bioinformatics, 11(5):473–483, 2010.
memory efficiency significantly and hence, allows for query-      [15] N. Mavaddat, S. Peock, D. Frost, et al. Cancer risks
ing larger genome ranges or data sets.                                 for BRCA1 and BRCA2 mutation carriers: Results
   Future work. We are aware of improvements regarding                 from prospective analysis of EMBRACE. Journal of
memory efficiency of in-memory hash joins [2, 3]. Therefore,           the National Cancer Institute, pages djt095+, Apr.
in future work, it might be interesting to use such advanced           2013.
in-memory hash join techniques. Moreover, we have not             [16] R. Nielsen, J. S. Paul, A. Albrechtsen, and Y. S. Song.
investigated the behaviour of sort-merge joins within this             Genotype and SNP calling from next-generation
work. A second direction in order to increase the scalability          sequencing data. Nat. Rev. Genet., 12(6):443–51, 2011.
of genome analysis tasks in main-memory database systems          [17] M. Quail, M. Smith, P. Coupland, et al. A tale of three
is to reduce the size of the primary database using light-             next generation sequencing platforms: comparison of
weight compression, which frees main memory for process-               Ion Torrent, Pacific Biosciences and Illumina MiSeq
ing. Finally, partitioning strategies have to be investigated.         sequencers. BMC Genomics, 13(1):341+, 2012.
                                                                  [18] The 1000 Genomes Project Consortium. A global
                                                                       reference for human genetic variation. Nature,
7.   REFERENCES                                                        526(7571):68–74, Sept. 2015.

 [1] D. Abadi, S. Madden, and N. Hachem. Column-stores
     vs. row-stores: How different are they really? In
     SIGMOD, pages 967–980. ACM, 2008.
 [2] R. Barber, G. Lohman, I. Pandis, et al.
     Memory-efficient hash joins. Proc. VLDB Endow.,
     8(4):353–364, 2014.
 [3] S. K. Begley, Z. He, and Y.-P. P. Chen. Mcjoin: A
     memory-constrained join for column-store
     main-memory databases. In SIGMOD, pages 121–132,
     New York, NY, USA, 2012. ACM.


                                                             43

</pre>