<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Memory Efficient Processing of DNA Sequences in Relational Main-Memory Database Systems</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Sebastian</forename><surname>Dorok</surname></persName>
							<email>sebastian.dorok@ovgu.de</email>
							<affiliation key="aff0">
								<orgName type="institution">Otto-von-Guericke-University Magdeburg Institute for Technical and Business Information Systems</orgName>
								<address>
									<settlement>Magdeburg</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Memory Efficient Processing of DNA Sequences in Relational Main-Memory Database Systems</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">77B1888F10A21E712C6345F90F0136D4</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-19T18:01+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Database</term>
					<term>Genome Analysis</term>
					<term>Variant Calling</term>
					<term>Invisible Join</term>
					<term>Array-based Aggregation</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Pipeline-breaking operators such as aggregations or joins require database systems to materialize intermediate results.</p><p>If large intermediate results exceed main memory capacities, main-memory database systems experience massive performance degradation due to paging, or even abort queries.</p><p>In our current research on efficiently analyzing DNA sequencing data using main-memory database systems, we often face the problem that main memory becomes scarce due to large intermediate results during hash join and sort-based aggregation processing. Therefore, in this paper, we discuss alternative join and aggregation techniques suited for our use case and compare their memory requirements during processing. Moreover, we evaluate different combinations of these techniques with regard to overall execution runtime and scalability to increasing amounts of data. We show that a combination of invisible join and array-based aggregation increases memory efficiency, enabling queries over genome ranges that are one order of magnitude larger than with a hash join combined with sort-based aggregation.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>database management system. Such an approach would eliminate several data management issues within genome analysis, such as result traceability and repeatability, as data can be analyzed end-to-end in a database system <ref type="bibr" target="#b6">[7]</ref>.</p><p>The size of single genome data sets varies between several hundred and several thousand megabytes. Besides the data set size, the analysis target also varies, from querying small genome regions such as genes to bulk processing of complete genomes. With increasing data sizes, larger numbers of tuples have to be processed, limiting the applicability of main-memory database systems, as pipeline-breaking operators such as aggregations and joins require the database system to materialize intermediate results. Such materialization incurs costs and increases main memory consumption within main-memory database systems. During analysis of DNA sequencing data, large amounts of tuples have to be processed due to low selectivity. Thus, it is quite common that intermediate result materialization exceeds available main memory, leading to massive performance degradation.</p><p>Current work in the field of genome analysis using main-memory database systems does not address this scalability issue explicitly. <ref type="bibr">Fähnrich et al.</ref> show that main-memory technology outperforms highly tuned analysis tools <ref type="bibr" target="#b11">[11]</ref>. The authors examine the scalability of their approach with regard to multiple threads but not regarding data size. Cijvat et al.
use the main-memory database system MonetDB to process DNA sequencing data but do not provide insights into the scalability of their approach regarding data size <ref type="bibr" target="#b4">[5]</ref>.</p><p>In this work, we focus on a base-centric representation of DNA sequences and investigate different join and aggregation techniques with regard to their overall runtime and their scalability to increasing genome ranges.</p><p>The remainder of the paper is structured as follows. In Section 2, we provide basic information about genome analysis, our database schema, and the executed query type. In Section 3, we discuss different join and aggregation processing techniques from the perspective of our use case. In Section 4, we compare combinations of the discussed techniques and evaluate their overall runtime and scalability with regard to different data sizes. In Section 5, we discuss related work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">BACKGROUND</head><p>In this section, we explain the basic steps of genome analysis. Then, we explain how we can perform variant calling, a typical genome analysis task, using SQL.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Reference</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>DNA Sequencing</head><p>Read Mapping</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Reference</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Variant Calling</head><p>Figure <ref type="figure">1</ref>: The genome analysis process in brief. DNA sequencers generate reads that are mapped against a reference. Afterwards, scientists analyze DNA sequencing data, e.g. they search for variants.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Genome analysis</head><p>A basic genome analysis pipeline consists of two main steps: (1) DNA sequencing &amp; read mapping and (2) analysis of DNA sequencing data. In Figure <ref type="figure">1</ref>, we depict the two main steps. In the following, we briefly introduce both analysis steps.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.1">DNA sequencing &amp; read mapping</head><p>DNA sequencing makes genetic information stored in DNA molecules digitally readable. To this end, DNA sequencers convert DNA molecules into sequences of the characters A, C, G, and T. Every single character encodes one of the nucleobases making up DNA molecules. As DNA sequencing is error-prone, every base is associated with an error probability that indicates the probability that the base is wrong <ref type="bibr" target="#b9">[10]</ref>. The sequences of characters are called reads, each associated with a so-called base call quality string encoding the error probabilities of all bases in ASCII. DNA sequencing techniques are only able to generate small reads from a given DNA molecule <ref type="bibr" target="#b17">[17]</ref>. Thus, these small reads must be assembled to reconstruct the original DNA. Common tools to assemble reads are read mappers <ref type="bibr" target="#b14">[14]</ref>. Read mapping tools utilize known reference sequences of organisms and try to find the best matching position for every read. Read mappers have to deal with several difficulties such as deletions, insertions, and mismatches within reads. Therefore, read mappings are also associated with an error probability.</p></div>
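The ASCII encoding of per-base error probabilities can be illustrated with a small sketch; it assumes the common Phred+33 convention (the paper does not name a specific encoding, so the offset is an assumption):

```python
def phred_error_probability(qual_char, offset=33):
    """Decode one ASCII base call quality character into the probability
    that the base call is wrong (assumes the common Phred+33 convention)."""
    q = ord(qual_char) - offset    # Phred quality score Q
    return 10 ** (-q / 10)         # P(error) = 10^(-Q/10)

def read_error_probabilities(quality_string):
    # One error probability per base of the read.
    return [phred_error_probability(c) for c in quality_string]
```

For example, the character "I" (Phred score 40) corresponds to an error probability of 1 in 10,000.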
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.2">Variant calling</head><p>Usually, scientists are not interested in mapped reads themselves but in variations within certain ranges of the genome such as genes or chromosomes. Genome positions that differ from a given reference are called variants. Variants of single bases are a special class of variants called single nucleotide polymorphisms (SNPs). Detecting SNPs is of high interest as they are known to trigger diseases such as cancer <ref type="bibr" target="#b15">[15]</ref>. The task of variant calling is to first decide on a genotype, which is determined by all bases covering a genome site. Then, the genotype is compared against the reference base. Two general approaches exist to compute the genotype <ref type="bibr" target="#b16">[16]</ref>. One idea is to count all bases appearing at a genome site and decide based on the frequencies which genotype is present. More sophisticated approaches incorporate the available quality information to compute the probabilities of possible genotypes. Overall, the computation of genotypes is an aggregation of bases that are mapped to the same genome site.</p></div>
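A minimal frequency-based genotype call, in the spirit of the first approach above, might look as follows; the 0.8 homozygosity threshold and the function names are illustrative assumptions, not taken from the paper:

```python
from collections import Counter

def call_genotype(bases, hom_fraction=0.8):
    """Frequency-based genotype call for one genome site (illustrative;
    the 0.8 homozygosity threshold is an assumption, not from the paper)."""
    counts = Counter(bases)
    total = sum(counts.values())
    top = counts.most_common(2)
    base, n = top[0]
    if n / total >= hom_fraction or len(top) == 1:
        return base + base                      # homozygous genotype
    return "".join(sorted(base + top[1][0]))    # heterozygous genotype

def is_snp(genotype, reference_base):
    # A SNP is present if the genotype differs from the reference base.
    return any(b != reference_base for b in genotype)
```

Quality-aware callers would instead weight each base by its error probability when scoring candidate genotypes.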
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Variant calling in RDBMS</head><p>In this section, we briefly introduce the base-centric database schema to store DNA sequencing data in the form of mapped reads. Moreover, we explain how we can integrate variant calling, in particular SNP calling, into a relational database system via SQL, and reveal the critical parts of the query.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.1">Database schema</head><p>To represent mapped reads, we proposed the base-centric database schema <ref type="bibr" target="#b8">[9]</ref>. It explicitly encodes the connection between single bases. Thus, it allows for direct access to all data via SQL that is required for base-centric analyses such as SNP calling. The schema consists of 6 tables. We depict it in Figure <ref type="figure" target="#fig_0">2</ref>. Every Reference consists of a set of regions called Contigs. A contig can be a chromosome that consists of single Reference_Bases. For example, chromosome 1 of a human genome consists of nearly 250 million bases. A Sample consists of several mapped Reads which consist of single Sample_Bases. For example, the low coverage genome of the sample HG00096 provided by the 1000 genomes project consists of nearly 14 billion bases<ref type="foot" target="#foot_0">1</ref> . Every Sample_Base is mapped to one Reference_Base which is encoded using a foreign-key relationship.</p></div>
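The six-table schema can be sketched hypothetically in SQLite; the foreign-key columns SB_READ_ID and SB_RB_ID follow the text, while the remaining column names are assumptions for illustration:

```python
import sqlite3

# Hypothetical SQLite sketch of the base-centric schema; SB_READ_ID and
# SB_RB_ID follow the paper's text, other column names are assumed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Reference      (R_ID INTEGER PRIMARY KEY, R_NAME TEXT);
CREATE TABLE Contig         (C_ID INTEGER PRIMARY KEY, C_NAME TEXT,
                             C_R_ID INTEGER REFERENCES Reference(R_ID));
CREATE TABLE Reference_Base (RB_ID INTEGER PRIMARY KEY, RB_BASE_VALUE TEXT,
                             RB_POSITION INTEGER,
                             RB_C_ID INTEGER REFERENCES Contig(C_ID));
CREATE TABLE Sample         (S_ID INTEGER PRIMARY KEY, S_NAME TEXT);
CREATE TABLE "Read"         (READ_ID INTEGER PRIMARY KEY, MAPQ INTEGER,
                             READ_S_ID INTEGER REFERENCES Sample(S_ID));
CREATE TABLE Sample_Base    (SB_ID INTEGER PRIMARY KEY, SB_BASE_VALUE TEXT,
                             SB_BASE_CALL_QUALITY INTEGER,
                             SB_READ_ID INTEGER REFERENCES "Read"(READ_ID),
                             SB_RB_ID INTEGER REFERENCES Reference_Base(RB_ID));
""")
```

Every Sample_Base row points to its read via SB_READ_ID and to the reference position it is mapped to via SB_RB_ID, mirroring the foreign-key relationship described above.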
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.2">SNP calling via SQL</head><p>The base-centric database schema enables direct access to every single base within a genome. We can use the SQL query in Listing 1 to compute the genotype using a user-defined aggregation function and finally call variants by comparing the result of the aggregation with the corresponding reference base. The query processing can be split into two separate phases:</p><p>1. Filter &amp; join. In the first phase, we select bases of interest that are not inserted, have a high base call quality, and belong to reads with a high mapping quality (cf. Lines 10-12) by joining the required tables (cf. Lines 6-8).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Aggregate &amp; filter.</head><p>Finally, the genotype aggregation is performed using a user-defined aggregation function. Afterwards, the aggregation result is compared to the reference base to check whether a SNP is present or not (cf. Line 16).</p></div>
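Listing 1 itself is not reproduced in this extraction; the following Python string sketches a hypothetical reconstruction of the two-phase query shape described above. The GENOTYPE aggregate name, the SB_INSERT_OFFSET and MAPQ columns, and all thresholds are assumptions:

```python
# Hypothetical reconstruction of the SNP calling query's two-phase shape;
# GENOTYPE stands for the user-defined aggregation function, and the quality
# thresholds as well as SB_INSERT_OFFSET and MAPQ are assumed for illustration.
SNP_CALLING_QUERY = """
SELECT   C_NAME, RB_POSITION, RB_BASE_VALUE,
         GENOTYPE(SB_BASE_VALUE) AS GT
FROM     Sample_Base
JOIN     Read           ON SB_READ_ID = READ_ID        -- phase 1: filter & join
JOIN     Reference_Base ON SB_RB_ID   = RB_ID
JOIN     Contig         ON RB_C_ID    = C_ID
WHERE    C_NAME = '1'                                  -- queried genome region (assumed)
AND      SB_INSERT_OFFSET = 0                          -- base is not inserted
AND      SB_BASE_CALL_QUALITY >= 30                    -- high base call quality
AND      MAPQ >= 30                                    -- high mapping quality
GROUP BY C_NAME, RB_POSITION, RB_BASE_VALUE            -- phase 2: aggregate
HAVING   GENOTYPE(SB_BASE_VALUE) != RB_BASE_VALUE      -- & filter against reference
ORDER BY C_NAME, RB_POSITION
"""
```

The GROUP BY and HAVING clauses realize the aggregate &amp; filter phase, while the joins and the WHERE clause realize filter &amp; join.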
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">MEMORY EFFICIENT SNP CALLING</head><p>In Section 2, we explained that variant calling within a relational database system consists of two main phases: filter &amp; join and aggregate &amp; filter. In this section, we discuss different implementations of these phases. In particular, we concentrate on the implementation of the join and the aggregation. Moreover, we assess the alternative strategies according to their memory consumption, which we want to reduce to improve memory efficiency and, inherently, the scalability of queries to larger genome ranges.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Filter &amp; join</head><p>Within relational database systems, joins are among the most expensive operations. In this work, we use a hash join as baseline and compare it with the invisible join technique.</p><p>Hash join. The hash join consists of a build and a probe phase. The idea is to index the tuples of one relation using a hash table. Afterwards, the join pairs are computed by probing the tuples of the second relation. One source of high memory consumption is the hash tables created during join processing, especially if the joined tables are large. To reduce the table size, the query optimizer pushes down selections before joining tables, and the filtered tables are then joined. Still, all join columns are materialized before the join is computed. Consequently, predicates with low selectivity lead to high memory consumption.</p><p>Invisible join. An alternative join technique is the invisible join <ref type="bibr" target="#b0">[1]</ref>. The basic idea is to use foreign-key relationships and positional lookups to compute matching join pairs. The technique was proposed in the context of processing data warehouse schemata such as the star or snowflake schema. The base-centric database schema for storing DNA sequencing data is a star schema. Every primary key column starts at 0 and subsequent primary keys are incremented. Hence, a foreign key on such a primary key column can be used directly as a row index into any column of the primary key table, which can be implemented efficiently in column stores. Knowing the required tuples of the fact table, i.e., Sample_Base, we can simply gather all data required for downstream query processing using the foreign key columns SB_READ_ID and SB_RB_ID.</p><p>To make this join processing strategy efficient, we must be able to apply between-predicate rewriting to express predicates on dimension tables as between predicates on the fact table.
We can apply between-predicate rewriting to rewrite the predicate on C_NAME as a between predicate on SB_RB_ID, as we can guarantee that Reference_Bases are sorted by RB_C_ID and RB_POSITION. After applying the rewritten between predicate, we further reduce the number of selected Sample_Base tuples in a semi-join fashion. Finally, we compute the join starting at the fact table Sample_Base, gathering all corresponding tuples from tables Read and Reference_Base. Thus, the invisible join keeps the memory footprint low, as only the tuple identifiers of table Sample_Base are materialized and pruned.</p><p>Based on this qualitative discussion, we expect the invisible join to have less memory overhead than a hash join. Moreover, the invisible join avoids building and probing hash tables, which should lead to further performance improvements.</p></div>
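The positional-lookup idea behind the invisible join can be sketched on plain Python lists standing in for columns; the data values and the selected RB_ID range are made up, and this is a conceptual sketch rather than CoGaDB's implementation:

```python
# Sketch (not CoGaDB's actual implementation) of the invisible join idea on
# columnar arrays: foreign keys double as row indices into dimension columns.

rb_position   = [0, 1, 2, 3, 4, 5]       # Reference_Base; RB_ID is the array index
sb_rb_id      = [0, 0, 1, 1, 2, 4, 5]    # Sample_Base foreign keys (SB_RB_ID)
sb_base_value = ["A", "A", "C", "T", "G", "G", "T"]

# Between-predicate rewriting: a predicate on contig name and position range
# becomes a contiguous range of RB_IDs, checked directly on the fact table.
rb_id_range = range(1, 5)                # selected RB_IDs 1..4 (assumed query range)

# Semi-join style pruning: only fact-table row identifiers are materialized.
matching_rows = [i for i, fk in enumerate(sb_rb_id) if fk in rb_id_range]

# Positional lookups gather dimension values -- no hash table is built.
result = [(rb_position[sb_rb_id[i]], sb_base_value[i]) for i in matching_rows]
```

Only the list of matching fact-table row identifiers is materialized; dimension values are fetched by index on demand, which is what keeps the memory footprint low.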
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Aggregate &amp; filter</head><p>Two general approaches for computing aggregates are sort-based and hash-based techniques. In this section, we compare a sort-based approach with a special kind of hash-based aggregation called array-based aggregation.</p><p>Sort-based aggregation. Considering the query in Listing 1, the grouping clause is a superset of the order-by clause. Thus, it seems beneficial to perform a sort-based aggregation, as it directly produces the required ordering of the result. Analyzing functional dependencies, we can reduce the set of group-by columns to the single column RB_ID. Thus, we can replace the group-by columns C_NAME, RB_POSITION, and RB_BASE_VALUE in the SNP calling query with RB_ID. For that reason, we only have to sort a single column. Nevertheless, we have to reorder all columns that are aggregated. This leads to materialization overhead and additional memory consumption, as the sorted and unsorted columns must be kept in memory.</p><p>Array-based aggregation. Knowing that the grouping is done via column RB_ID, we can apply array-based aggregation based on the idea by Krikellas et al., who use mapping directories for grouping attributes to compute offsets within a grouping array <ref type="bibr" target="#b12">[12]</ref>. As RB_ID is a primary key, we can use it directly as array index for our aggregation, i.e., we have a perfect hash function and do not need mapping directories at all. Furthermore, we can guarantee that the RB_ID column reflects the ordering given in the order-by clause; thus, we inherently sort the data as required while we aggregate it. If the selected RB_IDs do not start at 0, we determine the smallest RB_ID and use it as an offset so that the array is filled starting at the first index.
Thus, the additional memory consumption of this approach is determined by the size of the array keeping the intermediate aggregates.</p><p>Based on this qualitative discussion, we expect the array-based aggregation to have less memory overhead than a sort-based aggregation. Moreover, we expect that avoiding sorting saves additional runtime.</p></div>
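The array-based aggregation can be sketched on illustrative data, using RB_ID minus the smallest selected RB_ID as a direct array index (a perfect hash); the values are made up for illustration:

```python
# Sketch of array-based aggregation on made-up column data: the dense primary
# key RB_ID acts as a perfect hash into the aggregation array, so the per-site
# base counts come out already ordered by RB_ID, with no sorting step.

sb_rb_id      = [3, 3, 4, 3, 5, 4]            # group key: mapped reference base
sb_base_value = ["A", "A", "C", "G", "T", "C"]

offset = min(sb_rb_id)                        # selected RB_IDs need not start at 0
groups = [{} for _ in range(max(sb_rb_id) - offset + 1)]

for rb, base in zip(sb_rb_id, sb_base_value):
    slot = groups[rb - offset]                # direct array index, no hash lookup
    slot[base] = slot.get(base, 0) + 1
```

The additional memory is exactly the aggregation array, matching the discussion above.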
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Summary</head><p>For each of the two phases, we presented two different techniques and discussed their properties regarding main-memory consumption. We can combine these techniques into four different implementation stacks that execute the same query. Based on our discussion, we expect that the invisible join in combination with array-based aggregation will require less memory than all other combinations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">EVALUATION</head><p>In this section, we evaluate the four resulting combinations of join and aggregation strategies regarding their main-memory efficiency, which we want to improve by reducing memory consumption. We expect that reducing memory consumption, and therefore materialization overhead, will improve runtime performance. To this end, we measure the runtime when calling SNPs on a complete genome.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Experimental setup</head><p>As evaluation platform, we use a machine with two Intel Xeon E5-2609 v2 CPUs with four cores @2.5 GHz each and 256 GB main memory. On the software side, we use Ubuntu 14.04.3 (64 bit) as operating system. Before starting the experiments, we pre-load the database into main memory. As evaluation database system, we use CoGaDB <ref type="bibr" target="#b3">[4]</ref>, a column-oriented main-memory database system 2 .</p><p>For our experiments, we use human genome data sets of sample HG00096 from the 1000 Genomes Project <ref type="bibr" target="#b18">[18]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">SNP calling runtime</head><p>In our first experiment, we call SNPs on a complete human genome that consists of ca. 14 billion sample bases. We perform the SNP calling on single chromosomes in order not to run out of memory, as we are interested in the overall runtime of the approaches. We report the runtimes in Figure <ref type="figure" target="#fig_1">3</ref>.</p><p>We observe that array-based aggregation always reduces the runtime, independent of the join approach. Using array-based aggregation, we avoid expensive sorting and reordering of columns, as we can compute the required order on-the-fly (cf. Section 3.2). In combination with the invisible join, we reduce the overall runtime of the join and aggregation phases by a factor of three compared to the slowest implementation using a hash join and sort-based aggregation. The runtime reduction can be explained by less overhead when computing the join as well as less materialization of intermediate results. When using the hash join implementation, all join columns are materialized before the join. Additionally, the sort-based aggregation requires sorting of the grouping column and reordering of the aggregation columns.</p><p>2 http://cogadb.cs.tu-dortmund.de/wordpress/</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Memory efficiency</head><p>In our second experiment, we investigate the memory efficiency of all four implementation combinations by calling SNPs on a high coverage chromosome 1. This data set consists only of chromosome 1 and contains ca. 11 billion sample bases. Thus, ca. 50 bases have to be aggregated per genome site. Using this data set, we can scale our query range from 1,000,000 genome sites up to 100,000,000. In case memory becomes scarce, the system aborts the query execution. Thus, we can indirectly assess the memory efficiency of every implementation combination. We report the results of this experiment in Figure <ref type="figure" target="#fig_2">4</ref>.</p><p>In accordance with the first experiment, the invisible join implementations are faster than the hash join implementations. On smaller genome ranges, the runtime difference is up to one order of magnitude. The invisible join benefits from the higher selectivity of queries on smaller genome regions: between-predicate rewriting allows us to prune the fact table Sample_Base drastically, leading to less effort for gathering the join pairs.</p><p>Considering the scalability of the four approaches, we observe that using the invisible join, we can query genome ranges that are one order of magnitude larger than using a hash join implementation at nearly the same speed. This confirms our expectations from Section 3 regarding the memory efficiency of the invisible join. Additionally, the array-based aggregation allows us to query slightly larger genome ranges than the sort-based aggregation before we run out of memory. Overall, the impact of the aggregation technique on memory efficiency is smaller than that of the join technique: the choice of join implementation always dominates memory efficiency.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Summary</head><p>Our evaluation shows that, for our use case, the invisible join technique in combination with array-based aggregation provides the best memory efficiency and runtime performance. Moreover, we observe a correlation between scalability to larger genome ranges and processing runtime.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">RELATED WORK</head><p>Within our work, we focus on using main-memory database systems for genome analysis use cases. In previous work, we mainly concentrated on expressing analysis tasks <ref type="bibr" target="#b8">[9]</ref> and showing the potential of such an approach <ref type="bibr" target="#b6">[7,</ref><ref type="bibr" target="#b7">8]</ref>. Within this work, we analyze the problem of memory scarcity during processing, which limits the scalability of our approach.</p><p>We are aware of two other approaches that explicitly use main-memory database techniques to speed up genome analysis tasks. Fähnrich et al. present an approach to integrate SNP calling using a map-reduce-like processing scheme <ref type="bibr" target="#b11">[11]</ref>. In contrast, our approach only uses relational database operators. Moreover, within their evaluation, Fähnrich et al. use two computing nodes, each having 1 TB of main memory, for processing a data set with ca. 17 billion sample bases. Thus, the problem of memory scarcity during processing is unlikely to appear. Cijvat et al. use MonetDB to analyze Ebola data sets and evaluate the runtime of different analysis queries <ref type="bibr" target="#b4">[5]</ref>. Our work complements these approaches, in particular the MonetDB approach.</p><p>State-of-the-art approaches to perform SNP calling use specialized analysis tools that operate on flat files, such as GATK <ref type="bibr" target="#b5">[6]</ref> and samtools <ref type="bibr" target="#b13">[13]</ref>. These tools are designed to operate efficiently in disk-based systems on machines with small main memory. Therefore, both tools rely on input data that is sorted by genomic region, which is the grouping predicate for the aggregation. Thus, when reading data from disk, the tools can decide early when all data for a genomic site has been read, compute the aggregate, and write the result back to disk.
Such an approach is not applicable in a main-memory-only setting, as data and results reside in memory. Moreover, in our approach, we have to perform the join computation required by schema normalization before we can aggregate the data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">CONCLUSION &amp; FUTURE WORK</head><p>In this paper, we discuss different techniques for join and aggregation processing in the context of analyzing DNA sequencing data using main-memory database systems. Choosing the optimal implementation combination increases main-memory efficiency significantly and, hence, allows for querying larger genome ranges or data sets.</p><p>Future work. We are aware of improvements regarding the memory efficiency of in-memory hash joins <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b2">3]</ref>. Therefore, in future work, it might be interesting to use such advanced in-memory hash join techniques. Moreover, we have not investigated the behaviour of sort-merge joins within this work. A second direction to increase the scalability of genome analysis tasks in main-memory database systems is to reduce the size of the primary database using lightweight compression, which frees main memory for processing. Finally, partitioning strategies have to be investigated.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: The base-centric database schema explicitly encodes every single base of a read and genome region (contig), allowing for direct access via SQL.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Breakdown of runtime performance of SNP calling over complete low coverage human genome data set. Invisible join in combination with array-based aggregation outperforms the baseline by factor 3.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Scalability of different implementation combinations when calling SNPs on chromosome 1 of a high coverage human genome data set. Using array-based aggregation and invisible join, we can query larger genome ranges.</figDesc></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/ data/HG00096/</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Column-stores vs. row-stores: How different are they really?</title>
		<author>
			<persName><forename type="first">D</forename><surname>Abadi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Madden</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Hachem</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">SIGMOD</title>
				<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="967" to="980" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Memory-efficient hash joins</title>
		<author>
			<persName><forename type="first">R</forename><surname>Barber</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Lohman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Pandis</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. VLDB Endow</title>
				<meeting>VLDB Endow</meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="page" from="353" to="364" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Mcjoin: A memory-constrained join for column-store main-memory databases</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">K</forename><surname>Begley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y.-P</forename><forename type="middle">P</forename><surname>Chen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">SIGMOD</title>
				<meeting><address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="121" to="132" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Robust query processing in co-processor-accelerated databases</title>
		<author>
			<persName><forename type="first">S</forename><surname>Breß</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Funke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Teubner</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2016">2016</date>
			<publisher>SIGMOD</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Genome sequence analysis with monetdb</title>
		<author>
			<persName><forename type="first">R</forename><surname>Cijvat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Manegold</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kersten</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Datenbank-Spektrum</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="issue">17</biblScope>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">A framework for variation discovery and genotyping using next-generation DNA sequencing data</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Depristo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Banks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Poplin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Nat Genet</title>
		<imprint>
			<biblScope unit="volume">43</biblScope>
			<biblScope unit="issue">5</biblScope>
			<biblScope unit="page" from="491" to="498" />
			<date type="published" when="2011-05">May 2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Toward efficient and reliable genome analysis using main-memory database systems</title>
		<author>
			<persName><forename type="first">S</forename><surname>Dorok</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Breß</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Läpple</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Saake</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">SSDBM</title>
		<imprint>
			<biblScope unit="volume">34</biblScope>
			<biblScope unit="page">4</biblScope>
			<date type="published" when="2014">2014</date>
			<publisher>ACM</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Toward efficient variant calling inside main-memory database systems</title>
		<author>
			<persName><forename type="first">S</forename><surname>Dorok</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Breß</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Saake</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">BIOKDD-DEXA</title>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Flexible analysis of plant genomes in a database management system</title>
		<author>
			<persName><forename type="first">S</forename><surname>Dorok</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Breß</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Teubner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Saake</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">EDBT</title>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="509" to="512" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Base-calling of automated sequencer traces using phred. II. Error probabilities</title>
		<author>
			<persName><forename type="first">B</forename><surname>Ewing</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Green</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Genome Research</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="186" to="194" />
			<date type="published" when="1998">1998</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Facing the genome data deluge: efficiently identifying genetic variants with in-memory database technology</title>
		<author>
			<persName><forename type="first">C</forename><surname>Fähnrich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Schapranow</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Plattner</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">SAC</title>
		<imprint>
			<biblScope unit="page" from="18" to="25" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Generating code for holistic query evaluation</title>
		<author>
			<persName><forename type="first">K</forename><surname>Krikellas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Viglas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Cintra</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICDE</title>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="613" to="624" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">The Sequence Alignment/Map format and SAMtools</title>
		<author>
			<persName><forename type="first">H</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Handsaker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Wysoker</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Bioinformatics</title>
		<imprint>
			<biblScope unit="volume">25</biblScope>
			<biblScope unit="issue">16</biblScope>
			<biblScope unit="page" from="2078" to="2079" />
			<date type="published" when="2009-08">Aug. 2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">A survey of sequence alignment algorithms for next-generation sequencing</title>
		<author>
			<persName><forename type="first">H</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Homer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Briefings in Bioinformatics</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="issue">5</biblScope>
			<biblScope unit="page" from="473" to="483" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Cancer risks for BRCA1 and BRCA2 mutation carriers: Results from prospective analysis of EMBRACE</title>
		<author>
			<persName><forename type="first">N</forename><surname>Mavaddat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Peock</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Frost</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of the National Cancer Institute</title>
		<imprint>
			<biblScope unit="page">95</biblScope>
			<date type="published" when="2013-04">Apr. 2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Genotype and SNP calling from next-generation sequencing data</title>
		<author>
			<persName><forename type="first">R</forename><surname>Nielsen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">S</forename><surname>Paul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Albrechtsen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">S</forename><surname>Song</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Nat. Rev. Genet</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page" from="443" to="451" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers</title>
		<author>
			<persName><forename type="first">M</forename><surname>Quail</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Smith</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Coupland</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">BMC Genomics</title>
		<imprint>
			<biblScope unit="volume">13</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page">341</biblScope>
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">A global reference for human genetic variation</title>
		<author>
			<orgName>The 1000 Genomes Project Consortium</orgName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Nature</title>
		<imprint>
			<biblScope unit="volume">526</biblScope>
			<biblScope unit="issue">7571</biblScope>
			<biblScope unit="page" from="68" to="74" />
			<date type="published" when="2015-09">Sept. 2015</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
