<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Model-Driven Integration of Compression Algorithms in Column-Store Database Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Juliana Hildebrandt</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dirk Habich</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wolfgang Lehner</string-name>
<email>wolfgang.lehner@tu-dresden.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>Technische Universität Dresden, Database Systems Group, WWW home page: https://wwwdb.inf.tu-dresden.de</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Modern database systems are often in the position to store their entire data in main memory. Aside from increased main memory capacities, a further driver for in-memory database systems was the shift to a decomposition storage model in combination with lightweight data compression algorithms. Using both storage design concepts, large datasets can be held and processed in main memory with a low memory footprint. In recent years, a large corpus of lightweight data compression algorithms has been developed to efficiently support different data characteristics. In this paper, we present our novel model-driven concept to integrate this large and evolving corpus of lightweight data compression algorithms in column-store database systems. The core components of our concept are (i) a unified conceptual model for lightweight compression algorithms, (ii) the specification of algorithms as platform-independent model instances, (iii) the transformation of model instances into low-level system code, and (iv) the integration of the low-level system code into a storage layer.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        With an ever-increasing volume of data in the era of Big Data, the storage
requirement for database systems grows quickly. In the same way, the pressure
to achieve the required processing performance increases, too. For tackling both
aspects in a consistent and uniform way, data compression plays an important
role. On the one hand, compression drastically reduces storage requirements.
On the other hand, compression is also the cornerstone of efficient processing
capabilities by enabling in-memory technologies. This aspect is heavily utilized in
modern in-memory database systems [
        <xref ref-type="bibr" rid="ref1 ref23">1, 23</xref>
        ] based on a decomposition storage
model (DSM) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. For the compression of sequences of integer values, as applied
in DSM-based compression or in the compression of posting lists in the context of
information retrieval, a large corpus of lightweight data compression algorithms
has been developed to efficiently support different data characteristics.
Examples of lightweight compression techniques
are: frame-of-reference (FOR) [
        <xref ref-type="bibr" rid="ref23 ref8">8, 23</xref>
        ], delta coding (DELTA) [
        <xref ref-type="bibr" rid="ref15 ref19">15, 19</xref>
        ], dictionary
compression (DICT) [
        <xref ref-type="bibr" rid="ref1 ref23">1, 23</xref>
        ], bit vectors (BV) [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], run-length encoding (RLE)
[
        <xref ref-type="bibr" rid="ref1 ref19">1, 19</xref>
        ], and null suppression (NS) [
        <xref ref-type="bibr" rid="ref1 ref19">1, 19</xref>
        ]. These algorithms achieve good
compression rates and they provide fast compression as well as decompression. The
corpus evolves further because it is impossible to design an algorithm that
always produces optimal results for all kinds of data. As shown, e.g., in [
        <xref ref-type="bibr" rid="ref1 ref23 ref24">1, 23, 24</xref>
        ],
the query performance gain for analytical queries is massive.
      </p>
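      <p>
        To make these technique families concrete, the following purely illustrative Octave sketch (our addition, not taken from the cited algorithms) shows the logical effect of FOR and DELTA on a small integer sequence; NS would then store the resulting small values with fewer bits:
      </p>
      <p>% illustrative only: logical level of FOR and DELTA
in    = [100 103 101 107];
offs  = in - min(in);       % FOR:   [0 3 1 7], offsets from the reference value 100
delta = [in(1), diff(in)];  % DELTA: [100 3 -2 6], differences to the predecessor</p>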
      <p>From the database system architecture perspective, the most challenging
task is now to define an approach allowing us to integrate this large and
evolving corpus of lightweight compression algorithms in an efficient way. The naive
approach would be to natively implement the algorithms in the storage layer of
an in-memory database system, as done today. However, this naive approach has
several drawbacks, e.g., (1) a massive effort to implement every possible lightweight
compression algorithm, and (2) database developers or even users are not
able to integrate a specific compression algorithm without a deep system
understanding and without implementing the algorithm at the system level. Generally, users
know their data and could design appropriate compression algorithms; however,
database developers have to preserve the generality of the system and cannot
implement many algorithms for a small range of applications. To overcome this
situation, we propose our novel model-driven approach to easily and automatically
integrate almost every lightweight data compression algorithm in an in-memory
DSM storage layer. In detail, we address the following points:
1. We start with a brief solution overview in Section 2. As we are going to show,
our solution consists of four components: (i) a unified conceptual model for
lightweight compression algorithms, (ii) a description approach for algorithms
as model instances, (iii) the transformation of model instances to executable code,
and (iv) the integration of the generated code into the storage layer.
2. Based on this solution overview, we summarize our conceptual model for
lightweight compression algorithms in Section 3. This specific conceptual
model helps in describing, understanding, communicating, and comparing
the algorithms on a conceptual level.
3. Then, we introduce our description language enabling database developers
to specify algorithms in a platform-independent form based on our model.
4. These platform-independent model instances are the foundation for database
system integration as described in Section 5. Here, we introduce our approach
to transform the platform-independent instances into executable algorithms.
5. We conclude with related work and a summary in Sections 6 and 7.</p>
    </sec>
    <sec id="sec-2">
      <title>Solution Overview</title>
      <p>
        Fundamentally, the model-driven architecture (MDA) is a software design
approach for the development of software systems [
        <xref ref-type="bibr" rid="ref14 ref21">14, 21</xref>
        ]. In the MDA approach,
the system functionality is defined with a platform-independent model (PIM)
using an appropriate domain-specific language [
        <xref ref-type="bibr" rid="ref14 ref21">14, 21</xref>
        ]. Then, the PIM is
translated into one or more platform-specific models (PSMs) that can be executed
[
        <xref ref-type="bibr" rid="ref14 ref21">14, 21</xref>
        ]. The MDA paradigm is widely used in the area of database applications
for database creation. On the one hand, model-driven data modeling and
the generation of normalized database schemas should be mentioned. On the
other hand, there is the generation of full database applications, including the
data schema as well as data layer code, business logic layer code, and even user
interface code [
          <xref ref-type="bibr" rid="ref10 ref18">10, 18</xref>
          ]. Furthermore, the MDA approach has been successfully
applied in the area of data warehouse schema creation [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], as well as the
modeling and generation of data integration processes [
          <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
          ]. However, there is little
to no work on the utilization of MDA for database system-internal components.
      </p>
      <p>[Fig. 1: Overview. Left: an in-memory database system with a processing layer, a storage layer comprising the storage format and storage compression components, and a persistence/durability layer. Right: our integration concept with the COLLATE model, model instances 1 to n, a Model-2-Code-Transformer, and generated code instances for the database system.]</p>
        <p>
          As depicted on the left side in Fig. 1, the storage layer of an in-memory
database system usually consists of two important components. The first
component is the storage format, for which several formats have been proposed. One well-known
format is the N-ary storage model (NSM), storing tuples coherently [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. That
means, the tuple unit is preserved in this storage format. The decomposition
storage model (DSM), proposed in 1985 [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], is a second widely utilized format.
The DSM partitions an n-attribute relation vertically into n sub-relations. Each
sub-relation contains two attributes, a logical record id (surrogate) and an
attribute value [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. In most cases, the surrogate can be neglected due to the order
of the tuples. That means, sub-relation storage equals a value-based storage
model in the form of a sequence of values. While the NSM storage format is used
in disk-based database systems, the DSM is the preferred layout of in-memory
databases [
          <xref ref-type="bibr" rid="ref1 ref23 ref24 ref4">1, 4, 23, 24</xref>
          ]. Compression is the second component of the storage layer,
for which a large and evolving corpus of lightweight data compression algorithms
has been developed to efficiently support different data characteristics [
          <xref ref-type="bibr" rid="ref1 ref15 ref23">1, 15, 23</xref>
          ].
        </p>
      <p>For this compression component, we have developed an MDA-based approach
as illustrated on the right side in Fig. 1. According to the MDA paradigm,
the lightweight data compression algorithms have to be defined in a
platform-independent model. To achieve this, we developed a conceptual model for
lightweight compression algorithms called COLLATE for this specific class of
algorithms. The aim of COLLATE is to provide a holistic and platform-independent
view of the necessary concepts including all aspects of data (structure), behavior
(function), and interaction (component interaction). In particular, COLLATE
consists of only five main concepts, and we can specify a basic functional
arrangement of these concepts as a blueprint for every lightweight data
compression algorithm for DSM (see Section 3). That means, the platform-independent
model of a lightweight data compression algorithm is a specific model instance
of COLLATE expressed in an appropriate language (see Section 4). Then, the
platform-independent model instance can be transformed into platform-specific executable
code as presented in Section 5. The generated code can be used in the database
system in a straightforward way. In this paper, we focus on data compression
algorithms. The same can be done for the decompression algorithms with a slightly
adjusted conceptual model.</p>
    </sec>
    <sec id="sec-3">
      <title>Conceptual Model</title>
      <p>
        One of our main challenges is the definition of a conceptual model for the class of
lightweight data compression algorithms. This is the starting point and anchor of
our approach, since all algorithms can be consistently described with this unified
and specific model in a platform-independent manner. In [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], we have proposed
an appropriate model called COLLATE and described its development in
detail. In the remainder of this section, we briefly summarize its main aspects.
      </p>
      <p>The input for COLLATE is a sequence of uncompressed (integer) values,
due to the DSM storage format; the output is a sequence of compressed
values. Input and output data have a logical representation (semantic level) and a
physical representation (bit or encoding level). Through the analysis of the
available algorithms, we have identified three important aspects. First, there are only
six basic techniques which are used in the algorithms. These basic techniques
are parameter-dependent, and the parameter values are calculated within the
algorithms. Second, a lot of algorithms subdivide the input data hierarchically
into subsequences for which the parameters can be calculated. The subsequent data
processing of a subsequence depends on the subsequence itself. That means, data
subdivision and parameter calculation are the adjustment points, while the
application of the basic techniques is straightforward. Third, for an exact algorithm
description, the combination and arrangement of codewords and parameters have
to be defined. Here, the algorithms differ widely.</p>
      <p>Based on a systematic algorithm analysis, we defined our conceptual model
for this class of algorithms. The COLLATE model consists of five main concepts
(or building blocks) required to transform a sequence of uncompressed
values into a sequence of compressed values:
Recursion: Each model instance includes a Recursion per se. This concept
is responsible for the hierarchical sequence subdivision and for applying the
concepts included in the Recursion to each data subsequence.</p>
      <p>Tokenizer: This concept is responsible for dividing an input sequence into finite
subsequences of k values (or single values).</p>
      <p>Parameter Calculator: The concept Parameter Calculator determines
parameter values for finite subsequences or single values. The specification of
the parameter values is done using parameter definitions.</p>
      <p>Encoder: The third concept determines the encoded form of the values to be
compressed at bit level. Again, the concrete encoding is specified using functions
representing the basic techniques.</p>
      <p>[Fig. 2: Data flow between the COLLATE concepts. A Recursion receives the input sequence together with (fixed or adapted) parameters and sets of valid tokens; its Tokenizer splits the input into finite token sequences; a Parameter Calculator derives calculated parameters from fixed parameters and adapted values; an Encoder or nested Recursion produces the compressed value or compressed sequence; and a Combiner assembles the encoded sequence.]</p>
      <p>Combiner: The Combiner is essential to arrange the encoded values and the
calculated parameters for the output representation.</p>
      <p>In addition to these individual concepts, Fig. 2 illustrates the interactions and
the data flow through our concepts. In this figure, a simple case with only one
pair of Parameter Calculator and Encoder is depicted, which can be described
as follows. The input data is first processed by a Tokenizer. Most Tokenizers
need only a finite prefix of a data sequence to decide how many values to
output. The rest of the sequence is used as further input for the Tokenizer and
processed in the same manner (shown with a dashed line). Moreover, there are
Tokenizers needing the whole (finite) input sequence to decide how to subdivide
it. A second task of the Tokenizer is to decide for each output sequence which
pair of Parameter Calculator and Encoder is used for the further processing.
Most algorithms process all data in the same way, so only one pair of
Parameter Calculator and Encoder is needed. Some of them distinguish several cases, so
that a choice between several pairs is necessary. The finite Tokenizer output
sequences serve as input for the Parameter Calculator and the Encoder.</p>
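      <p>
        As a small illustration (plain Octave, outside the COLLATE classes), a block-wise Tokenizer with k = 3 would split its input as follows:
      </p>
      <p>% illustrative splitting of an input sequence into finite subsequences
inp  = [10 20 30 40 50 60 70];
k    = 3;
head = inp(1:k);       % first token: [10 20 30]
rest = inp(k+1:end);   % remaining input, processed in the same manner</p>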
      <p>Parameters are often required for the encoding and decoding. Therefore, we
defined the Parameter Calculator concept, which knows special rules
(parameter definitions) for the calculation of several parameters. Parameters can also be
used to store a state during data processing (depicted with a dashed line).
Calculated parameters have a logical representation for further calculations and
the encoding of values, as well as a representation at bit level: on the
one hand, they are needed to calculate the encoding of values; on the other hand,
they have to be stored additionally to allow the decoding.</p>
      <p>
        The Encoder processes an atomic input, where the output of the Parameter
Calculator and other parameters are additional inputs. The input is a token
that cannot or shall not be subdivided anymore. In practice, the Encoder mostly
gets a single integer value to be mapped into a binary code. Similar to the
parameter definitions, the Encoder calculates a logical representation of its input
value and an encoding at bit level using functions. Finally, the Combiner arranges
the encoded values and the calculated parameters for the output representation.
      </p>
      <p>
        Based on our conceptual model, we are able to specify lightweight data
compression algorithms in a platform-independent way [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. For the specification
and our overall MDA-based solution, we further require an appropriate language
approach. While we introduce our language concept in Section 4.1 in general, we
describe an example in Section 4.2.
      </p>
      <sec id="sec-3-2">
        <title>Description Language</title>
        <p>Instead of developing a completely new language, we decided to use the GNU
Octave (https://www.gnu.org/software/octave/) high-level programming language as a foundation for the functional
behavior of COLLATE. A COLLATE instance is always a Recursion concept.
A Recursion consists of (1) a Tokenizer, (2) a set of pairs of a Parameter
Calculator and a Scheme, and (3) a Combiner concept. A Scheme is either
an Encoder or a Recursion. This relationship can be summarized using a
set-oriented notation as follows:</p>
        <p>ModelInstance ∈ Recursion;
Recursion = Tokenizer × P(Pair) × Combiner;
Pair = ParameterCalculator × Scheme;
Scheme = Recursion ∪ Encoder. (1)</p>
        <p>As mentioned in Section 3, the concepts contain functions for data
processing. Therefore, we combine functional and object-oriented programming for the
specification of an algorithm in order to preserve the component interaction.
According to the class diagram of Fig. 3 and the set-oriented notation, we are
able to specify a COLLATE model instance as an object of the class Recursion
in the Octave high-level programming language. The Recursion class contains
(i) a Tokenizer object which is characterized by a function handle, (ii) an array
of pairs, and (iii) a Combiner object which is also defined by a function handle. A
function handle is an anonymous function with the syntax @(argument-list)
expression. The input for a Tokenizer's function handle is (1) a sequence inp
of values and (2) a structure par. This structure contains one field (attribute)
for each valid parameter. These attributes might be global parameters that are
valid for each subsequence or each single value, or other calculated parameters.
A Tokenizer's function handle returns a triple of values: the first
element is the number of output values, the second element is a finite
subsequence of the input sequence, and the third element is a number indicating
which of the pairs of Parameter Calculator and Scheme is chosen for the
further data processing. A Combiner's function handle has two input values. The
first one is an array inp of tuples, each consisting of a compressed sequence or a compressed
value inp.enc, which is the output of a Scheme, and the corresponding output of
the Parameter Calculator, inp.par. The second one is the structure par that
contains all other parameters that are valid for the whole array inp. The output
of a Combiner is a concatenation of compressed sequences or values.</p>
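        <p>
          As a sketch under assumptions (the triple represented as an Octave cell array, the reduce helper as used in Fig. 4), such function handles might look as follows:
        </p>
        <p>% assumed sketch: a Tokenizer handle emitting blocks of par.n.log values and
% routing them to pair 1, and a Combiner handle concatenating the encodings
tok_fh  = @(inp, par) {par.n.log, inp(1:par.n.log), 1};
comb_fh = @(inp, par) reduce(@(b, c) strcat(c.enc, b), inp, "");</p>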
        <p>Each pair consists of a Parameter Calculator object, which is defined by
several parameter definition objects, and an Encoder or a Recursion object.
Each parameter definition contains a function handle for the calculation of the logical
parameter and a function handle for the calculation of the encoding of the
logical parameter. The input for the logical calculation is (1) the subsequence
inp that is an output of the Tokenizer and (2) a structure par of known and
valid parameters. The output is a logical parameter. Its type is not fixed: often it
is a single value, but it can be a more complex parameter, e.g., a mapping like
a dictionary. The input of the physical calculation is (1) the logical parameter
inp and (2) all other known and valid parameters par. Its output is the encoding
of the parameter. The function composition of both function handles maps a
finite subsequence to a bit-level representation of a parameter that is valid for
the sequence inp and the parameters par. The same holds for the logical and
physical function handles of the Encoder.</p>
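        <p>
          A minimal sketch of this composition in plain Octave (class wrappers omitted): the logical handle derives the reference value, the physical handle encodes it with 32 bits.
        </p>
        <p>% sketch: logical and physical function handles of a parameter definition
log_fh  = @(inp, par) min(inp);          % logical value of the parameter
phys_fh = @(val, par) dec2bin(val, 32);  % bit-level encoding of that value
bits = phys_fh(log_fh([107 103 101], struct()), struct());  % 32-bit string for 101</p>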
      </sec>
      <sec id="sec-3-3">
        <title>Example Algorithm</title>
        <p>To illustrate our Octave language approach, we now present a simple example
of a lightweight data compression algorithm: frame-of-reference with binary packing
for n values (forbp). Here, an input sequence of arbitrary length is subdivided into
subsequences of n values. For each subsequence of n values, the minimum is calculated
as the reference value. So, each value of the subsequence can be mapped
to its difference to the reference value at the logical data level. This technique is
called frame-of-reference (FOR), and it is used to obtain smaller values, which can be
encoded with a smaller bit width than 32 or 64 bits at the physical data level.
The bit-level representations of all n deltas can be encoded with a common bit
width. Because the deltas are smaller than the original values, a possible common
bit width might be smaller than 32 or 64 bits. This technique is called binary
packing (BP). To guarantee decodability, the reference value and the used
bit width have to be stored in addition to every sequence of n encoded values.</p>
        <p>[Fig. 4, left: graphical representation of the COLLATE instance for forbp. Outer Tokenizer: k = n over the input sequence (tail: n values); Parameter Calculator: mref = (min(•), bin(•, 32)), bw = (⌊log2(max(max(•) − mref, 1))⌋ + 1, bin(•, 32)); inner Tokenizer: k = 1 (tail: 1 value). Right: the Octave code representation, reproduced below.]</p>
        <p>% Note: the first lines of the Fig. 4 code, defining tok_n, the parameter
% definitions mref and bw, pcal_n, and tok_1, are not recoverable from this
% extraction; the remaining lines read:
pcal_1 = ParameterCalculator([ ]);
enc_1 = Encoder(
  @(inp,par) inp(1) - par.ref.log,
  @(inp,par) dec2bin(inp,par.bw.log));
pair_1 = Pair(pcal_1, enc_1);
comb_1 = Combiner(
  @(inp,par) reduce(@(b,c) strcat(c.enc,b),inp,""));
rec_n = Recursion(tok_1, [pair_1], comb_1);
pair_n = Pair(pcal_n, rec_n);
comb_n = Combiner(@(inp,par) reduce(
  @(b,c) strcat(c.enc,c.par.bw.phys,c.par.ref.phys,b),
  inp,par.n.phys));
forbp = Recursion(tok_n, [pair_n], comb_n);</p>
        <p>Fig. 4 depicts the graphical representation of the corresponding COLLATE
instance on the left side and the Octave code representation on the right side. The
input is an arbitrary sequence of 32-bit integer values. The block size n, a
further input, serves as a global parameter to make the algorithm more generic.
The whole model instance is an instance of a Recursion concept. It consists
of a Tokenizer, one pair of a Parameter Calculator and a Recursion, and a
Combiner. The Tokenizer outputs the first k := n values. For these n values, the
Parameter Calculator determines the minimum as the reference value mref =
min(•) and an appropriate bit width bw = ⌊log2(max(max(•) − mref, 1))⌋ + 1 at
the logical data level. In the end, both values are encoded in the output with
32 bits each at the physical data level (denoted by the function bin(•, 32)). So
each parameter definition consists of one function that addresses the logical data
level and one function that considers the encoding at the physical data level. The
nested Recursion consists of a Tokenizer, one pair of a Parameter Calculator
and an Encoder, and a Combiner. This Tokenizer outputs single values (k := 1).
This Parameter Calculator has no task in this instance. Just as the parameter
definitions, the Encoder consists of one function that processes an input value at
the logical level and one function that encodes the value at the physical data level.
The Encoder determines the difference between the single input value and the reference
value at the logical data level (denoted by • − mref) and the encoding of
this value with the calculated bit width at the physical data level (denoted
by bin(•, bw)). The inner Combiner concatenates the physical representations
of the n values; this is denoted by c^n, where physical representations of values are
underlined in the graphical model and the power n indicates the concatenation of n
values. The outer Combiner concatenates the n encoded values with the physical
representation of their reference value and the physical representation of their
bit width for all blocks, plus the bit-level representation of the number n.</p>
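        <p>
          For instance, the global block size could be supplied as the parameter structure par (field names assumed to match the extracted Fig. 4 code):
        </p>
        <p>% hypothetical usage: global parameter n = 128 with logical and physical form
par = struct("n", struct("log", 128, "phys", dec2bin(128, 32)));</p>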
        <p>The Octave code on the right side contains the same information. In line 2,
a Tokenizer object is defined. A function handle is the input for the constructor.
Its return value is the triple consisting of the number n, the first n values of
the input sequence, and the number (address) 1. The whole algorithm, as the
Recursion object forbp (line 12), is constructed with the Tokenizer (line 2), an
array with one pair (line 12) of a Parameter Calculator (line 5) and a Recursion
(line 11), and a Combiner (line 13). The first calculated parameter is the reference
value mref. Its parameter definition is constructed in line 3 with one function
handle for the logical calculation and a second function handle for the physical
calculation. Each concept can be expressed with one or a few lines of Octave code.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>System Integration</title>
      <p>
        The result of the previous two sections is that we are now able to specify
lightweight compression algorithms in a platform-independent way by defining
model instances in an Octave notation. As depicted in Fig. 4, the first
step of the system integration is done using a CREATE COMPRESSION statement to
register algorithms in the database system under a user-defined name, e.g., forbp
in this example. The system integration continues with (i) an approach to
transform model instances to executable code and (ii) the specification of the
application of the compression algorithm. For this application, we extended the
CREATE TABLE syntax in a straightforward way to allow the specification of the
compression algorithm which should be used for each attribute separately:
CREATE TABLE Test (attribute_a int compress forbp(
  struct("num", struct("log", 128, "phys", dec2bin(128, 32)))));
In order to execute model instance specifications inside a database system
(e.g., MonetDB [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]), we require an approach to transform model instances into
executable, platform-specific code. Fig. 5 depicts our overall
approach for this challenge. Generally, we follow a generator approach in our
Model-2-Code Transformer. At the moment, the input is (i) an Octave
specification of a model instance and (ii) code templates for our model concepts. On
the COLLATE model level, we have five specific concepts. That means, we require
one code template for each model concept to generate executable code. The code
templates have to be implemented once for each specific database system, e.g.,
MonetDB [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. This is necessary to get access to data on the specific storage layer
implementation.
      </p>
      <p>[Fig. 5: The Model-2-Code Transformer. Model level: the COLLATE model and code templates (covering the hardware model and SIMD aspects). Model instance level: the Octave specification of a model instance as input and the generated code instance as output. The code generation itself constitutes the system integration level.]</p>
        <p>
          Based on this input, our Model-2-Code Transformer generates the
executable code using a replacement and optimization strategy. In this case, the
Octave specification is parsed and a corresponding arrangement of the code
templates is constructed. The code templates are enriched with the corresponding
parameter calculations or encoding functions. Then, the code template
arrangement is optimized by applying typical compiler techniques like loop unrolling or
load constant replacement [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. The goal of this optimization is to generate code
with fewer instructions to save as many CPU cycles as possible. This is particularly
crucial for reducing the compression overhead.
        </p>
        <p>
          The generated code can be seamlessly integrated into the specific database
system since the templates are implemented for the particular system. At the
moment, we are implementing and integrating the whole approach for MonetDB [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ],
whereby our Model-2-Code Transformer is implemented using the LLVM
compiler infrastructure. Unfortunately, the current status does not yet allow a
meaningful evaluation with regard to performance.
        </p>
    </sec>
    <sec id="sec-5">
      <title>Related Work</title>
      <p>
        In the previous sections, we explained our overall approach for the integration of
different data compression algorithms in a column-store database system. With
our approach, an in-memory column-store system is extendable with regard to
the bit-level data representation at the storage layer level. Generally, the extensibility
of DBMS was a research field in the late 1980s and early 1990s [
        <xref ref-type="bibr" rid="ref20 ref5 ref6">5, 6, 20</xref>
        ]. Here, the
focus was the integration of new user-defined data types, storage methods, and
indexes, but not the extensibility with regard to new algorithms on top of storage
formats. Furthermore, there is also work available focusing on code generation in
database systems. For example, Neumann has proposed a completely new query
processing architecture [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Here, a code generator is used to build specialized
query execution plans on demand based on operator templates. He also used the
LLVM framework to merge operations into one machine code function. In our
work, we follow a similar approach for lightweight data compression algorithms.
An interesting research direction would be to combine both approaches for a
compression-aware query processing [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>
        The efficient storage and processing of large datasets is a challenge in the era of
Big Data. To tackle this challenge, data compression plays an important role from
a system-level perspective. Aside from drastically reducing the storage
requirement, data compression also enables efficient processing using in-memory
technologies. To do justice to that, we have presented in this paper a model-driven
approach to integrate a large and evolving corpus of lightweight data
compression algorithms in column-store database systems like MonetDB [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
Generally, our approach consists of four components: (i) a unified conceptual
model for lightweight compression algorithms, (ii) a description approach for
algorithms as model instances, (iii) the transformation of model instances to executable
code, and (iv) the integration of the generated code into the storage layer. In this paper,
we concentrated on data compression; the same is also possible for data
decompression. In our further research activities, we want to shift our focus to different
storage formats, covering not only column-store database systems but also, for example,
row stores, to which our overall approach carries over.
      </p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work was partly funded (1) by the German Research Foundation (DFG) in
the context of the project "Lightweight Compression Techniques for the
Optimization of Complex Database Queries" (LE-1416/26-1) and (2) by the German
Federal Ministry of Education and Research (BMBF) in the EXPLOIDS project
under grant 16KIS0523.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Abadi</surname>
            ,
            <given-names>D.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Madden</surname>
            ,
            <given-names>S.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferreira</surname>
            ,
            <given-names>M.C.</given-names>
          </string-name>
          :
          <article-title>Integrating compression and execution in column-oriented database systems</article-title>
          . In: In SIGMOD. pp.
          <volume>671</volume>
          {
          <issue>682</issue>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. Bohm,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Wloka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            ,
            <surname>Habich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            ,
            <surname>Lehner</surname>
          </string-name>
          ,
          <string-name>
            <surname>W.</surname>
          </string-name>
          :
          <article-title>Model-driven generation and optimization of complex integration processes</article-title>
          .
          <source>In: ICEIS</source>
          . pp.
          <volume>131</volume>
          {
          <issue>136</issue>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. Bohm,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Wloka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            ,
            <surname>Habich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            ,
            <surname>Lehner</surname>
          </string-name>
          ,
          <string-name>
            <surname>W.:</surname>
          </string-name>
          <article-title>GCIP: exploiting the generation and optimization of integration processes</article-title>
          .
          <source>In: EDBT</source>
          . pp.
          <volume>1128</volume>
          {
          <issue>1131</issue>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Boncz</surname>
            ,
            <given-names>P.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kersten</surname>
            ,
            <given-names>M.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manegold</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Breaking the memory wall in monetdb</article-title>
          .
          <source>Commun. ACM</source>
          <volume>51</volume>
          (
          <issue>12</issue>
          ),
          <volume>77</volume>
          {
          <fpage>85</fpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Carey</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haas</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Extensible database management systems</article-title>
          .
          <source>SIGMOD Rec</source>
          .
          <volume>19</volume>
          (
          <issue>4</issue>
          ),
          <volume>54</volume>
          {60 (Dec
          <year>1990</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Carey</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>DeWitt</surname>
          </string-name>
          , D.J.,
          <string-name>
            <surname>Frank</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Muralikrishna</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Graefe</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Richardson</surname>
            ,
            <given-names>J.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shekita</surname>
            ,
            <given-names>E.J.:</given-names>
          </string-name>
          <article-title>The architecture of the exodus extensible dbms</article-title>
          .
          <source>In: OODS</source>
          . pp.
          <volume>52</volume>
          {
          <issue>65</issue>
          (
          <year>1986</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Copeland</surname>
            ,
            <given-names>G.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khosha an</surname>
            ,
            <given-names>S.N.:</given-names>
          </string-name>
          <article-title>A decomposition storage model</article-title>
          .
          <source>SIGMOD Rec</source>
          .
          <volume>14</volume>
          (
          <issue>4</issue>
          ),
          <volume>268</volume>
          {279 (May
          <year>1985</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Goldstein</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ramakrishnan</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shaft</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          :
          <article-title>Compressing relations and indexes</article-title>
          .
          <source>In: ICDE</source>
          . pp.
          <volume>370</volume>
          {
          <issue>379</issue>
          (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Habich</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Damme</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehner</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Optimierung der Anfrageverarbeitung mittels Kompression der Zwischenergebnisse</article-title>
          . In: BTW. pp.
          <volume>259</volume>
          {
          <issue>278</issue>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Habich</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Richly</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehner</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Gignomda - exploiting cross-layer optimization for complex database applications</article-title>
          . In: VLDB (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11. Hansch,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Kissinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            ,
            <surname>Habich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            ,
            <surname>Lehner</surname>
          </string-name>
          ,
          <string-name>
            <surname>W.</surname>
          </string-name>
          :
          <article-title>Plan operator specialization using re ective compiler techniques</article-title>
          .
          <source>In: BTW</source>
          . pp.
          <volume>363</volume>
          {
          <issue>382</issue>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Hildebrandt</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Habich</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Damme</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehner</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>COLLATE - a conceptual model for lightweight data compression algorithms</article-title>
          .
          <source>Tech. rep., Technische Universitaet Dresden</source>
          , Database Systems Group (
          <year>2016</year>
          ), https://goo.gl/SgXm5z
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chou</surname>
            ,
            <given-names>H.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Banerjee</surname>
          </string-name>
          , J.:
          <article-title>Operations and implementation of complex objects</article-title>
          .
          <source>In: ICDE</source>
          . pp.
          <volume>626</volume>
          {
          <issue>633</issue>
          (
          <year>1987</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Kleppe</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Warmer</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bast</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          : MDA Explained.
          <article-title>The Model Driven Architecture: Practice and Promise</article-title>
          .
          <string-name>
            <surname>Addison-Wesley</surname>
          </string-name>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Lemire</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boytsov</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Decoding billions of integers per second through vectorization</article-title>
          . Softw.,
          <string-name>
            <surname>Pract</surname>
          </string-name>
          . Exper.
          <volume>45</volume>
          (
          <issue>1</issue>
          ),
          <volume>1</volume>
          {
          <fpage>29</fpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Neumann</surname>
          </string-name>
          , T.:
          <article-title>E ciently compiling e cient query plans for modern hardware</article-title>
          .
          <source>Proc. VLDB Endow</source>
          .
          <volume>4</volume>
          (
          <issue>9</issue>
          ),
          <volume>539</volume>
          {550 (Jun
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17. OMG:
          <article-title>Common Warehouse Metamodel (CWM), Version 1</article-title>
          .0 (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Richly</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Habich</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehner</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Gignomda - generation of complex database applications</article-title>
          .
          <source>In: Grundlagen von Datenbanken</source>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Roth</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Horn</surname>
            ,
            <given-names>S.J.V.</given-names>
          </string-name>
          :
          <article-title>Database compression</article-title>
          .
          <source>SIGMOD Record</source>
          <volume>22</volume>
          (
          <issue>3</issue>
          ),
          <volume>31</volume>
          {
          <fpage>39</fpage>
          (
          <year>1993</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Schwarz</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Freytag</surname>
            ,
            <given-names>J.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lohman</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McPherson</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mohan</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pirahesh</surname>
          </string-name>
          , H.:
          <article-title>Extensibility in the starburst database system</article-title>
          .
          <source>In: OODS</source>
          . pp.
          <volume>85</volume>
          {
          <issue>92</issue>
          (
          <year>1986</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Thomas</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barry</surname>
            ,
            <given-names>B.M.:</given-names>
          </string-name>
          <article-title>Model driven development: the case for domain oriented programming</article-title>
          .
          <source>In: OOPSLA</source>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Williams</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <source>Adaptive Data Compression</source>
          . Kluwer international series in engineering and computer science:
          <source>Communications and information theory</source>
          , Springer US (
          <year>1991</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Zukowski</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heman</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nes</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boncz</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Super-scalar RAM-CPU cache compression</article-title>
          .
          <source>In: ICDE</source>
          . p.
          <volume>59</volume>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Zukowski</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nes</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boncz</surname>
            ,
            <given-names>P.A.</given-names>
          </string-name>
          :
          <article-title>DSM vs</article-title>
          .
          <article-title>NSM: CPU performance tradeo s in block-oriented query processing</article-title>
          .
          <source>In: DaMoN</source>
          . pp.
          <volume>47</volume>
          {
          <issue>54</issue>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>