<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards Programmability of a NUMA-aware Storage Engine</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Dirk Habich</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Johannes Schad</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thomas Kissinger</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wolfgang Lehner</string-name>
          <email>wolfgang.lehnerg@tu-dresden.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Technische Universitat Dresden, Database Systems Group, WWW home page: https://wwwdb.inf.tu-dresden.de</institution>
        </aff>
      </contrib-group>
      <fpage>391</fpage>
      <lpage>402</lpage>
      <abstract>
        <p>The SQL database language was originally intended for application programmers. However, after more than 20 years of language extensions, SQL is mostly generated by software components and is no longer suitable for a growing user base of knowledge workers and data scientists, who want to work with data in an interactive fashion. The original idea of declarative query languages, telling the system what information to retrieve and not how to retrieve it, is still relevant. However, procedural elements are extremely worthwhile and have to be part of a next-generation database programming language without compromising performance and scalability. To tackle this challenge, we present our overall approach in this paper, consisting of a highly scalable NUMA-aware storage engine, ERIS, and a novel procedural programming approach on top of ERIS.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In the era of Big Data, the requirements for data management systems are
manifold, with performance and scalability still being the two most important
technical ones. Beyond these technical requirements, the usability aspect is
growing in relevance. Generally, none of these aspects is new; all have been
established in the database community for a long time. While most research has
focused on satisfying the performance and scalability requirements, usability
has been addressed in various lines of work as well; the map-reduce programming
concept or the PACT programming model [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] come to mind. From our point of view, the most challenging issue
for tomorrow's data management systems is the consolidation of technical and
usability aspects, taking programmability into account.
      </p>
      <p>
        As already mentioned in the Claremont [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] as well as the Beckman [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
report on database research, the user base of DBMS is rapidly growing, and new
users bring new expectations with regard to programmability and programming
abstractions over very large data sets. Today, most users are unhappy
(i) with the offered user interfaces and (ii) with the heavyweight system
architecture. While the declarative way of SQL is often perceived as too restricted,
procedural options like stored procedures are too complex. Furthermore,
the extensibility of traditional database systems via such procedural options
is limited, which influences usability as well as performance. Therefore,
traditional DBMS are nowadays often degraded to storage units, and enhanced
functionality is implemented on top or in competing data processing systems like
Hadoop. However, with increasing data volumes, exporting massive data sets
from the database and conducting data-intensive processing in user applications
is no longer a viable option. Tomorrow's applications will have to push down
their procedural logic into the database to work close to the data.
      </p>
      <p>
        In order to tackle technical and usability requirements for database systems
in a unified way and to satisfy a growing user base of knowledge workers and data
scientists, we are pursuing a holistic approach with novel and unique features:
NUMA-aware Storage Engine: The ever-growing demand for more
computing power forces hardware vendors to put an increasing number of
multiprocessors into a single server system, which usually exhibits non-uniform
memory access (NUMA). Based on that hardware foundation, we developed a new
NUMA-aware in-memory storage engine, ERIS, that is based on a data-oriented
architecture [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. In contrast to existing approaches that focus on transactional
workloads on disk-based DBMS, ERIS aims at tera-scale analytical workloads
that are executed entirely in main memory. ERIS uses an adaptive data
partitioning approach that exploits the topology of the underlying NUMA platform
and significantly reduces NUMA-related issues. As we have shown in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], we
achieve more than a linear speedup for index lookups and scalable parallel scan
operations.
      </p>
      <p>Programmability: Our NUMA-aware in-memory storage engine currently
supports three basic storage operations that are required to run analytical queries:
scan, lookup, and insert/upsert. Aside from read operations, fast write
operations are necessary, especially to materialize large intermediate results. Based
on those storage operations, we are currently developing a new procedural user
interface, so that users are able to program their analytical tasks. In this paper,
our primary goal is to introduce this procedural user interface and to show how it
can be used to add support for SQL functionality on ERIS. On the one hand,
the new user interface facilitates easy and abstract programming in a procedural
fashion for a broad user base. On the other hand, the procedural analytical
tasks are efficiently executable on our NUMA-aware storage engine ERIS.</p>
      <p>The remainder of the paper is structured as follows: In the following
section, we briefly review our NUMA-aware in-memory storage system. Then, we
introduce our programmability concept using partitioning schemes as first-class
citizen operators in Section 3. In Section 4, we discuss some execution
perspectives. We conclude the paper with remarks on related work, future work,
and a short conclusion.</p>
      <p>Fig. 1. Overview of the ERIS architecture: AEUs pinned to the cores of each
multiprocessor, per-multiprocessor local memory managers, NUMA-optimized
high-throughput data command routing, load balancing, and partition transfer.</p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], we introduced our NUMA-aware all-in-memory storage engine for
tera-scale analytical workloads. Fundamentally, NUMA systems consist of several
interconnected multiprocessors (also denoted as nodes). Each multiprocessor
contains multiple processing units (cores) and an integrated memory controller
(IMC). Although the installed main memory is distributed across the IMCs of
the different multiprocessors, each multiprocessor can access every memory
location of the system. Therefore, latency and bandwidth of memory accesses depend
on the distance between the requesting multiprocessor and the multiprocessor
that holds the data in local memory. The local memory associated with each
multiprocessor is accessed with low latency at a high bandwidth. In contrast,
remote memory is accessed via point-to-point connections between the
multiprocessors, which add latency and limit the achievable bandwidth. As we have shown
in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], the latency of remote accesses is approximately 10 times higher and the
bandwidth is limited to about 11% compared to local accesses in the worst case.
To efficiently tackle the NUMA-specific properties from a performance
perspective for analytical workloads, our in-memory storage engine ERIS is treated like a
distributed system using a data-oriented architecture [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], where each data object
is logically partitioned. Figure 1 shows the overall architecture of ERIS and its
individual components. The central components are the worker threads, which
we call Autonomous Execution Units (AEUs). Each core (respectively, hardware
context) of a NUMA system runs exactly one AEU, and an AEU never leaves
its core. All AEUs that are pinned on the same multiprocessor use a common
memory manager, because they share the same local main memory. Every AEU
gets assigned a set of disjoint data partitions (each belonging to a different
data object), which it stores in local memory. Further, every AEU holds
exclusive access rights to its partitions. This approach restricts memory accesses of an
AEU to the multiprocessor's local main memory, and data objects do not have to
be protected against concurrent accesses via latches. ERIS primarily uses range
partitioning to split data objects (e.g., tables) into partitions.
      </p>
      <p>On the right-hand side of Figure 1, we depict the AEU loop as well as the
local memory organization of an AEU. Each AEU maintains local data command
buffers and the actual data object partitions (stored either as a column-store, a
row-store, or an index). In the first stage of the loop, the AEU scans its data
command buffer (i.e., scan, lookup, or insert/upsert), which is periodically filled
by the routing layer, and groups commands by the accessed data object and the
command type. This optimization step allows the coalescing of the same type of
access to the same partition. For instance, an AEU is able to execute multiple scan
commands on the same partition with a single scan, thereby implementing
scan sharing in combination with MVCC to ensure isolation. Moreover, the
command grouping allows us to execute multiple index lookup or insert/upsert
operations in a single batch operation to hide the main memory latency.
Following the grouping step, the AEU actually processes its data command buffer,
which is the most time-consuming part of the loop. Afterwards, the AEU checks
its command buffer for pending balancing or transfer commands. Such
commands force an AEU to grow or shrink its partitions or to transfer a range of a
partition to another AEU.</p>
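      <p>To make the grouping stage concrete, the following self-contained Scala
sketch groups a drained command buffer by data object and command type; all
names (DataCommand, CmdType, groupCommands) are illustrative stand-ins, not
ERIS' actual API.</p>
      <p>
```scala
// Sketch of the AEU grouping stage: commands drained from the local buffer
// are grouped by (data object, command type), so that, e.g., several scan
// commands on the same object can be served by one shared physical scan.
sealed trait CmdType
case object Scan extends CmdType
case object Lookup extends CmdType
case object Upsert extends CmdType

case class DataCommand(objectId: Int, cmdType: CmdType, payload: String)

def groupCommands(buffer: Seq[DataCommand]): Map[(Int, CmdType), Seq[DataCommand]] =
  buffer.groupBy(c => (c.objectId, c.cmdType))

val buf = Seq(
  DataCommand(1, Scan, "filter: age"),
  DataCommand(1, Scan, "filter: name"), // coalesced with the scan above
  DataCommand(1, Lookup, "key 42"),
  DataCommand(2, Upsert, "row 7"))

val groups = groupCommands(buf)
```
      </p>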
      <p>NUMA-Optimized High-Throughput Data Command Routing: As shown
in Figure 1, the data command routing is the most essential part of ERIS,
because AEUs have to be supplied with data commands just in time. Especially
during the execution of analytical queries, large amounts of data commands have
to be routed between AEUs (e.g., lookup operations during a join). Thus, the
main goal of the data command routing is to distribute data commands at a
high throughput. A data command consists of a storage operation type (i.e.,
scan, lookup, or insert/upsert), a data object identifier, a reference to a
callback function, a data segment that contains all the necessary parameters for the
storage operation (e.g., a batch of keys for a lookup or filters for a scan), and
additional data that is necessary for the query processing.</p>
      <p>Load Balancing: Besides the data command routing, ERIS includes a
NUMA-aware load balancer component to adapt the data partitioning to a changing
workload. Since ERIS aims at analytical workloads, the maximization of parallel
processing is the main objective of the load balancer. Thus, there is no need for
inter-data-object balancing, for instance to colocate certain partitions of
different data objects on the same AEU, as is beneficial for transactional workloads.
The ERIS adaptation loop starts with the monitoring of different metrics on a
per-data-object level. Based on the captured metrics, the load balancer periodically
checks the load of ERIS for imbalances. If the standard deviation of the load
across the different AEUs exceeds a given threshold, the load balancer executes
a load balancing algorithm that calculates a new target partitioning. With the
help of the current and the targeted partitioning, the load balancer computes a
series of balancing commands that are routed to the involved AEUs.</p>
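      <p>The imbalance check described above can be sketched as follows; the load
metric and the threshold value are illustrative assumptions, not ERIS'
configured policy.</p>
      <p>
```scala
// Sketch of the periodic imbalance check: compute the standard deviation of
// per-AEU load and trigger rebalancing when it exceeds a threshold.
def stdDev(loads: Seq[Double]): Double = {
  val mean = loads.sum / loads.size
  math.sqrt(loads.map(l => (l - mean) * (l - mean)).sum / loads.size)
}

def needsRebalancing(loads: Seq[Double], threshold: Double): Boolean =
  stdDev(loads) > threshold

// Four AEUs with an evenly distributed load vs. one overloaded AEU.
val balanced = Seq(10.0, 11.0, 9.0, 10.0)
val skewed = Seq(40.0, 2.0, 1.0, 1.0)
```
      </p>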
      <sec id="sec-1-1">
        <title>Summary</title>
        <p>
          ERIS is a NUMA-aware, purely in-memory storage engine for tera-scale
analytical workloads that is based on a data-oriented architecture. ERIS uses a
NUMA-optimized high-throughput data command routing as well as a
configurable NUMA-aware load balancing algorithm to achieve a maximum of
parallelism when executing analytical queries with low latency. Our analysis in [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] shows
that ERIS greatly improves memory locality and cache usage and thus scales
even on large-scale NUMA platforms. Since ERIS only provides storage
operation primitives, the next step is to implement a query processing framework on
top of ERIS and to evaluate the performance of more complex queries. Query
processing with ERIS requires techniques for distributed systems and poses
additional challenges for load balancing. In particular, data partitioning is a key
aspect to enable data parallelism for processing.
        </p>
        <p>ERIS Programmability for SQL Functionality: In order to support the
creation of complex analytical workloads, ERIS needs a flexible but easy-to-use
interface for its storage and processing capabilities. Instead of specializing ERIS
to one specific database interface, we equip it with a general-purpose
programming interface which can serve as the basis for implementations of different
high-level interfaces like SQL. Many state-of-the-art databases rely on a set of
hardcoded physical operators to execute SQL statements: a SQL statement is
transformed into a graph representation, the query plan, which is subsequently
interpreted to call the appropriate physical operators. ERIS, on the other hand,
does not limit data processing to a fixed set of relational operators. Instead,
ERIS provides a means to run user-provided code on data but enforces a certain
structuring of user code and control flow in order to enable efficient and parallel
execution. To ideally exploit ERIS' data processing capabilities, the main design
objective of the ERIS programming framework is to support data locality and
parallel execution of computations. To enable that, we require, for each user
function, an explicit description of the data partitions the user function has to
be granted access to.</p>
      </sec>
      <sec id="sec-1-2">
        <title>Partitioning Schemes</title>
        <p>The ERIS programming framework is built around three data structures for the
desired SQL functionality: tuples, relations, and partitions. Just like in
typical databases, tuples are flat containers of data types like numbers, strings,
or dates. Both relations and partitions are sets of tuples with a fixed schema,
but only relations can be named and loaded from ERIS by name. There is
no direct way to access the tuples of a relation. Instead, relations have to be
partitioned first, and subsequently the individual parts of the relation can be
processed by accessing their tuples. Requiring relations to be partitioned before
their tuples can be accessed is the key to enabling data-parallel processing in
ERIS.</p>
        <p>The current version of our ERIS programming framework supports three
partitionings as first-class citizens from a programming perspective, which are
necessary to support SQL functionality.
Individuals: The first partitioning scheme is called individuals partitioning,
and it simply turns every tuple of the target relation into its own partition. In
equation 1, we define the individuals partitioning as a family of sets over a base
relation R where each element of the base relation forms its own subset.</p>
        <p>Ind(R) = { {t} | t ∈ R }    (1)
Equivalents: The second partitioning scheme groups tuples having the same
values in a fixed set of attributes. Again, the equivalents partitioning is a family
of sets. This time, a subset contains all tuples which are equivalent to a
representative tuple. The representative tuple is drawn from a projection on the
base relation of the partitioning. Two tuples are considered equivalent if they
have the same values in all attributes that they have in common.</p>
        <p>Equiv(R, a1, ..., ak) = { { t | t ∈ R ∧ t ≡ c } | c ∈ π_{a1,...,ak}(R) }    (2)
All: The final partitioning scheme does not partition its input relation at all but
simply declares the whole relation a single partition. All is needed because some
algorithms do not lend themselves to the form of data parallelism promoted by
partitionings. This is especially true for all kinds of aggregations that fold a
complete relation into a single tuple or value.</p>
        <p>All(R) = { R }    (3)</p>
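        <p>The three schemes can be modeled directly over plain Scala collections.
The following sketch mirrors equations 1 to 3, with Set[Map[String, Any]]
standing in for an ERIS relation; it is illustrative only and not the engine's
actual representation.</p>
        <p>
```scala
// Executable model of the three partitioning schemes from equations 1-3.
type Tuple = Map[String, Any]
type Relation = Set[Tuple]

// Equation 1: every tuple becomes its own partition.
def individuals(r: Relation): Set[Relation] = r.map(t => Set(t))

// Equation 2: tuples agreeing on the given attributes share a partition.
def equivalents(r: Relation, attrs: Seq[String]): Set[Relation] =
  r.groupBy(t => attrs.map(t)).values.toSet

// Equation 3: the whole relation is one single partition.
def all(r: Relation): Set[Relation] = Set(r)

val students: Relation = Set(
  Map("name" -> "Ann", "year" -> 1),
  Map("name" -> "Bob", "year" -> 1),
  Map("name" -> "Eve", "year" -> 2))
```
        </p>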
      </sec>
      <sec id="sec-1-3">
        <title>Processing Functions for Partitioning Schemes</title>
        <p>Once a relation is partitioned using one of our three partitioning schemes, the
individual parts can be used for processing. The processing of tuples has to be
encapsulated in a pure function, the so-called user function, which takes one
or multiple partitions as parameters and returns a single partition. The
framework implements this concept by embedding tuple-processing capabilities in
process(P1, ..., Pn, fn_n), a second-order function which is parametrized with a
list of input partitionings and a first-order user function. The arity of the user
function always has to match the number of partitionings supplied to process.</p>
        <p>In the simple case, process(P, fn_1) is applied to a single partitioning P and a
user function fn_1 with arity 1. In that case, process applies fn_1 to each subset
of the partitioning P, creates a new relation by unioning the results of each
application of fn_1, and finally returns the new relation as the overall result of
the second-order function.</p>
        <p>process(P1, ..., Pn, fn_n) = ⋃_{p1 ∈ P1} ... ⋃_{pn ∈ Pn} fn_n(p1, ..., pn)    (4)</p>
        <p>Equation 4 shows the generalization of process to multiple input
partitionings. In this case, process(P1, ..., Pn, fn_n) applies fn_n to all possible
combinations of subsets of the input partitionings. Again, the results of all
applications are unioned into a new relation, which is the final result of
process(P1, ..., Pn, fn_n). Independent of the number of partitionings, fn_n has
to map its inputs to a single output schema, or it will break the unioning of the
output partitions. In this way, process(P1, ..., Pn, fn_n) completes the story of
data parallelism in ERIS by enabling the parallel processing of multiple elements
of one or more partitionings. To demonstrate the adequacy of the presented
framework, we use it to outline an implementation of the relational algebra on
ERIS; our language approach is based on Scala.</p>
        <p>Listing 1.1. Function for computing the relational projection
def projection(r: Relation, attrs: Seq[Attribute]): Relation = {
  val d = process(r.individuals) { partition =&gt;
    val tuples = for (tuple &lt;- partition) yield {
      new Tuple(attrs.map(a =&gt; tuple(a)))
    }
    new Partition(tuples)
  }
  process(d.equivalents(attrs)) { partition =&gt;
    new Partition(partition.randomTuple())
  }
}</p>
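        <p>A minimal executable model of equation 4 for arity 1, again with plain
Scala sets standing in for partitions; process1 and individuals are illustrative
names, not the framework's API.</p>
        <p>
```scala
// Model of process (equation 4) for a single input partitioning: apply the
// user function to every subset and union the resulting partitions.
type Tuple = Map[String, Any]
type Relation = Set[Tuple]

def individuals(r: Relation): Set[Relation] = r.map(t => Set(t))

def process1(p: Set[Relation], fn: Relation => Relation): Relation =
  p.flatMap(fn)

val r: Relation = Set(Map("x" -> 1), Map("x" -> 2), Map("x" -> 3))

// A per-tuple user function: every singleton partition is mapped independently.
val squared = process1(individuals(r), p =>
  p.map(t => Map("x" -> { val v = t("x").asInstanceOf[Int]; v * v })))
```
        </p>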
        <p>Projection: P = π_{a1,...,ak}(R). The relational projection requires a two-step
approach: first, unwanted attributes are removed from the individual tuples, and
then duplicates are removed from the resulting relation.</p>
        <p>Listing 1.1 shows the Scala code for that operation. The first process is
applied to the individuals of the target relation R. The user function simply
iterates over the partition, extracts the relevant attributes from each tuple, and
uses those attributes to create a new tuple. Finally, all tuples are combined into
a new partition, and that partition is returned as the result of the user function.
It has to be noted that, because the partition only ever contains a single tuple
in this case, the more convenient parameter to the user function would be a
simple tuple instead of a partition. For the sake of generality, however, we avoid
the introduction of special cases at this point.</p>
        <p>The second process application is necessary to eliminate duplicates, which can
be introduced when attributes are removed from relations. Duplicate elimination
is not an automatic feature of the underlying relation data structure, as it is
mapped onto ERIS containers which do not exhibit this feature. The idea is to
perform an equivalents partitioning on all attributes of the relation. In this
setup, each partition can only hold either a single tuple or multiple duplicated
tuples, as all tuples in the partition have to be equal in all attributes. Most of
the work of removing duplicates is of course done by the partitioning function
itself. The user function simply selects a random element of the input partition
and creates a new partition with that one element as its only content.</p>
        <p>Listing 1.2. Function for computing the cartesian product</p>
        <p>Selection: S = σ_P(R). The selection can be performed on a per-tuple basis,
and therefore a single processing of individuals is sufficient. The user function
tests for each tuple whether it satisfies the predicate P and creates the output
partition with all tuples that have passed the test.</p>
        <p>Cartesian Product: P = R × S. Due to process' support for multiple
input partitionings, the implementation of the cartesian product in listing 1.2 is
straightforward. process is applied to the individuals partitionings of the
relations R and S. Therefore, the user function is applied once to every possible
pair of input tuples and merely has to combine the attributes of these tuples
into a single output tuple and pack that tuple into a new partition.
Union: U = R ∪ S. Set union is a typical example of an operation that is hard
to parallelize with the partitioning approach, as there is minimal independent
per-element computation. Listing 1.3 shows a simple solution that relies on the
all partitioning and simply concatenates all tuples of R with all tuples of S. As
with the relational projection, this process can introduce duplicate entries, which
have to be removed in a second step.</p>
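        <p>Since process supports multiple input partitionings, the cartesian product
reduces to merging attribute maps, as the following sketch shows; plain Scala
sets model partitions, and all names are illustrative.</p>
        <p>
```scala
// Sketch of the cartesian product via two individuals partitionings: the user
// function sees one tuple of R and one of S per invocation and merges them.
type Tuple = Map[String, Any]
type Relation = Set[Tuple]

def individuals(r: Relation): Set[Relation] = r.map(t => Set(t))

// process for arity 2: the user function runs on every pair of subsets.
def process2(p1: Set[Relation], p2: Set[Relation],
             fn: (Relation, Relation) => Relation): Relation =
  p1.flatMap(s1 => p2.flatMap(s2 => fn(s1, s2)))

val r: Relation = Set(Map("a" -> 1), Map("a" -> 2))
val s: Relation = Set(Map("b" -> 10))

// Cartesian product: merge the attributes of each pair of input tuples.
val crossed = process2(individuals(r), individuals(s),
  (p1, p2) => Set(p1.head ++ p2.head))
```
        </p>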
        <p>Difference: D = R \ S. The set difference operator can be implemented as
multiple parallel set inclusion tests using an individuals and an all partitioning.
In listing 1.4, each tuple of R is combined with all tuples of S, and the user
function performs a single set inclusion test to check whether the tuple has to
be included in the output relation.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Execution and Complex Query Example</title>
      <p>
        Our ERIS programming framework follows recent proposals [
        <xref ref-type="bibr" rid="ref4 ref6">4, 6</xref>
        ] to combine best
performance with high programmer productivity by relying on domain-specific
code generation. As already shown in the previous section, users of our
framework write their programs in Scala and rely on a set of framework operations
to interact with ERIS' storage and processing resources. The framework's
operations are our partitioning schemes. In Section 4.1, we introduce our execution
approach, followed in Section 4.2 by a short example of a complex query.
      </p>
      <sec id="sec-2-1">
        <title>Execution</title>
        <p>The overall compilation of programs written in Scala and using our ERIS
framework is driven by the translation of partitionings and the
process(P1, ..., Pn, fn_n) function, as these constructs are directly mapped onto
invocations of ERIS' native storage operations like ScanTable, which scans a
table. The ScanTable operation is responsible for all reading activities in ERIS.
It is parameterized with a single data container and a callback function that is
called whenever the scan has collected a chunk of records from the target
container. The callback processes the records and pushes the resulting data to a
new container using an insert operation. ScanTable invokes the callback with
chunks of records until all records of the target container have been processed.</p>
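        <p>The chunked callback pattern can be sketched as follows; scanTable is a
simplified stand-in for ERIS' native operation, with an in-memory Seq playing
the role of a data container.</p>
        <p>
```scala
// Sketch of the ScanTable pattern: the engine hands the callback chunks of
// records until the container is exhausted; the callback pushes its results
// into an output container. All names are illustrative.
import scala.collection.mutable.ArrayBuffer

def scanTable[A, B](container: Seq[A], chunkSize: Int)(callback: Seq[A] => Seq[B]): Seq[B] = {
  val out = ArrayBuffer[B]()
  container.grouped(chunkSize).foreach { chunk =>
    out ++= callback(chunk) // "insert" into the output container
  }
  out.toSeq
}

val input = 1 to 10
val doubled = scanTable(input, 4)(chunk => chunk.map(_ * 2))
```
        </p>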
        <p>A single execution of ScanTable in its basic form is sufficient to translate
a process over a single individuals partitioning. The ScanTable's callback is
created with a loop that iterates over the individual records of the input chunk.
The body of the loop contains the translation of the user function, which is
thereby applied to each individual record. At the end of each loop iteration, an
insert message materializes the result of the loop body into an output container.</p>
        <p>The translation of a process over an equivalents or an all partitioning of
a single relation is of a similar character. ScanTable can be parametrized to
call its callback either with all records that fulfill certain requirements or simply
with all records of a container. These parametrization options exactly match
the semantics of the equivalents and all partitionings, so the translation of both
partitionings is a straightforward generation of ScanTable parameters. Requiring
the grouping of certain or all records of a container has, of course, a negative
impact on the locality of data access and should be used as sparingly as possible.
Because they are already called with the correct set of records, the callback
functions for equivalents and all do not need an inner loop to iterate over the
records. They simply contain the cross-compiled code of the respective user
function, followed by an insert message that materializes the output tuples of
the user function.</p>
        <p>Listing 1.3. Function for computing the set union
def union(r: Relation, s: Relation): Relation = {
  val d = process(r.all, s.all) { (p1, p2) =&gt;
    new Partition(p1.tuples ++ p2.tuples)
  }
  process(d.equivalents(r.attributes ++ s.attributes)) { partition =&gt;
    new Partition(partition.randomTuple())
  }
}</p>
        <p>Listing 1.4. Function for computing the set difference
def difference(r: Relation, s: Relation): Relation =
  process(r.individuals, s.all) { (p1, p2) =&gt;
    val tuples = for (tuple &lt;- p1 if !p2.contains(tuple)) yield {
      tuple
    }
    new Partition(tuples)
  }</p>
        <p>The code for a multi-relation process is a little more involved, because a
ScanTable will only ever touch one single ERIS container. Combining data
from containers C1, ..., Cn requires n ScanTable runs, one on each of the
containers. As an example, we examine a process over three individuals
partitionings P1, P2, P3. Processing starts with a ScanTable on P1 with a
callback that iterates over the individual records. For each record, the callback
requests a new ScanTable over P2 and sends the current record as additional
data to the callback of the new scan. Each callback over P2 in turn iterates
over the individual records of P2, requests a third ScanTable on P3, and sends
the records from P1 and P2 as additional data to the callback of the scan over
P3. The callback of the final scan iterates once again over the records of its
container, combines each one with the other two records, and hands them to the
code of the user function in its iterator loop. Similar to the earlier examples,
the result records produced by the user code have to be inserted into a new
container at the end of the loop. The algorithm for three individuals partitionings
is easily generalized to other types of partitionings and relation counts, so we
abstain from a more detailed description at this point.</p>
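        <p>The nested-scan translation described above generalizes to n containers;
the following illustrative Scala models it with recursion over a list of
in-memory containers, with nestedScans standing in for the generated scan code.</p>
        <p>
```scala
// Sketch of the nested-scan translation: each record of the outer scan
// triggers a scan of the next container, carrying the already-bound records
// along, so the innermost callback sees complete record combinations.
def nestedScans[A](containers: List[Seq[A]])(userFn: List[A] => Unit): Unit = {
  def scan(remaining: List[Seq[A]], bound: List[A]): Unit = remaining match {
    case Nil => userFn(bound.reverse) // all records bound: run the user code
    case c :: rest => c.foreach(rec => scan(rest, rec :: bound))
  }
  scan(containers, Nil)
}

var combos = List.empty[List[Int]]
nestedScans(List(Seq(1, 2), Seq(10), Seq(100, 200))) { combo =>
  combos = combo :: combos
}
```
        </p>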
      </sec>
      <sec id="sec-2-2">
        <title>Complex Query Example</title>
        <p>We finish our discussion of the ERIS programming framework with a
high-level example that combines multiple relational operators to implement the
simple SQL query depicted in listing 1.5.</p>
        <p>Listing 1.5. A simple SQL example
SELECT student.name, grade.course, grade.grade
FROM student, grade
WHERE student.id = grade.studentID AND student.age &gt; 30;</p>
        <p>Listing 1.6 shows a straightforward translation from SQL to relational
operators. Each operator is annotated with the process calls it requires and with
the number of ScanTable operations necessary to execute these process calls in
ERIS. In total, the direct translation of the SQL statement requires 5 ScanTable
operations: 2 for the cross, 1 for the select, and 2 more for the projection.</p>
        <p>Listing 1.6. Naive SQL translation. 5 x ScanTable
// 2 x ScanTable
// c = process(students.individuals, grades.individuals)
val c = cross(students, grades)
// s = process(c.individuals), 1 x ScanTable
val s = select(c, { t: Tuple =&gt;
  t("student.id") == t("grade.studentID") &amp;&amp; t("student.age") &gt; 30 })
// nu = process(s.individuals), 1 x ScanTable
// result = process(nu.equivalents(...)), 1 x ScanTable
val attrs = List("student.name", "grade.course", "grade.grade")
val result = project(s, attrs)</p>
        <p>Listing 1.7. Optimized SQL translation. 3 x ScanTable</p>
        <p>
          The straightforward translation provides a good opportunity for an
important domain-specific optimization. The first process joins the students and grades
tuples, and the two following process invocations process the individuals
partitioning of the crossed relations. Since the individual tuples of the crossed relations
are already materialized in the first process invocation, we are able to
append the bodies of the second and third process to the body of the first
process and do all the work in a single process invocation. This optimization is a good
example of what becomes possible with a compiler that can be extended with domain-specific
knowledge. Listing 1.7 shows an outline of the optimized code, where
the cross, the select, and the first part of the project operations are fused into a
single process invocation. The optimized version of the code reduces the number
of required ScanTable operations to three and avoids two costly materializations
of intermediate data containers.
        </p>
        <p>
          In this paper, we have presented our programmability idea for an in-memory
storage engine which is based on a data-oriented architecture. Instead of relying
on hardcoded physical operators, our approach introduces data partitionings
as first-class citizens on the programming layer in the form of second-order functions, which
can be parameterized with any kind of first-order function. Fundamentally, this
approach is similar to the PACT programming model [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], although our concept
is not limited to the key-value data format. Based on our overall idea, we want
to extend our work in different directions. First, we want to build a more user-friendly
domain-specific language (DSL) based on our partitioning schemes for
SQL. Second, we are going to map different languages like SQL or Pig to our
DSL to support these higher-level languages in an efficient way. Third, we will
investigate further application domains to find additional specific partitioning
schemes.
        </p>
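        <p>
          To make the effect of the fusion optimization concrete, the following sketch contrasts the naive three-pass pipeline of Listing 1.6 with a fused single-pass variant, using plain Scala collections; the relation contents and the cross/select/project stand-ins are illustrative assumptions, not the actual ERIS API.
        </p>

```scala
// Tuples are modeled as string-keyed maps; the example relations below are
// hypothetical and only serve to demonstrate the rewrite.
type Tuple = Map[String, Any]

val students: Seq[Tuple] = Seq(
  Map("student.id" -> 1, "student.name" -> "Ada", "student.age" -> 35),
  Map("student.id" -> 2, "student.name" -> "Bob", "student.age" -> 22))
val grades: Seq[Tuple] = Seq(
  Map("grade.studentID" -> 1, "grade.course" -> "DB", "grade.grade" -> 1.3))

val attrs = Seq("student.name", "grade.course", "grade.grade")

// Naive pipeline: three separate passes with two materialized intermediates,
// mirroring the cross, select, and project invocations of Listing 1.6.
val c     = for (s <- students; g <- grades) yield s ++ g        // cross
val sel   = c.filter(t => t("student.id") == t("grade.studentID")
              && t("student.age").asInstanceOf[Int] > 30)        // select
val naive = sel.map(t => attrs.map(a => a -> t(a)).toMap)        // project

// Fused pipeline: the select predicate and the projection are appended to the
// body of the cross, so each crossed tuple is processed exactly once.
val fused = for {
  s <- students; g <- grades
  t = s ++ g
  if t("student.id") == t("grade.studentID") &&
     t("student.age").asInstanceOf[Int] > 30
} yield attrs.map(a => a -> t(a)).toMap

assert(naive == fused)
```

        <p>
          Both variants produce the same result, but the fused form never materializes the crossed or selected intermediates as separate containers.
        </p>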
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Abadi</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Agrawal</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ailamaki</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Balazinska</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bernstein</surname>
            ,
            <given-names>P.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carey</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chaudhuri</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Doan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Franklin</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gehrke</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haas</surname>
            ,
            <given-names>L.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Halevy</surname>
            ,
            <given-names>A.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hellerstein</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ioannidis</surname>
            ,
            <given-names>Y.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jagadish</surname>
            ,
            <given-names>H.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kossmann</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Madden</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mehrotra</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Milo</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Naughton</surname>
            ,
            <given-names>J.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ramakrishnan</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Markl</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Olston</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ooi</surname>
            ,
            <given-names>B.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Re</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suciu</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stonebraker</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Walter</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Widom</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>The Beckman report on database research</article-title>
          .
          <source>SIGMOD Rec.</source>
          <volume>43</volume>
          (
          <issue>3</issue>
          ),
          <fpage>61</fpage>
          &#8211;
          <lpage>70</lpage>
          (Dec
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Agrawal</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ailamaki</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bernstein</surname>
            ,
            <given-names>P.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brewer</surname>
            ,
            <given-names>E.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carey</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chaudhuri</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Doan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Florescu</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Franklin</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garcia-Molina</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gehrke</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gruenwald</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haas</surname>
            ,
            <given-names>L.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Halevy</surname>
            ,
            <given-names>A.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hellerstein</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ioannidis</surname>
            ,
            <given-names>Y.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Korth</surname>
            ,
            <given-names>H.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kossmann</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Madden</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Magoulas</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ooi</surname>
            ,
            <given-names>B.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>O'Reilly</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ramakrishnan</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sarawagi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stonebraker</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Szalay</surname>
            ,
            <given-names>A.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weikum</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>The claremont report on database research</article-title>
          .
          <source>SIGMOD Rec</source>
          .
          <volume>37</volume>
          (
          <issue>3</issue>
          ),
          <fpage>9</fpage>
          &#8211;
          <lpage>19</lpage>
          (Sep
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Alexandrov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Battre</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ewen</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heimel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hueske</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kao</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Markl</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nijkamp</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Warneke</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Massively parallel data analysis with pacts on nephele</article-title>
          .
          <source>PVLDB</source>
          <volume>3</volume>
          (
          <issue>2</issue>
          ),
          <fpage>1625</fpage>
          &#8211;
          <lpage>1628</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Brown</surname>
            ,
            <given-names>K.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sujeeth</surname>
            ,
            <given-names>A.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rompf</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cha</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Odersky</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Olukotun</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>A heterogeneous parallel framework for domain-specific languages</article-title>
          .
          <source>Parallel Architectures and Compilation Techniques (PACT), 2011 International Conference on</source>
          . pp.
          <fpage>89</fpage>
          &#8211;
          <lpage>100</lpage>
          . IEEE (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Kissinger</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kiefer</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schlegel</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Habich</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Molka</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehner</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>ERIS: A NUMA-aware in-memory storage engine for analytical workload</article-title>
          .
          <source>In: International Workshop on Accelerating Data Management Systems Using Modern Processor and Storage Architectures</source>
          . pp.
          <fpage>74</fpage>
          &#8211;
          <lpage>85</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Klonatos</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koch</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rompf</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cha</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Building efficient query engines in a high-level language</article-title>
          .
          <source>Proceedings of the VLDB Endowment</source>
          <volume>7</volume>
          (
          <issue>10</issue>
          ),
          <fpage>853</fpage>
          &#8211;
          <lpage>864</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Pandis</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Johnson</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hardavellas</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ailamaki</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Data-oriented transaction execution</article-title>
          .
          <source>PVLDB</source>
          <volume>3</volume>
          (
          <issue>1-2</issue>
          ),
          <fpage>928</fpage>
          &#8211;
          <lpage>939</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>