<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Adaptive Data Processing in Heterogeneous Hardware Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Bala Gurumurthy</string-name>
          <email>bala.gurumurthy@ovgu.de</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tobias Drewes</string-name>
          <email>tobias.drewes@ovgu.de</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Broneske</string-name>
          <email>david.broneske@ovgu.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gunter Saake</string-name>
          <email>gunter.saake@ovgu.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thilo Pionteck</string-name>
          <email>thilo.pionteck@ovgu.de</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Otto-von-Guericke-Universität</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Otto-von-Guericke-Universität</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Otto-von-Guericke-Universität</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Otto-von-Guericke-Universität</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <fpage>2</fpage>
      <lpage>7</lpage>
      <abstract>
        <p>In recent years, Database Management Systems (DBMSs) have seen advancements in two major areas, namely functional and hardware support. Previously, a traditional DBMS was sufficient for performing a given operation, whereas a current DBMS is required to perform complex analytical tasks like graph analysis or OLAP. These operations require additional functions to be added to the system. A similar evolution is seen in the underlying hardware. These advancements in both the functional and the hardware domain of DBMSs require modifications of the overall architecture. Hence, an adaptable DBMS is necessary to support this highly volatile environment. In this work, we list the challenges for an adaptable DBMS and propose a conceptual model for such a system that provides interfaces to easily adapt to software and hardware changes.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        In recent years, a traditional Database Management
System (DBMS) is also required to perform various complex
operations that it does not directly support. To
perform these special analytical operations, several tailor-made
functions are integrated into the DBMS [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Further, the
efficiency of a DBMS depends on the throughput of the
underlying hardware, so DBMS operations are ported to different
specialized hardware to achieve better throughput. This
heterogeneity of functionalities and hardware systems
requires modifications of the existing DBMS structure,
incurring additional complexities and challenges.
      </p>
      <p>
        In the current context, various analytical tasks such as
graph processing and data mining are executed in DBMSs [
        <xref ref-type="bibr" rid="ref8 ref9">8,
9</xref>
        ]. These functionalities are ported to a DBMS with the
overhead of altering the overall architecture of the system.
However, modifying the complete system structure is
time consuming, and not all components are tuned for
efficiency [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>
        Further, the throughput of any software system depends
on its underlying hardware. Many specialized compute
devices are fabricated to perform certain specific tasks
efficiently. These devices trade generality for higher efficiency in
certain tasks. They are used as additional
co-processors along with the CPU for better throughput. One
of the most commonly used co-processors is the GPU, which is
mainly used for enhancing graphical processing in a
system. The parallelism of GPUs has already been exploited
extensively for several DBMS operations [
        <xref ref-type="bibr" rid="ref1 ref3">3, 1</xref>
        ]. Similarly,
other devices such as MICs (Many Integrated
Cores), APUs (Accelerated Processing Units), and FPGAs (Field
Programmable Gate Arrays) could be exploited for
efficiently executing DBMS operations. The major challenge
in integrating this hardware is the execution of the device-specific
variant of a DBMS operation that is optimized for
the given hardware.
      </p>
      <p>Thus, the availability of different hardware enables a new
level of parallelism that we call cross-device parallelism. At
this level, a user-given query is executed concurrently on
different devices. Along with
cross-device parallelism, we also have the traditional pipeline and
data-parallel execution of functions to increase efficiency.
These dimensions of parallelism incur the additional
complexity of efficiently traversing them to determine the optimal
execution path for a given query.</p>
      <p>Along with optimization, the heterogeneity of hardware
requires concepts for reducing the data transfer cost among
different devices. In a main-memory DBMS, the transfer
bottleneck exists between main memory and the
co-processing devices. Hence, it is also crucial to minimize
the data transfer time to improve the efficiency of DBMS
processing.</p>
      <p>Hence, the heterogeneity of functions and hardware poses
multiple challenges and requires a system that
adapts to these changes. In this work, we provide our
insights on the challenges in developing an adaptive
database system and on techniques for overcoming these
challenges. As more functionalities and
hardware may become available in the future for integration into DBMSs, we
focus on a plug'n'play architecture that enables the addition of
new functions and hardware with considerably less
overhead than upgrading the complete architecture. This
architecture provides interfaces for integrating different
functionalities and hardware into a DBMS with little effort.</p>
      <sec id="sec-1-1">
        <title>The main contributions of our work are:</title>
        <p>(1) the existing challenges for an adaptive DBMS in the
context of hardware and software heterogeneity, and
(2) the concepts for developing an adaptable DBMS with
plug'n'play capabilities.</p>
        <p>The rest of the paper is structured as follows. In
Section 2, we provide an overview of the different devices used
for DBMSs and list the challenges in using them. Then,
in Section 3, we discuss the various challenges that arise
from functional and hardware heterogeneity in DBMSs, and
in Section 4, we present our concepts for developing an
adaptable DBMS that addresses these challenges. The
concepts discussed in this paper have already been explored in
different works, which we detail in Section 7. Finally,
we summarize the paper in Section 8.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. DBMS IN HETEROGENEOUS HARDWARE ENVIRONMENT</title>
    </sec>
    <sec id="sec-3">
      <p>
        Relying on CPUs as the workhorse is approaching the
limits of their efficiency [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Multiple works have ported
existing database operations to different
hardware. We discuss these devices and their DBMS
support below.
      </p>
      <p>
        GPU: Although a single GPU core has a lower clock frequency
than a CPU core, a GPU features several hundreds of
cores, compared to the several tens of cores that current CPUs
offer. However, memory accesses in particular have high
latency that needs to be hidden by processing. To this end,
GPUs spawn multiple threads for a given function and
switch contexts to hide the latency. This massive
parallelism makes GPUs useful for data-intensive DBMS
operations. Examples of DBMSs using GPUs are CoGaDB and
GPUDB [
        <xref ref-type="bibr" rid="ref3 ref7">3, 7</xref>
        ].
      </p>
      <p>The major open challenges in using GPUs for DBMSs are
(1) a cost model for determining at runtime which operators to
execute on the GPU, and (2) combined query compilation and execution
strategies for CPU and GPU.</p>
    </sec>
    <sec id="sec-4">
      <title>FPGA</title>
      <p>Another type of hardware that has gained much attention in recent
years is the FPGA (Field Programmable Gate Array). FPGAs
are programmed either using RTL (Register Transfer Level)
languages (VHDL, Verilog) or via HLS (High-Level
Synthesis), where the circuits are extracted, for example, from C or
OpenCL code. This provides a platform that can be closely tailored
to any given domain-specific operation,
providing higher throughput.</p>
      <p>The open challenges in using FPGAs are (1) the selection and
placement of operators for partially reconfigurable implementations,
and (2) efficient pipelining between different operations at
runtime.</p>
      <p>Other hardware used for DBMSs includes MICs (Many
Integrated Cores) and APUs (Accelerated Processing Units).
A MIC provides multiple CPU cores for
processing that are connected with each other by an on-chip bus
system. These cores are capable of performing complex
computations. APUs, in contrast, combine a CPU and a GPU on
a single die, and both have access to
the same memory space (i.e., main memory).</p>
    </sec>
    <sec id="sec-5">
      <title>3. CHALLENGES IN HETEROGENEOUS ENVIRONMENTS</title>
    </sec>
    <sec id="sec-6">
      <p>To have a DBMS that adapts to both changing hardware
and software, the following challenges have to be addressed.</p>
    </sec>
    <sec id="sec-7">
      <title>Device Features</title>
      <p>Adding DBMS operations to a new processing device
requires novel ways to exploit the device without
compromising the overall system design. Hence, one of the major
challenges is to reorganize the processing functions based on the
available hardware features and to adapt the
underlying functions for efficient execution on the device.</p>
      <p>
        It has been shown that speedups for a particular database
operation can be achieved by device-specific
parameter tuning of the operation [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Removing these
device-specific parameters aids adaptability but makes it
hard to tune for optimal execution, as each device has its
own advantages. Due to this tension between abstraction and
specialization of functions and devices, we need to
find a good abstraction level for the operations that
both provides an interface for writing new functions and
exploits the hardware for optimal efficiency.
      </p>
    </sec>
    <sec id="sec-8">
      <title>Parallelism Complexity</title>
      <p>The growth of DBMSs at both the functional and the hardware
level provides various parallelization opportunities.
The presence of multiple devices creates an additional paradigm:
cross-device parallelization. With this type of parallelism,
the given query is divided into granular parts based on the
selected level of abstraction, and these functional primitives
are distributed among the different processing devices for
parallel processing. We detail the different types of
parallelization below.</p>
      <sec id="sec-8-1">
        <title>Functional Parallelism</title>
        <p>In many cases, incoming queries have various
sub-operations that run independently of each other. One
common example is multiple selection
predicates combined using logical operations. These predicates
can be executed in parallel on different devices, and
the results are combined in subsequent steps. Thus,
identifying and dissecting these
parallel operations provides additional capabilities for
simultaneous execution in the form of functional parallelism. The
major challenge in this type of parallelism is the intermediate step
of materializing the results to be processed by the next
operator in the pipeline. There is also a synchronization
overhead due to the differences in
execution time among the processing devices.</p>
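        <p>A minimal Python sketch of functional parallelism, assuming that each independent selection predicate is one unit of work: a thread pool stands in for the different devices (Python threads give no real speedup because of the global interpreter lock, so only the structure is illustrated), each worker produces a bitmap packed into a Python int, and the bitmaps are combined with a logical AND once every worker has finished. The function names are hypothetical and not part of the authors' system.</p>
        <preformat>
```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
import operator


def select_bitmap(column, predicate):
    """Selection primitive: bit i of the result is set if column[i] satisfies predicate."""
    return sum(2 ** i for i, value in enumerate(column) if predicate(value))


def functional_parallel_select(columns, predicates, workers=2):
    """Run independent predicates concurrently and AND their bitmaps together."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # One task per (column, predicate) pair; the pool plays the role of the devices.
        futures = [pool.submit(select_bitmap, col, pred)
                   for col, pred in zip(columns, predicates)]
        # Synchronization point: combining has to wait for the slowest worker.
        bitmaps = [f.result() for f in futures]
    return reduce(operator.and_, bitmaps)


if __name__ == "__main__":
    l_quantity = [10, 30, 5, 20]
    l_discount = [0.05, 0.06, 0.02, 0.07]
    bits = functional_parallel_select(
        [l_quantity, l_discount],
        [lambda q: 24 > q, lambda d: d >= 0.05],
    )
    print(bin(bits))  # prints 0b1001: rows 0 and 3 satisfy both predicates
```
        </preformat>
        <p>The final combination step is where the synchronization overhead described above shows up: it cannot start before the slowest predicate has produced its bitmap.</p>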
      </sec>
      <sec id="sec-8-2">
        <title>Data Parallelism</title>
        <p>In contrast to functional parallelism, data parallelism does
not split an operation into different functions but
executes the same operation on different partitions of the data
concurrently. This method has a similar synchronization
overhead of waiting for all devices to finish processing.
Its major disadvantage is the additional
step of merging the results from the different devices.</p>
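        <p>The following Python sketch illustrates data parallelism under simple assumptions: the input is split into contiguous partitions of near-equal size, the same operation (here a sum) runs on every partition in a thread pool that stands in for the devices, and the partial results are merged in a separate step. The helper names are hypothetical.</p>
        <preformat>
```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, parts):
    """Split data into `parts` contiguous chunks whose sizes differ by at most one."""
    size, rest = divmod(len(data), parts)
    chunks, start = [], 0
    for p in range(parts):
        end = start + size + (1 if rest > p else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def data_parallel_sum(data, parts=3):
    """Apply the same operation to every partition, then merge the partial results."""
    chunks = partition(data, parts)
    with ThreadPoolExecutor(max_workers=parts) as pool:
        partials = list(pool.map(sum, chunks))  # same operation, different data
    total = sum(partials)                       # the additional merge step
    return total, partials


if __name__ == "__main__":
    print(data_parallel_sum(list(range(10)), parts=3))  # prints (45, [6, 15, 24])
```
        </preformat>
        <p>For a sum, merging is just another sum; for operators such as sorting or grouping, the merge step is considerably more expensive.</p>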
      </sec>
      <sec id="sec-8-3">
        <title>Cross-Device Parallelism</title>
        <p>The functional and data-level parallelism described above
are decided after the processing devices have been selected. As
mentioned earlier, each device has its own strengths, which
must be exploited to the maximum extent. Hence, it is
necessary to decide on implementation details for the given
device that exploit the hardware for efficient execution.
Moreover, the above parallelization strategies can also
be realized at the device level. For device-level
functional parallelism, this could mean multiple operators running in
parallel on different devices or in a pipeline with
communication between the devices. Similarly, data parallelism could
also be realized via suitable cost functions for operations on
devices.</p>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>Optimization Strategies</title>
      <p>The different levels of parallelism for executing a query
provide additional opportunities for fine-tuning the
operations but introduce the complexity of selecting the optimal execution
path. As the decision at the top level of parallelism influences the
subsequent levels, selecting the right execution path for a
given query is critical. However, an important drawback of
this multi-level parallelism model is the explosion of the search
space. Various options are available at each level,
resulting in many combinations in total.
This search space has to be traversed to
find the optimal execution path. Deciding the optimal path
for a single operation in a query can already be complex (e.g., join
order optimization), and the new dimension
of multiple devices increases the complexity further. Hence,
new methods for exploring the various optimization
opportunities are needed.</p>
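      <p>To make the search space explosion concrete, the sketch below enumerates only one dimension, device placement: a chain of k primitives on a set D of devices already has |D|^k candidate placements, before granularity and parallelism choices multiply this further. All runtimes and the transfer penalty are hypothetical numbers chosen for illustration, and exhaustive enumeration is only feasible here because the example is tiny (2^5 = 32 plans).</p>
      <preformat>
```python
from itertools import product

# Hypothetical per-primitive runtimes in ms on each device (illustrative numbers only).
COST = {
    "select":      {"cpu": 4.0, "gpu": 1.0},
    "logical":     {"cpu": 1.0, "gpu": 0.5},
    "materialize": {"cpu": 3.0, "gpu": 2.0},
    "arithmetic":  {"cpu": 2.0, "gpu": 0.5},
    "reduce":      {"cpu": 1.0, "gpu": 1.5},
}
TRANSFER_MS = 2.0  # hypothetical penalty whenever consecutive primitives switch device


def plan_cost(primitives, devices):
    """Sum of primitive runtimes plus a transfer penalty per device switch."""
    compute = sum(COST[p][d] for p, d in zip(primitives, devices))
    switches = sum(1 for a, b in zip(devices, devices[1:]) if a != b)
    return compute + TRANSFER_MS * switches


def exhaustive_search(primitives, device_set=("cpu", "gpu")):
    """Enumerate all len(device_set) ** len(primitives) placements; keep the cheapest."""
    plans = list(product(device_set, repeat=len(primitives)))
    best = min(plans, key=lambda devices: plan_cost(primitives, devices))
    return len(plans), best, plan_cost(primitives, best)


if __name__ == "__main__":
    count, best, cost = exhaustive_search(list(COST))
    print(count, best, cost)
```
      </preformat>
      <p>With these numbers, the all-GPU placement wins at 5.5 ms, even though running the reduction on the CPU would be faster in isolation, because the extra device switch costs more than it saves.</p>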
    </sec>
    <sec id="sec-10">
      <title>4. ADAPTABLE DBMS</title>
      <p>The challenges mentioned above require a DBMS architecture
that efficiently handles the diversity in both functionality
and hardware. Based on these challenges, we have identified areas
to be explored for designing an adaptive DBMS.</p>
      <p>For a better explanation of the challenges, we use
TPC-H Query 6 as our motivating example. The query selects data
from multiple columns, multiplies the selected values,
and outputs their aggregate. These three operations are in
turn executed using multiple granular primitive functions.
The different primitives used for processing the query
are:</p>
      <p>Selection: the selection primitive selects values from the given
column. Bitmaps are used as the output format to reduce
the data transfer size, as each bit carries the selection
information of a single value.</p>
      <p>Logical operation: the logical operation primitives perform logical
functions on the bitmaps produced by the different
selections.</p>
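      <p>A small Python sketch of these two primitives, assuming a least-significant-bit-first layout with eight values packed per byte (the paper does not fix a layout): selection emits one bit per input value, and the logical primitive combines two bitmaps byte by byte. The function names are hypothetical.</p>
      <preformat>
```python
import operator


def selection(column, predicate):
    """Selection primitive: one bit per input value, packed 8 values per byte.

    Bit i % 8 of byte i // 8 is set when column[i] satisfies the predicate
    (least-significant-bit-first layout, an assumption of this sketch).
    """
    bitmap = bytearray((len(column) + 7) // 8)
    for i, value in enumerate(column):
        if predicate(value):
            bitmap[i // 8] |= 2 ** (i % 8)
    return bitmap


def logical(op, left, right):
    """Logical primitive: combine two bitmaps byte by byte with a bitwise operator."""
    return bytearray(op(a, b) for a, b in zip(left, right))


def is_set(bitmap, i):
    """Read back the selection flag of row i."""
    return (bitmap[i // 8] >> (i % 8)) % 2 == 1


if __name__ == "__main__":
    values = [3, 9, 12, 1, 7, 15, 20, 2, 11, 5]
    small = selection(values, lambda v: 10 > v)
    odd = selection(values, lambda v: v % 2 == 1)
    both = logical(operator.and_, small, odd)
    print([i for i in range(len(values)) if is_set(both, i)])  # prints [0, 1, 3, 4, 9]
```
      </preformat>
      <p>Ten selection flags fit into two bytes here, whereas the selected values themselves would need ten entries; this is the transfer-size argument made above.</p>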
      <p>One of the main challenges in the proposed adaptable
system is the level of granularity required for optimized
processing. Based on the capabilities of the devices, we could either
run a few complex operations or split them into more
granular sub-operations and execute those in parallel
on multiple devices.</p>
      <p>
        At the top level, each database operation acts as a
set of primitives connected together to provide a final
result. The more finely a function is split, the more
hardware sensitivity comes into play. For example, CPUs and GPUs
require different access patterns for efficient
processing. Further, database operations are data-centric,
as every operation is applied to a massive amount of
data. To aid parallel data processing, we propose the use
of explicitly data-parallel primitives that are combined into
complete DBMS operations. There are many works on
primitive-based DBMS query processing. He et al. propose
multiple primitives, such as Split and Filter, for GPUs [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
Other primitives, such as prefix-sum and its variants, scatter,
and gather, have also been proposed for efficient data-parallel
execution [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. This approach provides a fail-safe: when a
new device is added, the primitives can still run on it
with minor changes to their functionality. The availability of
different granularity levels provides the additional benefit of enabling
developers to replace inefficient fine-granular primitives
with custom coarse-granular ones.
      </p>
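      <p>As an illustration of how such fine-granular primitives compose, the following sequential Python sketch builds stream compaction (materializing the rows selected by a 0/1 flag vector) from an exclusive prefix-sum and a scatter, and then uses gather to fetch a column for the selected rows. It assumes Python 3.8 or newer for the `initial` argument of `itertools.accumulate`; on a device, each function body would be one data-parallel kernel. The function names are illustrative and not the interface of the cited works.</p>
      <preformat>
```python
from itertools import accumulate


def exclusive_prefix_sum(values):
    """Prefix-sum primitive (exclusive scan): out[i] = values[0] + ... + values[i-1]."""
    return list(accumulate(values, initial=0))[:-1]


def gather(values, indices):
    """Gather primitive: out[j] = values[indices[j]]."""
    return [values[i] for i in indices]


def scatter(values, positions, size):
    """Scatter primitive: out[positions[j]] = values[j]."""
    out = [None] * size
    for value, pos in zip(values, positions):
        out[pos] = value
    return out


def compact(values, flags):
    """Keep the values whose flag is 1, using only prefix-sum and scatter."""
    offsets = exclusive_prefix_sum(flags)            # output slot of each selected row
    total = offsets[-1] + flags[-1] if flags else 0  # number of selected rows
    # Unselected rows all write to one extra "discard" slot at index total.
    positions = [off if flag else total for off, flag in zip(offsets, flags)]
    return scatter(values, positions, total + 1)[:total]


if __name__ == "__main__":
    flags = [1, 0, 1, 1, 0]                    # e.g. a decoded selection bitmap
    row_ids = compact(list(range(len(flags))), flags)
    prices = [10, 20, 30, 40, 50]
    print(row_ids, gather(prices, row_ids))    # prints [0, 2, 3] [10, 30, 40]
```
      </preformat>
      <p>Replacing `compact` by a single hand-written kernel would be an example of swapping fine-granular primitives for a custom coarse-granular one.</p>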
    </sec>
    <sec id="sec-11">
      <title>Code Fusion</title>
      <p>Implementing primitives at multiple granularity levels
is time consuming. Hence, code could be generated
at runtime for the chosen granularity level of an operation.
The code for execution on an individual device is generated
by combining the primitives for that device into a
single execution process. This reduces the overhead of
materializing data from intermediate steps.</p>
      <p>For example, the three selection predicates shown in
Figure 2 can either be run on different devices (left), with the
results combined using logical operations, or all be
combined into a single execution (right).</p>
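      <p>A minimal Python sketch of runtime code fusion, corresponding to the right-hand variant: Python's built-in `exec` stands in for compiling a device kernel (for example an OpenCL kernel) at runtime, and the (column name, expression) format for predicates as well as the function name are hypothetical. The generated function evaluates all predicates per row in a single pass, so no intermediate bitmaps are materialized.</p>
      <preformat>
```python
def generate_fused_kernel(predicates):
    """Generate, at runtime, one function that evaluates all predicates in a single pass.

    predicates: list of (column_name, boolean_expression) pairs; each expression
    refers to the current row's value by the column name (hypothetical format).
    """
    names = [name for name, _ in predicates]
    args = ", ".join(f"c{i}" for i in range(len(names)))
    row = ", ".join(names)
    condition = " and ".join(f"({expr})" for _, expr in predicates)
    source = (
        f"def kernel({args}):\n"
        f"    return [{condition} for ({row},) in zip({args})]\n"
    )
    namespace = {}
    exec(source, namespace)  # only acceptable for trusted, internally generated code
    return namespace["kernel"], source


if __name__ == "__main__":
    kernel, source = generate_fused_kernel([
        ("l_quantity", "24 > l_quantity"),
        ("l_discount", "l_discount >= 0.05 and 0.07 >= l_discount"),
        ("l_shipdate", "l_shipdate >= '1994-01-01' and '1995-01-01' > l_shipdate"),
    ])
    print(source)
    print(kernel([10, 30, 5], [0.06, 0.06, 0.09],
                 ["1994-03-01", "1994-05-01", "1994-12-01"]))  # prints [True, False, False]
```
      </preformat>
      <p>The unfused alternative would run one selection per predicate, write three bitmaps, and then combine them with logical primitives; the fused kernel trades that flexibility for fewer passes over the data.</p>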
    </sec>
    <sec id="sec-12">
      <title>In-Device Cache</title>
      <p>The current data-transfer bottleneck lies between the main
memory and the processing devices. The CPU has faster
access than other devices, as it is directly linked to the main
memory, whereas for co-processors, data must be
transported via connections with higher latency and
possibly more limited bandwidth, such as PCI Express. Thus,
even highly efficient GPUs can perform worse than CPUs
due to their limited access to main
memory. Hence, using device memory as a data cache is
crucial for high compute throughput. However, these
external devices have limited memory, so it is not
always possible to store all the necessary data on the device
itself. Thus, the host system must determine the hot set of
data to be stored in device memory, using the execution
plan for the given query and monitoring the data transfers
to the device.</p>
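      <p>The sketch below models an in-device column cache in Python. Least-recently-used eviction is a swapped-in, deliberately simple policy: the text proposes deriving the hot set from the execution plan and from monitored transfers, which a real implementation could use to pin columns the plan will need next. The class name, the capacity, and the column sizes are hypothetical.</p>
      <preformat>
```python
from collections import OrderedDict


class DeviceCache:
    """Column cache in a co-processor's limited memory with least-recently-used eviction."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.columns = OrderedDict()   # column name -> size in bytes, least recent first
        self.transferred = 0           # bytes copied from main memory to the device

    def access(self, name, size):
        """Return True on a hit; otherwise transfer the column (and cache it if it fits)."""
        if name in self.columns:
            self.columns.move_to_end(name)      # hit: no transfer needed
            return True
        self.transferred += size                # miss: the column crosses the interconnect
        if size > self.capacity:
            return False                        # too large to keep resident; stream it
        while self.used + size > self.capacity:
            _, evicted_size = self.columns.popitem(last=False)  # drop least recently used
            self.used -= evicted_size
        self.columns[name] = size
        self.used += size
        return False


if __name__ == "__main__":
    cache = DeviceCache(capacity_bytes=100)
    trace = ["l_quantity", "l_discount", "l_quantity",
             "l_shipdate", "l_discount", "l_shipdate"]
    hits = [cache.access(col, 40) for col in trace]
    print(hits, cache.transferred)  # prints [False, False, True, False, False, True] 160
```
      </preformat>
      <p>The `transferred` counter is the kind of signal the host could monitor to decide which columns belong to the hot set.</p>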
    </sec>
    <sec id="sec-13">
      <title>Execution Variants</title>
      <p>Each primitive selected for executing a given query can
have different characteristics to choose from, depending on the
executing device. For example, complex branching
statements are handled efficiently by CPUs, whereas GPUs are
capable of massive thread-level parallelism with less control
flow. In addition, the data access pattern must be selected according to
the memory architecture of the given device. For example,
coalesced data access provides efficient memory access on
GPUs. Finally, hardware-specific vectorization of DBMS
operations (SIMD) is also an important parameter in database
processing to exploit the hardware capabilities.</p>
      <p>At a more abstract level, the characteristics of the primitive
itself can also affect system throughput. The choice of output
format and the number of intermediate steps are some of the
characteristics that influence the overall system. For
example, returning bitmap results from selections on external devices
will generally be more efficient than transferring the complete
column.</p>
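      <p>A quick back-of-the-envelope check of the bitmap argument, assuming 4-byte values (e.g., 32-bit integers): a one-bit-per-row bitmap needs 1/32 of the bytes of the full column. The actual saving depends on the value width, and a later step may still need to materialize the selected values.</p>
      <preformat>
```python
def result_transfer_bytes(rows, value_bytes=4):
    """Bytes needed to return a selection result: full column vs. one-bit-per-row bitmap."""
    column_bytes = rows * value_bytes
    bitmap_bytes = (rows + 7) // 8  # round up to whole bytes
    return column_bytes, bitmap_bytes


if __name__ == "__main__":
    column, bitmap = result_transfer_bytes(6_000_000)
    print(column, bitmap, column // bitmap)  # prints 24000000 750000 32
```
      </preformat>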
    </sec>
    <sec id="sec-14">
      <title>Device-Related Parameter Tuning</title>
      <p>Finally, once we have decided on the device and its
corresponding function, certain device-related
parameters, like global and local work-group sizes, have to be tuned
to further improve the overall efficiency. These
device-related parameters are tuned by
monitoring the performance of execution: a feedback
loop from the devices provides execution-specific
information used for tuning the primitive for higher efficiency.</p>
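      <p>One simple way to close such a feedback loop is to time each admissible local work-group size and keep the fastest, as in the Python sketch below. It assumes a caller-supplied `run_kernel(global_size, local_size)` that executes the kernel once and returns its runtime (in a real system this would come from device profiling); the candidate sizes and the fake runtime function in the example are hypothetical. Skipping sizes that do not divide the global size reflects the OpenCL 1.x rule that the global size must be a multiple of the local size.</p>
      <preformat>
```python
def tune_local_size(run_kernel, global_size, candidates=(32, 64, 128, 256, 512), repeats=3):
    """Feedback loop: time each admissible local work-group size and keep the fastest.

    run_kernel(global_size, local_size) must execute the kernel once and return
    its runtime (an assumption of this sketch).
    """
    best_size, best_time = None, float("inf")
    for local_size in candidates:
        if global_size % local_size:
            continue  # OpenCL 1.x: global size must be a multiple of the local size
        # Taking the minimum over repeated runs filters out measurement noise.
        elapsed = min(run_kernel(global_size, local_size) for _ in range(repeats))
        if best_time > elapsed:
            best_size, best_time = local_size, elapsed
    return best_size, best_time


if __name__ == "__main__":
    # Stand-in for measured runtimes: pretend the device is fastest at 128 work-items.
    def fake_kernel(global_size, local_size):
        return 1.0 + abs(local_size - 128) / 100

    print(tune_local_size(fake_kernel, global_size=1024))  # prints (128, 1.0)
```
      </preformat>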
      <p>Besides the challenges mentioned above, another
major challenge is to determine the order in which to apply the
strategies to extract an efficient execution plan. Since all the
strategies mentioned above are inter-dependent, the selection of
one depends on the others. In order to have a standardized
execution flow, we propose an architecture that has all the
components necessary for applying the above strategies.</p>
    </sec>
    <sec id="sec-15">
      <title>5. CONCEPTUAL ARCHITECTURE</title>
      <p>As mentioned earlier, efficiently processing a
query in a heterogeneous environment requires all the
mentioned optimization strategies to be applied to the query.
To aid this, we propose a DBMS architecture that provides
a structure for handling optimization from the global
abstraction level down to the local device-specific level. The structure is shown
in Figure 3.</p>
      <p>The given query is first subjected to the general logical
optimization and access path selection steps. Global
optimization is then performed on the resulting query. This step determines the level of
granularity for the given query. Once selected, these
granular operations are assigned to their respective devices based
on decisions made using the hardware supervisor. Finally,
local optimization is performed on the granular operations to tune them for
their respective devices, and kernels that always work
together are combined. The components used for
these optimizations are discussed below.</p>
      <p>Global Optimizer: The global optimizer processes the
complete query and provides the best execution plan for the
whole query. It decides on the level of granularity to be
used for the primitives. In addition to the different granularities,
the parallelism strategies (i.e., pipeline and data parallelism) are also
selected here.</p>
      <p>
        The different plans available for executing a single query
lead to a search space explosion. Traversing the whole design
space might be time consuming; hence, a machine-learning-based
cost estimation algorithm is used [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>Hardware Supervisor: The hardware supervisor
provides statistical information about the underlying devices,
which helps the global optimizer make better decisions.
It combines the characteristics of individual
devices into a single integrated system view. It also
communicates with the devices and supervises the execution of
operations on the individual devices.</p>
      <p>Storage Manager: The storage manager provides
information about the location and availability of the data to be
processed. This helps determine the transfer costs when
selecting the execution device. Moreover,
not all devices have direct access to main memory.
Hence, it is the task of the storage manager to partition and
transfer data to the respective devices.</p>
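      <p>The following Python sketch shows how information from the hardware supervisor (processing rates) and the storage manager (which columns are already resident on a device) could be combined to pick an execution device. The linear cost model, the field names, and all numbers are hypothetical simplifications; the global optimizer described above relies on machine-learning-based cost estimation instead.</p>
      <preformat>
```python
from dataclasses import dataclass, field


@dataclass
class Device:
    name: str
    rows_per_ms: float        # processing rate reported by the hardware supervisor
    link_bytes_per_ms: float  # host link bandwidth; inf for direct main-memory access
    resident: set = field(default_factory=set)  # columns in device memory (storage manager)


def estimate_ms(device, columns, rows, value_bytes=4):
    """Simple analytical cost: transfer of non-resident columns plus processing time."""
    missing = [c for c in columns if c not in device.resident]
    transfer = len(missing) * rows * value_bytes / device.link_bytes_per_ms
    compute = rows / device.rows_per_ms
    return transfer + compute


def choose_device(devices, columns, rows):
    """Pick the device with the lowest estimated cost."""
    return min(devices, key=lambda d: estimate_ms(d, columns, rows))


if __name__ == "__main__":
    cpu = Device("cpu", rows_per_ms=1_000.0, link_bytes_per_ms=float("inf"))
    gpu = Device("gpu", rows_per_ms=10_000.0, link_bytes_per_ms=4_000.0)
    cols, n = ["l_quantity", "l_discount"], 1_000_000
    print(choose_device([cpu, gpu], cols, n).name)  # prints cpu: transfer dominates
    gpu.resident.update(cols)
    print(choose_device([cpu, gpu], cols, n).name)  # prints gpu: columns already cached
```
      </preformat>
      <p>The example shows why the storage manager's view matters: the same GPU loses while its memory is cold and wins once the columns are cached.</p>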
      <p>Device Manager: Each device manager has two
subcomponents: a monitor and a local optimizer. The monitor
provides device-specific information, and the local optimizer holds
information about the primitives implemented for the
corresponding device and about its current workload. It
uses this information to perform device-specific
optimizations that further increase the processing efficiency of a given
operation.</p>
    </sec>
    <sec id="sec-16">
      <title>6. PRELIMINARY EVALUATION</title>
      <p>To evaluate the efficiency of different parallelism
mechanisms, we executed TPC-H Query 6 by combining five
different primitives, namely Bitmap, Logical, Materialize,
Arithmetic, and Reduce. All these primitives are data-parallel
and are implemented using OpenCL. The execution path
for the query is shown in Figure 1. For our evaluation, we
considered four different execution models, as explained
below.</p>
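      <p>A sequential Python sketch of how the five primitives could be chained for Q6 (revenue = sum of l_extendedprice times l_discount for rows shipped in 1994 with a discount between 0.05 and 0.07 and a quantity below 24). It assumes dates are ISO strings, so string comparison orders them correctly, and uses plain lists instead of OpenCL buffers; each function body corresponds to what would be one data-parallel kernel. The names are illustrative, not the authors' OpenCL interface.</p>
      <preformat>
```python
import operator
from functools import reduce


def bitmap(column, predicate):         # Bitmap primitive: one flag per row
    return [predicate(v) for v in column]


def logical(op, *bitmaps):             # Logical primitive: combine bitmaps row-wise
    return [reduce(op, flags) for flags in zip(*bitmaps)]


def materialize(column, flags):        # Materialize primitive: keep flagged values
    return [v for v, keep in zip(column, flags) if keep]


def arithmetic(op, left, right):       # Arithmetic primitive: element-wise operation
    return [op(a, b) for a, b in zip(left, right)]


def reduce_sum(values):                # Reduce primitive: aggregate to one value
    return sum(values)


def q6(l_shipdate, l_discount, l_quantity, l_extendedprice):
    """TPC-H Q6: sum(l_extendedprice * l_discount) over the qualifying rows."""
    in_1994 = logical(operator.and_,
                      bitmap(l_shipdate, lambda d: d >= "1994-01-01"),
                      bitmap(l_shipdate, lambda d: "1995-01-01" > d))
    discount_ok = bitmap(l_discount, lambda x: x >= 0.05 and 0.07 >= x)
    quantity_ok = bitmap(l_quantity, lambda q: 24 > q)
    selected = logical(operator.and_, in_1994, discount_ok, quantity_ok)
    revenue = arithmetic(operator.mul,
                         materialize(l_extendedprice, selected),
                         materialize(l_discount, selected))
    return reduce_sum(revenue)


if __name__ == "__main__":
    print(q6(["1994-03-15", "1994-02-01", "1994-07-01", "1994-11-30"],
             [0.06, 0.05, 0.08, 0.07],
             [10, 5, 12, 30],
             [100.0, 200.0, 300.0, 400.0]))
```
      </preformat>
      <p>The three selection bitmaps are independent of each other, which is exactly the functional parallelism that the CDFP model exploits below.</p>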
      <p>Baseline linear execution: In the baseline version, we
execute a compiled, linear version of Q6 without parallelism
or primitives. The results of this execution are used as a
benchmark for comparison with the parallel implementations.</p>
      <p>Single Device Primitives (SDP): In the single-device
primitive version, the parallel primitives mentioned above
are executed on a single device. The results for the complete
execution of the parallel primitives on both CPU and GPU are
recorded for analysis.</p>
      <p>Multiple Device Pipelined (MDP): In the
multiple-device pipelined variant, we split the query into two
phases, selection and aggregation, and execute them in a
pipeline. We perform selection on the CPU and aggregation on the
GPU (MDP - CPU + GPU) and vice versa (MDP - GPU +
CPU), recording their results.</p>
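      <p>The pipelined variant can be sketched in Python with two threads that stand in for the two devices and a bounded queue that stands in for the link between them: the selection stage forwards filtered chunks while the aggregation stage consumes earlier ones. The chunking, the queue depth, and the function names are assumptions of this sketch, not details of the evaluated OpenCL implementation.</p>
      <preformat>
```python
import queue
import threading

END = object()  # end-of-stream marker passed between the two stages


def selection_stage(chunks, predicate, out_q):
    """Stage 1 (e.g. the CPU in MDP - CPU + GPU): filter each chunk and forward it."""
    for chunk in chunks:
        out_q.put([v for v in chunk if predicate(v)])
    out_q.put(END)


def aggregation_stage(in_q, result):
    """Stage 2 (e.g. the GPU): aggregate each chunk as soon as it arrives."""
    total = 0
    while True:
        chunk = in_q.get()
        if chunk is END:
            break
        total += sum(chunk)
    result.append(total)


def pipelined_query(chunks, predicate, depth=2):
    channel = queue.Queue(maxsize=depth)  # bounded buffer models the inter-device link
    result = []
    producer = threading.Thread(target=selection_stage, args=(chunks, predicate, channel))
    consumer = threading.Thread(target=aggregation_stage, args=(channel, result))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return result[0]


if __name__ == "__main__":
    chunks = [[1, 30, 5], [24, 2], [10]]
    print(pipelined_query(chunks, lambda v: 24 > v))  # prints 18
```
      </preformat>
      <p>If one stage is much slower than the other, the faster stage spends its time blocked on the queue, which mirrors the asymmetry between the two MDP variants reported below.</p>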
      <p>Cross-Device Functional Parallel (CDFP): Finally,
the query is split into functional units, and the
independent units are executed concurrently on the devices.</p>
      <p>All these models are executed on a machine running
Ubuntu 16.04 and gcc 5.4.0 with an Intel
Core i5 CPU and an NVIDIA GeForce GTX 1050 Ti GPU.</p>
      <p>[Figure: execution time of TPC-H Q6 for the Baseline-Linear, SDP-CPU, SDP-GPU, MDP-CPU+GPU, MDP-GPU+CPU, and CDFP variants]</p>
      <p>From the results, we see that the single-device execution
model on the CPU has the lowest efficiency for processing Q6
and is even slower than the linear execution variant. This is
due to the additional materialization step that must be performed.
For single-device execution of the query on the GPU, the
system is nearly 2.5x faster than the CPU variant and 2x
faster than the scalar version.</p>
      <p>For the multi-device pipelined model, we see that the variant with
selection on the CPU and reduction on the GPU is 2x slower than its
counterpart. The selection phase on the CPU takes considerable time
for processing the select, logical, and materialize primitives,
whereas with selection on the GPU, only materializing the values
has a high execution time.</p>
      <p>Finally, we see that the cross-device functional parallelism
model has the highest efficiency in processing the query.
This is mainly due to the multiple selection predicates
in the query: the execution latency is reduced by
executing the selection and materialization steps in
parallel. Detailed information on the execution of the individual
primitives in this variant is shown in Figure 5.</p>
      <p>[Figure 5: per-primitive execution time (milliseconds) on the CPU and the GPU for the CDFP variant; legend: Selection, Logical, Materialize, Arithmetic, Reduce, Wait]</p>
      <p>From the chart, we see that the devices wait at multiple
points for each other to finish before continuing to process the
query. During selection and materialization, the GPU waits
until the CPU has processed its values before executing the next
step. Also, the CPU is idle while the GPU computes
the results of the arithmetic, logical, and aggregation operations.</p>
      <p>From these results, we infer that functional
parallelism enhances the efficiency of query processing. This
advantage, however, comes with the
disadvantage of synchronization overheads due to differences in
processing speed among the devices.</p>
    </sec>
    <sec id="sec-17">
      <title>7. RELATED WORK</title>
      <p>
        Karnagel et al. have explored adaptivity in DBMSs
using primitives for executing a query [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. They group a
subset of primitives to be executed on a single device into
an execution island and process it. They also use device-level
caching to reduce transfer overhead. Once the
intermediate result for an island is computed, an intermediate
estimation step is performed to select the subsequent devices. In
our method, the execution path is given by an optimizer and
is executed by the devices.
      </p>
      <p>
        In terms of operator granularity, He et al. have given
a comprehensive set of data-parallel primitives that can be
ported to various hardware [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Our research
complements theirs by adding new primitives and additional
functionalities to the already defined ones. Similarly, Pirk et al.
have given an abstract set of primitives that can be
used on various platforms [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
    </sec>
    <sec id="sec-18">
      <title>8. CONCLUSION</title>
      <p>In this paper, we detailed the need for an adaptive
DBMS architecture that can be easily modified based on
the underlying hardware and software functionalities.
In such an adaptable DBMS, the executable operations must be
generalized for high interoperability, whereas device-specific
operations are needed for higher efficiency. Along with the
challenge of selecting the right abstraction level, there are
multiple further challenges for an adaptable DBMS in a
heterogeneous environment. Our main contribution in this work
is a framework for overcoming these challenges with the
concepts listed below:</p>
      <sec id="sec-18-1">
        <title>Granular levels for DBMS operations</title>
      </sec>
      <sec id="sec-18-2">
        <title>Device-specific code generation</title>
      </sec>
      <sec id="sec-18-3">
        <title>In-device data caching techniques</title>
      </sec>
      <sec id="sec-18-4">
        <title>Device and functional variants of operators</title>
        <p>Hardware- and functionality-based tunable parameters</p>
        <p>The interfacing of different components of a DBMS is a
challenging task in itself. A plug'n'play DBMS architecture
removes these overheads by providing interfaces for
supporting additional hardware and software. Also, an adaptable
DBMS could help optimize a new
functionality that is formed by combining the given set of granular
primitives, as the primitives are themselves tuned for efficiency.
Finally, this adaptive DBMS architecture decouples the
functional and device-based execution layers, thereby
providing independence between an operation and its
corresponding execution unit.</p>
      </sec>
    </sec>
    <sec id="sec-19">
      <title>ACKNOWLEDGMENTS</title>
      <p>This work was partially funded by the DFG (grant nos.
SA 465/51-1 and PI 447/9).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bakkum</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Skadron</surname>
          </string-name>
          .
          <article-title>Accelerating SQL database operations on a GPU with CUDA</article-title>
          .
          <source>Proceedings of the Workshop on General-Purpose Computation on Graphics Processing Units</source>
          , pages
          <fpage>94</fpage>
          –
          <lpage>103</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Borkar</surname>
          </string-name>
          and
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Chien</surname>
          </string-name>
          .
          <article-title>The Future of Microprocessors</article-title>
          .
          <source>Communications of the ACM</source>
          , pages
          <fpage>67</fpage>
          –
          <lpage>77</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Breß</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Heimel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Siegmund</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bellatreche</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Saake</surname>
          </string-name>
          .
          <article-title>GPU-accelerated database systems: Survey and open challenges</article-title>
          .
          <source>Transactions on Large-Scale Data- and Knowledge-Centered Systems</source>
          , pages
          <fpage>1</fpage>
          –
          <lpage>35</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Breß</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Saake</surname>
          </string-name>
          .
          <article-title>Why it is time for a HyPE</article-title>
          .
          <source>Proceedings of the International Conference on Very Large Databases</source>
          ,
          <volume>6</volume>
          (
          <issue>12</issue>
          ):
          <fpage>1398</fpage>
          –
          <lpage>1403</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Broneske</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Breß</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Heimel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Saake</surname>
          </string-name>
          .
          <article-title>Toward hardware-sensitive database operations</article-title>
          .
          <source>In Proceedings of the International Conference on Extending Database Technology</source>
          , pages
          <fpage>229</fpage>
          –
          <lpage>234</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dotsenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. K.</given-names>
            <surname>Govindaraju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.-P.</given-names>
            <surname>Sloan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Boyd</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Manferdelli</surname>
          </string-name>
          .
          <article-title>Fast Scan Algorithms on Graphics Processors</article-title>
          .
          <source>In Proceedings of the Annual International Conference on Supercomputing</source>
          , pages
          <fpage>205</fpage>
          –
          <lpage>213</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>B.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. K.</given-names>
            <surname>Govindaraju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Luo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P. V.</given-names>
            <surname>Sander</surname>
          </string-name>
          .
          <article-title>Relational Query Coprocessing on Graphics Processors</article-title>
          .
          <source>ACM Transactions on Database Systems</source>
          , pages
          <fpage>21:1</fpage>
          –
          <lpage>21:39</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hellerstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ré</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Schoppmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Fratkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gorajek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. S.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Welton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar</surname>
          </string-name>
          .
          <article-title>The MADlib Analytics Library or MAD Skills, the SQL</article-title>
          .
          <source>Proceedings of the International Conference on Very Large Databases</source>
          , pages
          <fpage>1700</fpage>
          –
          <lpage>1711</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>C.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Awada</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>Big data processing in cloud computing environments</article-title>
          .
          <source>Proceedings of the International Symposium on Pervasive Systems, Algorithms and Networks</source>
          , pages
          <fpage>17</fpage>
          –
          <lpage>23</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T.</given-names>
            <surname>Karnagel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Habich</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Lehner</surname>
          </string-name>
          .
          <article-title>Adaptive Work Placement for Query Processing on Heterogeneous Computing Resources</article-title>
          .
          <source>Proceedings of the International Conference on Very Large Databases</source>
          , pages
          <fpage>733</fpage>
          –
          <lpage>744</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>H.</given-names>
            <surname>Pirk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Moll</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zaharia</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Madden</surname>
          </string-name>
          .
          <article-title>Voodoo - A vector algebra for portable database performance on modern hardware</article-title>
          .
          <source>Proceedings of the International Conference on Very Large Databases</source>
          , pages
          <fpage>1707</fpage>
          –
          <lpage>1718</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>K.-U.</given-names>
            <surname>Sattler</surname>
          </string-name>
          and
          <string-name>
            <given-names>O.</given-names>
            <surname>Dunemann</surname>
          </string-name>
          .
          <article-title>SQL database primitives for decision tree classifiers</article-title>
          .
          <source>In Proceedings of the International Conference on Information and Knowledge Management</source>
          , pages
          <fpage>379</fpage>
          –
          <lpage>386</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>