=Paper=
{{Paper
|id=Vol-2126/paper1
|storemode=property
|title=Adaptive Data Processing in Heterogeneous Hardware Systems
|pdfUrl=https://ceur-ws.org/Vol-2126/paper1.pdf
|volume=Vol-2126
|authors=Bala Gurumurthy,Tobias Drewes,David Broneske,Gunter Saake,Thilo Pionteck
|dblpUrl=https://dblp.org/rec/conf/gvd/GurumurthyDBSP18
}}
==Adaptive Data Processing in Heterogeneous Hardware Systems==
Bala Gurumurthy, Tobias Drewes, David Broneske, Gunter Saake, Thilo Pionteck
Otto-von-Guericke-Universität
bala.gurumurthy@ovgu.de, tobias.drewes@ovgu.de, david.broneske@ovgu.de, gunter.saake@ovgu.de, thilo.pionteck@ovgu.de
ABSTRACT

In recent years, Database Management Systems (DBMS) have seen advancements in two major sectors, namely functional and hardware support. Before, a traditional DBMS was sufficient for performing a given operation, whereas a current DBMS is required to perform complex analytical tasks like graph analysis or OLAP. These operations require additional functions to be added to the system for processing. A similar evolution is seen in the underlying hardware. This advancement in both the functional and the hardware domain of a DBMS requires modification of its overall architecture. Hence, it is evident that an adaptable DBMS is necessary for supporting this highly volatile environment. In this work, we list the challenges present for an adaptable DBMS and propose a conceptual model for such a system that provides interfaces to easily adapt to software and hardware changes.

1. INTRODUCTION

In recent years, a traditional Database Management System (DBMS) is required to also perform various complex operations that are not directly supported by it. To perform these special analytical operations, several tailor-made functions are integrated into the DBMS [12]. Further, the efficiency of a DBMS depends on the throughput of the underlying hardware. DBMS operations are ported to different specialized hardware to achieve better throughput. This heterogeneity of functionalities and hardware systems requires modifications of the existing DBMS structure, incurring additional complexities and challenges.

In the current context, various analytical tasks such as graph processing or data mining are executed in DBMSs [8, 9]. These functionalities are ported to a DBMS with the overhead of altering the overall architecture of the system. However, modification of the complete system structure is time consuming, and also not all components are tuned for efficiency [12].

Further, the throughput of any software system depends on its underlying hardware. Many specialized compute devices are fabricated to perform certain specific tasks efficiently. These devices trade off generality for higher efficiency in certain tasks. They are used as additional co-processors along with the CPU for better throughput. One of the commonly used co-processors is the GPU, which is mainly used for enhancing graphical processing in a system. The parallelism of GPUs has already been exploited extensively for several DBMS operations [3, 1]. Similarly, other devices are available, such as MICs (Many Integrated Cores), APUs (Accelerated Processing Units), FPGAs (Field Programmable Gate Arrays), etc., that could be exploited for efficiently executing DBMS operations. The major challenge in integrating this hardware is the execution of the device-specific variant of the same DBMS operation, optimized for the given hardware.

Thus, the availability of different hardware enables a new level of parallelism that we call cross-device parallelism. At this level, a user-given query is executed in parallel among different devices for concurrent execution. Along with cross-device parallelism, we also have the traditional pipeline and data-parallel execution of functions to increase efficiency. These dimensions of parallelism incur the additional complexity of efficiently traversing them to determine the optimal execution path for a given query.

Along with optimization, the heterogeneity of hardware requires concepts for reducing the data transfer cost among different devices. In a main-memory DBMS, the transfer bottleneck exists between main memory and the co-processing devices. Hence, it is also crucial to minimize the data transfer time to improve the efficiency of DBMS processing.

Hence, the heterogeneity of functions and hardware poses multiple challenges to be addressed and requires a system that is adaptable to these changes. In this work, we provide our insights on the challenges present in developing an adaptive database system and the techniques for overcoming these challenges. As there could be more functionalities and hardware available in the future to be integrated into a DBMS, we focus on a plug'n'play architecture that enables the addition of these newer functions and hardware with considerably less overhead than upgrading the complete architecture. This architecture provides interfaces for integrating different functionalities and hardware into the DBMS with less effort.

30th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 22.05.2018 - 25.05.2018, Wuppertal, Germany.
Copyright is held by the author/owner(s).
The main contributions of our work are:

• The existing challenges for an adaptive DBMS in the context of hardware and software heterogeneity.

• The concepts for developing an adaptable DBMS with plug'n'play capabilities.

The subsequent paper is structured as follows. In Section 2, we provide an overview of the different devices used for DBMS and list the challenges in using them. In Section 3, we discuss the various challenges present due to functional and hardware heterogeneity in DBMS, and in Section 4, we provide our concepts for developing an adaptable DBMS that addresses these challenges. The conceptual discussions in this paper have already been explored in different works, which we detail in Section 7. Finally, we provide the summary of the paper in Section 8.

2. DBMS IN HETEROGENEOUS HARDWARE ENVIRONMENT

Relying on CPUs as the workhorse is approaching the limits of their efficiency [2]. Multiple works have been conducted to port the existing database operations to different hardware. We discuss these devices and their DBMS support below.

GPU

While a single GPU core has a lower clock frequency than a CPU core, a GPU features several hundreds of them compared to the several tens of cores that current CPUs offer. However, memory accesses in particular have a high latency that needs to be hidden by processing. To this end, GPUs spawn multiple threads for a given function and perform context switching to hide the latency. This massive parallelism in GPUs is useful for performing data-intensive DBMS operations. Some of the DBMSs using GPUs are CoGaDB, GPUDB, etc. [3, 7].

The major open challenges in using GPUs for DBMS are:

1. Cost models for determining the operator to execute on the GPU at runtime

2. Combined query compilation and execution strategies for CPU and GPU.

FPGA

Another hardware that has gained much attention in recent years is the FPGA (Field Programmable Gate Array). FPGAs are programmed either using RTL (Register Transfer Level) languages (VHDL, Verilog) or via HLS (High-Level Synthesis), where the circuits are extracted, for example, from C or OpenCL code. This provides a platform that can be tuned to perfection for any given domain-specific operation, providing higher throughput.

The open challenges in using FPGAs are:

1. Selection and placement of operators for partially reconfigurable implementations

2. Efficient pipelining between different operations at runtime

Other Devices

Other hardware used for DBMS are MICs (Many Integrated Cores) and APUs (Accelerated Processing Units). In case of the MIC, there are multiple CPU cores available for processing, connected with each other using an on-chip bus system. These processors are capable of performing complex computations. APUs, in contrast, have both a CPU and a GPU on a single die. Here, both CPU and GPU have access to the same memory space (i.e., main memory).

3. CHALLENGES IN HETEROGENEOUS ENVIRONMENTS

To have a DBMS adaptable to both changing hardware and software, the following challenges have to be addressed.

3.1 Device Features

Adding a DBMS operation to a new processing device requires novel ways to exploit the device without compromising the overall system design. Hence, one of the major challenges is to reorganize the processing functions based on the hardware features available and also to adapt the underlying functions for efficient execution on the device.

3.2 Abstraction Hierarchy

It has been shown that speedups for any particular database operation can be achieved by performing device-specific parameter tuning of the given operation [5]. Removing these device-specific parameters aids adaptability but makes it hard to tune for optimal execution, as each device has its own advantages. Due to this polarity between abstraction and specialization of functions and devices, we must find a good abstraction level for the operations that provides both an interface to write new functions and also exploits the hardware for optimal efficiency.

3.3 Parallelism Complexity

The growth of DBMS at both the functional and the hardware level provides various parallelization opportunities. The presence of multiple devices creates an additional paradigm: cross-device parallelization. Using this type of parallelism, the given query is divided into granular parts based on the level of abstraction selected, and these functional primitives are distributed among the different processing devices for parallel processing. We detail the different types of parallelization below.

Functional Parallelism

In multiple instances, the incoming queries have various sub-operations that run independently of each other. One common example is the availability of multiple selection predicates combined using logical operations. These predicates can be executed in parallel among the different devices, and the results are combined in the next steps. Thus, identifying and dissecting these parallel operations provides additional capabilities for simultaneous execution in the form of functional parallelism. The major challenge in this parallelism is the intermediate step of materializing the results to be processed by the next operator in the pipeline. There is also a synchronization overhead present in this parallelism due to the differences in the execution time of the different processing devices.
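The predicate-level functional parallelism described above can be sketched in a few lines. The snippet below is an illustrative analogue only, not the paper's OpenCL implementation: two selection predicates (with invented example data) are evaluated concurrently by worker threads standing in for two devices, and their bitmaps are combined with a logical AND, which is exactly the synchronization point mentioned above.

```python
from concurrent.futures import ThreadPoolExecutor

def selection(column, predicate):
    """Selection primitive: emit a bitmap, one bit per value."""
    return [1 if predicate(v) else 0 for v in column]

def logical_and(bitmap_a, bitmap_b):
    """Logical primitive: combine bitmaps of independent predicates."""
    return [a & b for a, b in zip(bitmap_a, bitmap_b)]

# Invented toy columns.
quantity = [10, 30, 45, 5, 60]
discount = [0.05, 0.06, 0.02, 0.07, 0.06]

# Run the two independent selections in parallel (functional parallelism);
# each future could, in principle, be dispatched to a different device.
with ThreadPoolExecutor(max_workers=2) as pool:
    f1 = pool.submit(selection, quantity, lambda q: q < 40)
    f2 = pool.submit(selection, discount, lambda d: d >= 0.05)
    bitmap = logical_and(f1.result(), f2.result())  # synchronization point

print(bitmap)  # -> [1, 1, 0, 1, 0]
```

The `.result()` calls make the cost visible: the combining step blocks until the slower of the two predicates finishes, which is the synchronization overhead the text describes.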
Data Parallelism

In contrast to functional parallelism, data parallelism does not split an operation into different functions but executes the same operation on different partitions of the data concurrently. This method has a similar synchronization overhead of waiting for all the devices to finish processing. The major disadvantage of this parallelism is the additional step of merging the results from the different devices.
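A minimal sketch of this data-parallel execution (illustrative only; device dispatch and the OpenCL kernels are replaced by threads and invented data): the same selection runs on disjoint partitions of a column, and a final merge step concatenates the partial bitmaps, mirroring the merge overhead noted above.

```python
from concurrent.futures import ThreadPoolExecutor

def selection(partition, predicate):
    """The same operation, applied to one partition of the data."""
    return [1 if predicate(v) else 0 for v in partition]

def partition(column, n):
    """Split a column into chunks of roughly equal size."""
    size = (len(column) + n - 1) // n
    return [column[i:i + size] for i in range(0, len(column), size)]

column = [12, 55, 7, 80, 33, 41, 9, 66]
parts = partition(column, 4)  # e.g. one chunk per device

with ThreadPoolExecutor(max_workers=4) as pool:
    partial = list(pool.map(lambda p: selection(p, lambda v: v < 40), parts))

# Merge step: all workers must finish before the result is usable.
bitmap = [bit for chunk in partial for bit in chunk]
print(bitmap)  # -> [1, 0, 1, 0, 1, 0, 1, 0]
```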
Cross-Device Parallelism

The above-mentioned functional and data-level parallelism are decided after the selection of processing devices. As mentioned earlier, each device has its own perks, which must be utilized to the maximum extent. Hence, it is necessary to decide on the implementation details for the given device that exploit the hardware for efficient execution. Moreover, the above-mentioned parallelization strategies can also be realized at the device level. In terms of device-level functional parallelism, this could be multiple operators running in parallel on different devices or in a pipeline with communication between the devices. Similarly, data parallelism could also be realized via suitable cost functions for operations on devices.
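One simple way such cost functions can drive the device choice is a greedy assignment: estimate the runtime of each primitive on each device and place it on the cheapest one. The cost numbers below are entirely made up for illustration and are not the paper's cost model; in a real system they would come from calibration or runtime monitoring.

```python
# Hypothetical per-device cost estimates (ms) for each primitive.
COSTS = {
    "selection":   {"CPU": 4.0, "GPU": 1.5, "FPGA": 2.0},
    "materialize": {"CPU": 2.0, "GPU": 3.5, "FPGA": 2.5},
    "reduce":      {"CPU": 3.0, "GPU": 1.0, "FPGA": 1.2},
}

def assign_devices(primitives, costs):
    """Greedy placement: pick the cheapest estimated device per primitive."""
    return {p: min(costs[p], key=costs[p].get) for p in primitives}

plan = assign_devices(["selection", "materialize", "reduce"], COSTS)
print(plan)  # -> {'selection': 'GPU', 'materialize': 'CPU', 'reduce': 'GPU'}
```

A greedy per-primitive choice ignores transfer costs between devices, which is precisely why the later sections argue for a global optimizer.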
3.4 Optimization Strategies

The different levels of parallelism for executing a query provide additional opportunities for fine-tuning the operations, but introduce the complexity of selecting the optimal execution path. As the decision at the top level of parallelism influences the subsequent levels, selecting the right execution path for a given query is critical. However, the important drawback of this multi-level parallelism model is the search space explosion. There are various options available at any given level, and thereby multiple combinations in total to select from. This search space of parallelism has to be traversed to find the optimal execution path. Deciding the optimal path of a single operation in a query can already be complex (e.g., join order optimization), and the new dimension of multiple devices increases the complexity further. Hence, newer methods for exploring the various optimization opportunities have to be determined.

4. ADAPTABLE DBMS

The mentioned challenges require a DBMS architecture that efficiently handles the diversity in both functionality and hardware. Based on the challenges, we have identified areas to be explored for designing an adaptive DBMS.

For a better explanation of the challenges, we use TPC-H query 6 as our motivating example. The query selects data from multiple columns, multiplies the results, and outputs the aggregate. These three operations are in turn executed using multiple granular primitive functions. The different primitives used for processing the given query are:

• Selection selects the values from the given column. Bitmaps are used as the output format to reduce the data transfer size, as each bit carries the selection information of a single value.

• Logical Operation performs logical functions on the bitmaps produced by the different selections.

• Materialize extracts the selected column values using the given bitmap input.

• Arithmetic performs arithmetic operations over column values.

• Reduce performs aggregation of values.

The flow of execution of query 6 using these primitives is given in Figure 1. The figure also illustrates the different optimizations that can be done even for a simple query like this one. We discuss the various optimization strategies in the subsequent sections.

Figure 1: Optimization strategies for TPC-H Q6

4.1 Granularity of Operation

One of the main challenges in the proposed adaptable system is the level of granularity required for optimized processing. Based on the capabilities of the devices, we could either run a few complex operations or split them into more granular sub-operations and then execute those in parallel among multiple devices.

At the top level, each database operation acts as a set of primitives connected together to provide a final result. The more granularly a function is split, the more hardware sensitivity comes into play. For example, the access patterns in CPU and GPU differ for efficient processing. Further, database operations are data centric, where every operation is applied to a massive amount of data. To aid parallel data processing, we propose the use of explicitly data-parallel primitives to be combined into complete DBMS operations. There are many works on primitive-based DBMS query processing. He et al. propose multiple primitives such as Split, Filter, etc., for GPUs [7]. Other primitives such as prefix-sum and its variants, scatter, and gather are also proposed for efficient data-parallel execution [6]. This approach provides a fail-safe: when a newer device is added, the primitives could still run on it with minor changes to the functionality. The availability of different granular levels provides the additional benefit of enabling developers to replace inefficient fine-granular primitives with custom coarse-granular ones.

Also at an abstract level, the characteristics of the primitive itself can affect system throughput. The choice of output format and the number of intermediate steps are some of the characteristics that influence the overall system. For example, using bitmap results from selections on external devices will generally be more efficient than transferring the complete column.
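To make the primitive decomposition concrete, here is a toy, pure-Python rendition of the five Q6 primitives chained together. This is a simplified stand-in for the data-parallel OpenCL kernels, with invented example data; discounts are kept as whole percent values so the arithmetic stays exact.

```python
def selection(column, predicate):
    """Selection: bitmap output, one bit per value."""
    return [1 if predicate(v) else 0 for v in column]

def logical_and(*bitmaps):
    """Logical: AND the bitmaps of the individual predicates."""
    return [int(all(bits)) for bits in zip(*bitmaps)]

def materialize(column, bitmap):
    """Materialize: extract the column values selected by the bitmap."""
    return [v for v, bit in zip(column, bitmap) if bit]

def arithmetic(col_a, col_b):
    """Arithmetic: element-wise multiplication (price x discount)."""
    return [a * b for a, b in zip(col_a, col_b)]

def reduce_sum(column):
    """Reduce: aggregate the materialized products."""
    return sum(column)

# Invented toy lineitem data, mirroring Q6's predicate structure.
extendedprice = [100, 200, 300, 400]
discount_pct  = [6, 1, 7, 6]   # discount in whole percent
quantity      = [10, 5, 30, 2]

bitmap = logical_and(
    selection(discount_pct, lambda d: 5 <= d <= 7),
    selection(quantity, lambda q: q < 24),
)
revenue = reduce_sum(
    arithmetic(materialize(extendedprice, bitmap),
               materialize(discount_pct, bitmap)))
print(bitmap, revenue)  # -> [1, 0, 0, 1] 3000
```

Each function is independently replaceable, which illustrates the fail-safe argument above: a coarse-granular fused kernel could substitute for any sub-chain without changing the surrounding composition.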
4.2 Code Fusion

Implementing primitives at multiple granular levels becomes time consuming. Hence, code could be generated at runtime for the given granularity level of the operation. The code for execution on an individual device is generated by combining the primitives for the corresponding device into a single execution process. This reduces the overhead of materializing data from intermediate steps.

Figure 2: Code synthesis - selection

For example, the three selection predicates shown in Figure 2 can either be run on different devices (left), with the results combined using logical operations, or all be combined into a single execution (right).

4.3 In-Device Cache

The current data-transfer bottleneck is between the main memory and the processing devices themselves. The CPU has faster access than other devices, as it is directly linked to the main memory, whereas in the case of the co-processors, data must be transported via connections with higher latency and possibly more limited bandwidth, such as PCI Express. Thus, even highly efficient GPUs can have sub-optimal performance compared to CPUs due to limited access capabilities to main memory. Hence, using the device memory as a data cache is crucial for high compute throughput. However, these external devices have limited memory, so it is not always possible to store all the necessary data on the device itself. Thus, the host system must determine the hot set of data to be stored in the device memory using the execution plan for the given query and by monitoring the data transfer to the device.

4.4 Execution Variants

Each primitive selected for executing a given query can have different characteristics to choose from based on the executing device. For example, complex branching statements are handled efficiently by CPUs, whereas GPUs are capable of massive thread-level parallelism with less control flow. In addition, the data access pattern must be selected according to the memory architecture of the given device. For example, coalesced data access provides efficient memory access on the GPU. Finally, hardware-specific vectorization of DBMS operations (SIMD) is also an important parameter in database processing to exploit the hardware capabilities.

4.5 Device-Related Parameter Tuning

Finally, once we have decided on the device and its corresponding function to execute, certain device-related parameters like global and local work group sizes have to be tuned for further improvement of the overall efficiency. These device-related parameters are tuned for efficiency by monitoring the performance of the execution. A feedback loop from the devices provides execution-specific information used for tuning the primitive for higher efficiency.

Other than the above-mentioned challenges, one of the major challenges is to formulate an order for applying these strategies to extract an efficient execution plan. Since all the strategies mentioned above are inter-dependent, the selection of one depends on the others. In order to have a standardized execution flow, we propose an architecture that has all the necessary components for applying the above strategies.

5. CONCEPTUAL ARCHITECTURE

As mentioned earlier, the overall efficiency of processing a query in a heterogeneous environment requires all the mentioned optimization strategies to be applied to a given query. To aid this, we propose a DBMS architecture that provides a structure to handle the optimization from the global abstraction level down to the local device-specific levels. The structure is shown in Figure 3.

Figure 3: Conceptual Architecture

The given query is first subjected to the general logical optimization and access path selection steps. Global optimization is done over the resultant query from the logical and access path selection steps. This step determines the level of granularity for the given query. Once selected, these granular operations are provided to their respective devices based on the decision made using the hardware supervisor. Finally, local optimization is done on the granular operations to tune them for their respective devices, and the kernels that always work together are combined. The components used for these optimizations are discussed below.
Global Optimizer: The global optimizer processes the complete query and provides the best execution plan for the whole query. It decides on the level of granularity to be used as primitives. In addition to the different granularities, the parallelism strategies (i.e., pipeline and data) are also selected here. The different schemas available for executing a single query lead to a search space explosion. Traversing the whole design space might be time consuming, and hence a machine learning based cost estimation algorithm is used [4].

Hardware Supervisor: The hardware supervisor provides statistical information about the underlying devices. This helps in improving the decisions made by the global optimizer. It combines the characteristics of the individual devices into a single integrated system view. It also communicates with the devices and supervises the execution of operations on the individual devices.

Storage Manager: The storage manager provides information about the location and availability of the data to be processed. This aids in determining the transfer costs in order to aid selecting the execution device. Also, it is evident that not all devices have direct access to the main memory. Hence, it is the task of the storage manager to partition and transfer data to the respective devices.

Device Manager: Each device manager has two sub-components: a monitor and a local optimizer. Monitors provide device-specific information, and the local optimizer holds information about the primitives implemented on the corresponding device and also about the current workload. It uses this information to perform device-specific optimizations to further increase the processing efficiency of a given operation.

6. PRELIMINARY EVALUATION

To evaluate the efficiency of the different parallelism mechanisms, we executed TPC-H query 6 by combining five different primitives, namely Bitmap, Logical, Materialize, Arithmetic and Reduce. All these primitives are data parallel and are implemented using OpenCL. The execution path for the query is shown in Figure 1. For our evaluation, we considered four different execution models as explained below.

Baseline linear execution: In the baseline version, we execute the linear compiled Q6 query without parallelism or primitives. The result of this execution is used as a benchmark to compare with the other parallel implementations.

Single Device Primitives (SDP): In the single device primitive version, the parallel primitives mentioned above are executed on a single device. The results for the complete execution of the parallel primitives on both CPU and GPU are recorded for analysis.

Multiple Device Pipelined (MDP): In the multiple device pipelined variant, we split the query into two phases, selection and aggregation, and execute them in a pipeline. We perform selection on the CPU and aggregation on the GPU (MDP - CPU + GPU) and vice versa (MDP - GPU + CPU), recording their results.

Cross-Device Functional Parallel (CDFP): Finally, the given query is split into functional units, and the independent units are executed concurrently on the devices.

All these models are executed on a machine running Ubuntu OS version 16.04 and gcc version 5.4.0 with an Intel Core i5 CPU and an Nvidia Geforce 1050 Ti GPU.

Figure 4: Execution model variants - results

From the results, we see that the single device execution model on the CPU has the lowest efficiency for processing Q6 and is even slower than the linear execution variant. This is due to the additional materialization step to be performed. In case of single device execution of the query on the GPU, the system is nearly 2.5x faster than the CPU variant and 2x faster than the scalar version.

For the multi device pipelined model, we see that the CPU selection with GPU reduce variant is 2x slower than its counterpart. The selection phase on the CPU takes considerable time for processing the select, logical and materialize primitives, whereas GPU selection has a higher execution time only for materializing the values.

Finally, we see that the cross-device functional parallelism model has the highest efficiency in processing the query. This is mainly due to the multiple selection predicates available in the query. The latency of execution is reduced by executing the selection and materialization steps in parallel. The detailed information on the execution of the individual primitives in this variant is shown in Figure 5.

Figure 5: Cross-device parallelism - results

From the chart, we see that the devices wait at multiple instances for the other to finish before continuing to process the query. In case of selection and materialization, the GPU waits until the CPU has processed its values before executing the next results. Also, the CPU is idle while the GPU is computing the results of the arithmetic, logical and aggregation operations.

From these results, we infer that using functional parallelism enhances the efficiency of query processing. The advantage of functional parallelism comes with the disadvantage of synchronization overheads due to differences in processing speed among the different devices.
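The two-phase MDP variant can be sketched as a producer/consumer pipeline. The snippet below is an illustrative Python analogue of the CPU/GPU pipeline (threads stand in for devices; batches and the filter are invented), showing how the aggregation phase consumes batches as the selection phase emits them.

```python
import queue
import threading

def selection_stage(batches, out_q):
    """Phase 1 (e.g. on the CPU): filter each batch, emit survivors."""
    for batch in batches:
        out_q.put([v for v in batch if v < 40])
    out_q.put(None)  # end-of-stream marker

def aggregation_stage(in_q, result):
    """Phase 2 (e.g. on the GPU): aggregate batches as they arrive."""
    total = 0
    while (batch := in_q.get()) is not None:
        total += sum(batch)
    result.append(total)

batches = [[10, 50, 30], [70, 20], [5, 90, 15]]
q, result = queue.Queue(), []

t1 = threading.Thread(target=selection_stage, args=(batches, q))
t2 = threading.Thread(target=aggregation_stage, args=(q, result))
t1.start(); t2.start(); t1.join(); t2.join()

print(result[0])  # -> 80
```

The blocking `in_q.get()` makes the pipeline's synchronization overhead visible: whenever one stage is slower, the other idles, which matches the wait times observed in Figure 5.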
7. RELATED WORK

Karnagel et al. have explored adaptivity in DBMS using primitives for executing a query [10]. They group a subset of primitives to be executed on a single device into an execution island and process them. They also use device-level caching to reduce transfer overhead. Once the intermediate result for an island is computed, an intermediate estimation step is done to select the subsequent devices. In our method, the execution path is given by an optimizer and is executed by the devices.

In terms of granularity of operators, He et al. have given a comprehensive set of data-parallel primitives that can be ported to various hardware [7]. Our research complements theirs by adding new primitives and additional functionalities to the already defined ones. Similarly, Pirk et al. have also given an abstracted set of primitives that could be used on various platforms [11].

8. CONCLUSION

We detailed in this paper the need for an adaptive architecture for DBMS that can be easily modified based on the underlying hardware and the software functionalities. In such an adaptable DBMS, the executable operations must be generalized for high interoperability, whereas device-specific operations are needed for higher efficiency. Along with the challenge of selecting the right abstraction level, there are multiple further challenges for an adaptable DBMS in a heterogeneous environment. Our main contribution in this work is the framework for overcoming these challenges with the concepts listed below:

• Granular levels for DBMS operations

• Device-specific code generation

• In-device data caching techniques

• Device and functional variants of operators

• Hardware- and functionality-based tunable parameters

The interfacing of the different components of a DBMS is a challenging task in itself. A plug'n'play architecture in DBMS removes these overheads by providing interfaces for supporting additional hardware and software. Also, an adaptable DBMS could additionally help in optimizing a new functionality that is formed by combining the given set of granular primitives, as the primitives are themselves tuned for efficiency. Finally, this adaptive architecture of DBMS de-couples the functional and device-based execution layers, thereby providing independence between an operation and its corresponding execution unit.

9. ACKNOWLEDGMENTS

This work was partially funded by the DFG (grant no.: SA 465/51-1 and PI 447/9).

10. REFERENCES

[1] P. Bakkum and K. Skadron. Accelerating SQL database operations on a GPU with CUDA. Proceedings of the Workshop on General-Purpose Computation on Graphics Processing Units, pages 94–103, 2010.
[2] S. Borkar and A. A. Chien. The Future of Microprocessors. Communications of the ACM, pages 67–77, 2011.
[3] S. Breß, M. Heimel, N. Siegmund, L. Bellatreche, and G. Saake. GPU-accelerated database systems: Survey and open challenges. Transactions on Large-Scale Data- and Knowledge-Centered Systems, pages 1–35, 2014.
[4] S. Breß and G. Saake. Why it is time for a HyPE. Proceedings of the International Conference on Very Large Databases, 6(12):1398–1403, 2013.
[5] D. Broneske, S. Breß, M. Heimel, and G. Saake. Toward hardware-sensitive database operations. In Proceedings of the International Conference on Extending Database Technology, pages 229–234, 2014.
[6] Y. Dotsenko, N. K. Govindaraju, P.-P. Sloan, C. Boyd, and J. Manferdelli. Fast Scan Algorithms on Graphics Processors. In Proceedings of the Annual International Conference on Supercomputing, pages 205–213, 2008.
[7] B. He, M. Lu, K. Yang, R. Fang, N. K. Govindaraju, Q. Luo, and P. V. Sander. Relational Query Coprocessing on Graphics Processors. ACM Transactions on Database Systems, pages 21:1–21:39, 2009.
[8] J. Hellerstein, C. Ré, F. Schoppmann, D. Z. Wang, E. Fratkin, A. Gorajek, K. S. Ng, C. Welton, X. Feng, K. Li, and A. Kumar. The MADlib Analytics Library or MAD Skills, the SQL. Proceedings of the International Conference on Very Large Databases, pages 1700–1711, 2012.
[9] C. Ji, Y. Li, W. Qiu, U. Awada, and K. Li. Big data processing in cloud computing environments. Proceedings of the International Symposium on Pervasive Systems, Algorithms and Networks, pages 17–23, 2012.
[10] T. Karnagel, D. Habich, and W. Lehner. Adaptive Work Placement for Query Processing on Heterogeneous Computing Resources. Proceedings of the International Conference on Very Large Databases, pages 733–744, 2017.
[11] H. Pirk, O. Moll, M. Zaharia, and S. Madden. Voodoo - A vector algebra for portable database performance on modern hardware. Proceedings of the International Conference on Very Large Databases, pages 1707–1718, 2016.
[12] K.-U. Sattler and O. Dunemann. SQL database primitives for decision tree classifiers. In Proceedings of the International Conference on Information and Knowledge Management, pages 379–386, 2001.