<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>PhD Workshop, September</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Heterogeneity-Aware Query Optimization</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tomas Karnagel Supervised by Wolfgang Lehner</string-name>
          <email>tomas.karnagel@tu-dresden.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Database Technology Group</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Technische Universität Dresden</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <volume>9</volume>
      <issue>2016</issue>
      <abstract>
        <p>The hardware landscape is changing from homogeneous systems towards multiple heterogeneous computing units within one system. For database systems, this is an opportunity to accelerate query processing if the heterogeneous resources can be utilized efficiently. For this goal, we investigate novel query optimization concepts for heterogeneous resources like placement granularity, execution estimation, optimization granularity, and data handling. In the end, we combine these concepts in a specialized optimization stage during query optimization together with a unique way of evaluating our optimizations in existing database systems.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        In the recent past, database system performance was
mainly bound by disk accesses. With increasing main
memory sizes, the bottleneck shifts towards computation as
more and more data can be kept close to the processor.
To increase the computational performance in homogeneous
environments, parallel execution on multiple cores has been
studied. However, recent systems are becoming more and
more heterogeneous, including different types of computing
units (CUs) to improve efficiency and energy consumption,
ideally preventing dark silicon [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>The main challenge for database systems is to adapt to
the new heterogeneous hardware environment with its
differences in computing unit architectures, memory hierarchies,
and connections to the main memory.</p>
      <p>
        Previous research has been mainly about porting
operators to new hardware platforms like GPUs and FPGAs.
While this is important, single operators do not represent
full database systems with complex architectures and a
variety of workloads. In recent work, full database systems with
GPU support have been proposed [
        <xref ref-type="bibr" rid="ref1 ref3 ref4 ref9">1, 3, 4, 9</xref>
        ]. These
systems allow detailed evaluation of heterogeneous execution;
however, most of them do not understand the underlying
hardware but merely execute a query on a predefined CU.
      </p>
      <p>In our work, we investigate dedicated query
optimization for heterogeneous computing resources. For this,
the system needs to, first, understand the underlying
hardware environment and, second, utilize it automatically in
the best possible way during query processing. We motivate
our direction of research with a single-operator case study
and define optimization concepts before proposing an ideal
system setup. We are currently in the implementation and
evaluation phase, where we apply our optimizations within
multiple open-source database systems.
</p>
    </sec>
    <sec id="sec-2">
      <title>2. MOTIVATION AND DIRECTION</title>
      <p>As a starting point, we present a single-operator
case study to motivate the direction of our research.</p>
    </sec>
    <sec id="sec-3">
      <title>2.1 Case Study: Group-By Operator</title>
      <p>For our case study, we use a hash-based group-by
operator on different CUs, implemented in OpenCL. The applied
hash-table uses FNV1a as hash function and a fill factor of
0.5, assuming the number of groups is known from the
optimizer. We implement the operator to scan only one column
while storing the group name and a count value, as it would
be used for the following SQL query:
SELECT num, count(*) FROM numbers GROUP BY num;</p>
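      <p>The operator described above can be approximated by the following minimal, single-threaded Python sketch. It is not the OpenCL implementation used in the case study: the 32-bit FNV-1a variant, the hashing of each key's four little-endian bytes, the use of linear probing, and the names fnv1a_32 and group_by_count are assumptions made purely for illustration.</p>
      <preformat>
```python
FNV_OFFSET_BASIS = 2166136261  # standard 32-bit FNV offset basis
FNV_PRIME = 16777619           # standard 32-bit FNV prime


def fnv1a_32(key: int) -> int:
    """Hash the four little-endian bytes of a 32-bit key with FNV-1a."""
    h = FNV_OFFSET_BASIS
    for byte in key.to_bytes(4, "little"):
        h ^= byte                      # FNV-1a: XOR the byte first ...
        h = (h * FNV_PRIME) % 2**32    # ... then multiply, modulo 2^32
    return h


def group_by_count(column, num_groups, fill_factor=0.5):
    """Count occurrences per key, like SELECT num, count(*) ... GROUP BY num.

    Assumes that the column holds at most num_groups distinct keys,
    mirroring the assumption that the optimizer knows the group count.
    """
    # Size the table so that num_groups entries fill it to fill_factor.
    size = max(1, int(num_groups / fill_factor))
    keys = [None] * size
    counts = [0] * size
    for value in column:
        slot = fnv1a_32(value) % size
        # Linear probing (assumption): advance until the key or a free slot is found.
        while keys[slot] is not None and keys[slot] != value:
            slot = (slot + 1) % size
        keys[slot] = value
        counts[slot] += 1
    return {k: c for k, c in zip(keys, counts) if k is not None}
```
      </preformat>
      <p>For instance, calling group_by_count([3, 1, 3, 0, 1, 3], num_groups=4) counts key 3 three times, key 1 twice, and key 0 once. Contention effects such as the ones discussed in this case study only appear in a parallel implementation with atomic updates, which this sketch does not model.</p>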
      <p>
        The input values (64 MB, 16.7 million values) lie in the range
[0, #groups) and are randomly distributed within the input
column. We store the input column in the system's main
memory (RAM) and evaluate the full execution runtime
including zero-copy accesses, where the data is streamed to
the CU on demand. When executing the operator, we see
several effects that lead to partly severe performance issues
(Figure 1). In previous work [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], we explained the effects for
an Nvidia GPU in detail:
        <list list-type="order">
          <list-item><p>The spikes are created by high hashing contention that mainly occurs using FNV1a with certain hash-table sizes and data distributions.</p></list-item>
          <list-item><p>For #groups &lt;100, we see problems with atomic accesses because many threads try to update a small number of hash-table buckets simultaneously.</p></list-item>
          <list-item><p>For hash-tables &gt;1.5 MB, the hash-table does not fit in the GPU's L2 cache for fast execution.</p></list-item>
          <list-item><p>For hash-tables &gt;2 GB, the execution experiences a great slow-down through TLB cache problems.</p></list-item>
        </list>
      </p>
      <p>Out of all effects, the spikes are the only ones that can
be seen on all CUs, since they are software-based issues.
On all CUs, they occur repeatably at exactly the
same positions; however, the height of the spikes depends on
the CU. The other effects and the overall performance
differ greatly, which is caused by different cache sizes, different
connections to the system (e.g., PCIe2 or PCIe3), or entirely
different architectures. Comparing all three executions, no single
CU is superior to the others. In our experiment, we tested
more than 7000 different group counts, where the Nvidia GPU
was the fastest in 71.4% of all cases, followed by the AMD
GPU with 22.5% and the Intel Xeon Phi with 6.1%. The
Xeon Phi becomes more important for larger hash-tables
(&gt;2 GB), since its runtime scales much better.
Figure 1 panels: (a) Nvidia K80 GPU (PCIe3), (b) AMD HD7950 GPU (PCIe2),
(c) Intel Xeon Phi 7120 (PCIe2).
</p>
    </sec>
    <sec id="sec-5">
      <title>2.2 Implementation Approaches for Heterogeneity-aware Database Systems</title>
      <p>Based on the performance differences and the effects in our
case study, we see two directions to implement a database
system using heterogeneous computing resources.</p>
      <p>
        The first approach would be to choose a single CU, e.g.,
the Nvidia GPU, and optimize the operator for ideal
execution on this particular CU. Previously, we did this for
the group-by operator [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] by adjusting execution
parameters and implementing algorithmic changes together with
an integrated optimizer to define the ideal configuration.
For a full system approach, these adjustments need to be
made for every operator in consideration of data sizes and
data distribution, resulting in a high number of fine-grained
optimizations. Once this huge effort is made, it probably
results in the best possible performance for the supported CU;
however, it is not portable. To support a different hardware
setup, the optimization effort for each database operator has
to be revisited, adjusted, and fine-tuned. This can only be
done by large development teams, and it limits the support
to only a few selected CUs.
      </p>
      <p>The second approach, which is explored in this work, is
more adaptive. Instead of understanding and optimizing
every single effect of each operator on each CU, we propose to
support as many CUs as possible, while dynamically
defining the execution location (operator placement) depending
on the best runtime. Having multiple CUs to choose from
gives us the opportunity to execute on the ideal CU for a
given operator and workload. For the few CUs supported by
the first approach, the performance will be lower, because
the operator implementations are less optimized. However,
it will provide the best possible performance for any given
setup of operators and CUs, without the huge effort of
fine-tuning. Additionally, it is highly portable since there are no
hard-coded hardware-specific optimizations.
</p>
    </sec>
    <sec id="sec-6">
      <title>2.3 Distinction</title>
      <p>Following the adaptive approach, we focus on query
optimization for heterogeneous computing resources, instead
of building an entirely new database system. Furthermore,
there are several related topics that we specifically exclude
at this point in time:</p>
      <p>Specific operator implementations. Operator
implementations are important but have been researched
extensively over the past 10 years. Different implementations lead
to performance differences; however, they do not affect
the design of the heterogeneity-aware query optimizer.</p>
      <p>Memory heterogeneity. At this point, we are not
looking at different memory types such as non-volatile memory
vs. volatile memory or SSD vs. HDD. Memory types are
important for persistence and recovery considerations; however,
we are focusing our research on compute heterogeneity.</p>
      <p>Distributed systems and network heterogeneity.
At the moment, we are looking at single-node systems with
a scale-up approach by adding more CUs. However, our
findings can be easily reused in a distributed environment,
where we can map transfer costs between CUs to transfer
costs between nodes and a node can consist of multiple CUs.
</p>
    </sec>
    <sec id="sec-7">
      <title>3. OPTIMIZATION CONCEPTS</title>
      <p>The main part of this thesis is identifying and
investigating optimizer design choices to make database systems
heterogeneity-aware. As a starting point, we assume a
column-based database system with a column-at-a-time approach
since we mainly want to focus on large OLAP queries. In
the following, we want to present multiple design choices and
brief discussions on the most promising directions. Please
refer to the cited papers for more details.
</p>
    </sec>
    <sec id="sec-8">
      <title>3.1 Placement Granularity</title>
      <p>As a main idea, we want to place parts of a database query
on the CUs where they show the best execution time, taking
data transfer costs into account. However, the granularity
of the work that is actually placed needs to be defined. In
query processing, we see three possible granularities.</p>
      <p>Query Granularity. One single placement decision is
made for a whole query, which is then executed on one
CU. This can be beneficial when there are many
concurrent queries that need to be executed, so that all CUs can
be used concurrently.</p>
      <p>Database Operator Granularity. One placement
decision is made for each database operator, leading to a
heterogeneous execution within a single query.</p>
      <p>Sub-Operator Granularity. Sub-operators are reusable
execution functions of an operator, e.g., a hash join may
consist of a hash-table creation and a hash-table probe, and
therefore it has two sub-operators. The same hash-table
creation step can be part of a hash-based group-by
implementation. This granularity allows a fine-grained match between
execution behavior and CU.</p>
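      <p>The decomposition into sub-operators can be illustrated with a small Python sketch. The operator and sub-operator names, the second step of the group-by ("aggregate"), and the helper functions are hypothetical; only the split of a hash join into hash-table creation and probe, and the reuse of the creation step by a hash-based group-by, are taken from the text.</p>
      <preformat>
```python
# Hypothetical mapping from database operators to reusable sub-operators.
SUB_OPERATORS = {
    "hash_join": ["hash_table_create", "hash_table_probe"],
    "hash_group_by": ["hash_table_create", "aggregate"],  # "aggregate" is assumed
}


def split_plan(plan):
    """Expand a list of operator names into a list of sub-operator names."""
    expanded = []
    for operator in plan:
        # Operators without a known decomposition stay as a single unit.
        expanded.extend(SUB_OPERATORS.get(operator, [operator]))
    return expanded


def shared_sub_operators(plan):
    """Return the sub-operators used by more than one operator in the plan."""
    users = {}
    for operator in plan:
        for sub in SUB_OPERATORS.get(operator, [operator]):
            users.setdefault(sub, set()).add(operator)
    return sorted(sub for sub, ops in users.items() if len(ops) > 1)
```
      </preformat>
      <p>In this sketch, a plan containing both a hash join and a hash-based group-by reports hash_table_create as shared, which is the kind of reuse that makes a placement decision per sub-operator attractive.</p>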
      <p>We choose the sub-operator granularity as the most
promising approach because of its fine-grained decisions. In the
remainder of this paper, we will use the term operator for the
placement object to show the general applicability of the
proposed approaches.
</p>
    </sec>
    <sec id="sec-8b">
      <title>3.2 Execution Estimation</title>
      <p>
        Before optimizing query processing on heterogeneous
computing resources, the database system needs to know the
execution time of operators. Traditionally, cardinality
estimation has been used to find the best query plan. With
different heterogeneous CUs, additional runtime-based
estimation is needed, because even the same cardinalities can lead
to different runtimes on different CUs. For this estimation,
we proposed the Heterogeneity-aware Operator Placement
Model (HOP) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which is based on unassisted learning of
execution times, using interpolation between known
executions. Additionally, data transfers and scenarios with yet
unknown execution times are considered.
      </p>
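      <p>The idea of learning execution times and interpolating between known executions can be sketched as follows. This is not the HOP model itself: the class name RuntimeEstimator, the use of plain linear interpolation over the data size, and the fallback to the nearest known point outside the observed range are assumptions made for illustration.</p>
      <preformat>
```python
from bisect import bisect_left


class RuntimeEstimator:
    """Learns (data size, runtime) points for one operator on one CU."""

    def __init__(self):
        self.sizes = []     # sorted data sizes of known executions
        self.runtimes = []  # measured runtimes, aligned with self.sizes

    def observe(self, size, runtime):
        """Record a measured execution, keeping the points sorted by size."""
        i = bisect_left(self.sizes, size)
        if i != len(self.sizes) and self.sizes[i] == size:
            self.runtimes[i] = runtime  # newer measurement replaces the old one
        else:
            self.sizes.insert(i, size)
            self.runtimes.insert(i, runtime)

    def estimate(self, size):
        """Return an estimated runtime, or None if nothing is known yet."""
        if not self.sizes:
            return None
        i = bisect_left(self.sizes, size)
        if i == 0:
            return self.runtimes[0]    # below the known range: nearest point
        if i == len(self.sizes):
            return self.runtimes[-1]   # above the known range: nearest point
        x0, x1 = self.sizes[i - 1], self.sizes[i]
        y0, y1 = self.runtimes[i - 1], self.runtimes[i]
        # Linear interpolation between the two neighbouring known executions.
        return y0 + (y1 - y0) * (size - x0) / (x1 - x0)
```
      </preformat>
      <p>One estimator instance per operator and CU would be filled through observe() after each execution. A return value of None corresponds to the case of a yet unknown execution time, which a placement strategy has to treat separately, for example by trying that CU once.</p>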
    </sec>
    <sec id="sec-9">
      <title>3.3 Optimization Granularity</title>
      <p>The optimization granularity defines how much knowledge
is needed for the optimization.</p>
      <p>A local strategy would decide the placement solely for
one operator at a time. The chosen placement minimizes the
combined cost of input data transfers and actual
execution. For example, assuming the data lies in main memory,
a GPU is only used if data transfer plus execution is faster
than the execution on the CPU, where data does not need
to be transferred.</p>
      <p>
        A global strategy would look beyond one operator at the
whole query plan. There, transfer costs between different
operators can be included in the optimization, leading to
globally optimized executions and transfers, while the local
strategy does not optimize beyond one operator execution.
To apply global optimization, the system has to consider all
operators of a query (#op) and all CUs of the system (#cu),
leading to a search space of #cu<sup>#op</sup> (for example, about 14.3 million
different placements for 15 operators and 3 CUs). To cope
with this huge search space, we developed ways to reduce the
number of considered operators together with a light-weight
greedy algorithm for efficient placement optimization [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
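      <p>The difference between the two strategies, and the size of the global search space, can be made concrete with a toy example. The sketch below uses invented runtimes and a single invented transfer cost for a chain of three operators on two CUs, and it implements the global strategy by exhaustive enumeration. It is therefore not the greedy algorithm or the search space reduction mentioned above, only an illustration of what these have to approximate.</p>
      <preformat>
```python
from itertools import product

# Hypothetical cost data for a chain of three operators and two CUs.
CUS = ["cpu", "gpu"]
EXEC_TIME = [                      # EXEC_TIME[i][cu]: runtime of operator i on cu
    {"cpu": 3.0, "gpu": 2.0},
    {"cpu": 6.0, "gpu": 1.0},
    {"cpu": 2.0, "gpu": 2.0},
]
TRANSFER = 2.0                     # cost of moving data between two CUs

# 15 operators on 3 CUs already give 3 ** 15 = 14348907 possible placements.
SEARCH_SPACE_EXAMPLE = 3 ** 15


def plan_cost(placement, start="cpu"):
    """Total runtime of the chain: executions plus transfers between CUs."""
    total, location = 0.0, start
    for i, cu in enumerate(placement):
        if cu != location:
            total += TRANSFER
        total += EXEC_TIME[i][cu]
        location = cu
    return total


def local_placement(start="cpu"):
    """Decide one operator at a time from its input transfer plus execution."""
    placement, location = [], start
    for times in EXEC_TIME:
        best = min(CUS, key=lambda cu: times[cu] + (TRANSFER if cu != location else 0.0))
        placement.append(best)
        location = best
    return tuple(placement)


def global_placement(start="cpu"):
    """Enumerate all #cu ** #op placements and keep the cheapest one."""
    return min(product(CUS, repeat=len(EXEC_TIME)), key=lambda p: plan_cost(p, start))
```
      </preformat>
      <p>With these invented numbers, the local strategy keeps the first operator on the CPU because the transfer does not pay off for that operator alone, and ends with a total cost of 8.0. The global strategy moves the data once at the beginning and runs all three operators on the GPU for a total cost of 7.0. The cost of returning the final result to the CPU is not modeled.</p>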
      <p>
        We implemented both strategies in an OpenCL-based
database system and compared the performance [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. While the
placements of both strategies are different, the execution
times do not differ much, because long-running, influential
operators are placed on the same CUs by both strategies.
However, we showed that global optimization is more
robust for inconclusive decisions where multiple operators can
benefit from each other's placement.
      </p>
    </sec>
    <sec id="sec-10">
      <title>3.4 Data Handling</title>
      <p>Normally, data handling involves transferring data to the
CU where it is needed if the data is not there already.</p>
      <p>To enhance this naive approach, we propose to adapt
the data movement to an operator's data access
type. This can be achieved by allowing replicas of memory
objects on different CUs, as long as data is only read. Then,
different operators can access replicas of data on different
CUs, allowing parallel execution and more freedom to find
the ideal operator placement without being limited by high
transfer costs. However, when an operator updates a
memory object, every replica that is not updated has to
be deleted to keep the data consistent.</p>
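      <p>The replica rule described above (readers may add replicas, a writer invalidates all other replicas) can be sketched in a few lines of Python. The class MemoryObject, its fields, and the assumption that a writer first needs a current copy of the data on its own CU are illustrative and not taken from an existing system.</p>
      <preformat>
```python
class MemoryObject:
    """Tracks which CUs hold a valid replica of one memory object."""

    def __init__(self, name, home):
        self.name = name
        self.replicas = {home}   # CUs that currently hold valid data
        self.transfers = 0       # number of copies made so far

    def read(self, cu):
        """A reader may create an additional replica on its own CU."""
        if cu not in self.replicas:
            self.transfers += 1  # copy from any valid replica
            self.replicas.add(cu)

    def write(self, cu):
        """A writer updates its local replica; all other replicas are deleted."""
        self.read(cu)            # assumption: the writer needs the current data first
        self.replicas = {cu}     # stale replicas are dropped to stay consistent
```
      </preformat>
      <p>In this sketch, repeated reads on a CU that already holds a replica cause no further transfer, while a read on another CU after a write has to copy the data again. This is the trade-off that the placement optimization has to take into account.</p>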
    </sec>
    <sec id="sec-11">
      <title>4. IDEAL SYSTEM SETUP</title>
      <p>Having investigated the possible optimization concepts for
heterogeneous computing resources, we now want to define
an ideal system setup with heterogeneous resource
optimization to utilize these resources in the best possible way.</p>
    </sec>
    <sec id="sec-12">
      <title>4.1 System Integration</title>
      <p>The first question is how to integrate heterogeneous
resource optimization with traditional query
optimization.</p>
      <p>
        Execution Engine. The presented optimizations can
be implemented in the database's execution engine, being
applied directly before an operator's execution. We
implemented and evaluated such a system [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. However, with
this approach, global optimization is not possible due to the
missing global view.
      </p>
      <p>Integrated. The optimizations could be deeply
integrated within the database optimizer. The optimizer has all
the global information for hardware optimization, but it also
has a sophisticated optimization framework and strategies,
where adding heterogeneous resource optimizations would
increase the optimization complexity significantly.</p>
      <p>Separate Optimization Stage. We propose a middle
path: an additional stage of query optimization. As is
usually the case, the database system first optimizes the
query plan logically using query rewriting techniques. Then,
the physical query operators are defined in the physical
optimization. Afterwards, the physically optimized plan is
further optimized for the heterogeneous resources in a separate
stage. The main motivation for this approach is the
separation of concerns: each stage can optimize
independently, allowing simpler architectures, better maintenance,
and reduced search spaces.</p>
    </sec>
    <sec id="sec-13">
      <title>4.2 Heterogeneous Resource Optimization</title>
      <p>Within the separate optimization stage for heterogeneous
resources, we apply our concepts in several steps. We
assume that we receive a fully logically and physically optimized plan
from the prior optimization stages. Then, we apply the
following steps, which are illustrated in Figure 2:
        <list list-type="order">
          <list-item><p>Split up the database operators into sub-operators (as explained in Section 3.1).</p></list-item>
          <list-item><p>Apply data access information (as in Section 3.4). Multiple sub-operators accessing the same data can choose between replicas to potentially avoid data transfers, and read-only operators can be executed independently, so dependencies can be reordered (Figure 2 (2)). A writer has to wait until previous readers have finished before updating one replica and deleting the others.</p></list-item>
          <list-item><p>Estimate both the possible execution time for each sub-operator on each CU and the transfer costs between CUs. These estimations are done locally for one sub-operator or transfer at a time using our model presented in Section 3.2. For example, Figure 2 (3) shows only the sub-operators' execution times.</p></list-item>
          <list-item><p>Finally, having all the estimated runtimes and transfer times, we apply global optimization (as in Section 3.3) to find the placement with the overall best runtime.</p></list-item>
        </list>
After applying these four steps, the heterogeneous optimizer
can pass an enhanced sub-operator-based query plan with
assigned placement decisions to the execution engine for
heterogeneous execution.</p>
    </sec>
    <sec id="sec-14">
      <title>4.3 Evaluation (current progress)</title>
      <p>
        To evaluate our optimization approach, we thought about
rewriting the database optimizer of heterogeneity-supporting
DBMS like Ocelot [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and gpuDB [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. However, this would
only be an isolated system-specific analysis. To broaden the
scope of our evaluation, we decided to reuse the basic
technology many of these DBMS use to support heterogeneous
hardware: OpenCL. We can intercept the OpenCL
communication of these systems to the OpenCL driver, optimize
the given query, and execute the work heterogeneously,
depending on the available CUs. Technically, we do this by
implementing our own OpenCL driver that is loaded by the
database system. Using the driver approach, the database
code does not need to be adjusted to support our
optimizations. However, implementing our optimization stage into
an industry-size database system is left for future work.
      </p>
    </sec>
    <sec id="sec-15">
      <title>5. CONTRIBUTIONS</title>
      <p>In this section, we would like to highlight the contributions
of this thesis and differentiate them from related work.</p>
      <p>
        We base our work on many previous publications
including full system approaches like Ocelot [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and gpuDB [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
These systems currently rely on a manually specified input
to define the CU on which the whole query is executed.
With our optimizer approach, we can make these systems
heterogeneity-aware and improve their performance without the
need for manual input. Our contributions are in detail:
      </p>
      <p>
1. Providing an overall investigation for
heterogeneity-aware query optimization. Related work
includes the heterogeneity-aware database systems CoGaDB [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
and gpuQP [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Neither is an explicit query optimizer; both are
actual database systems. Both systems define the placement
of database operators, but their focus is more on the
system design and the runtime estimation model than on the
actual query optimization.
      </p>
      <p>
        2. Proposing a novel decision model for
runtime-based cost estimation. gpuQP [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] uses a cost-per-tuple
computation, which is fine-tuned in a startup phase by
micro-benchmarks. CoGaDB uses a learning-based approach
with spline interpolation to compute runtime estimations.
However, only our model, using learning-based estimation
on learned data points, is able to represent fine-grained
behavior as we have seen in Section 2.1.
      </p>
      <p>3. Investigating global optimization together with
proposing a search space reduction approach and a
well-performing greedy algorithm. To our knowledge,
there is no related work on global query optimization for
heterogeneous computing units. Well-known query optimization
techniques do not apply to this problem, because every operator
can be placed independently, which does not allow any pruning
of possible solutions.</p>
      <p>
        4. Discussing approaches for placement
granularity, optimization granularity, and system
integration. Placement granularity was discussed for gpuQP [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],
where placement is done on primitives, which then build
larger query operators. This approach is similar to our
sub-operator granularity. Ocelot [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and gpuDB [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] work on
query-granularity, where the CU is set manually for each
query. We do not have any detailed information about the
optimization granularity or the exact integration level of
optimization for these database systems.
      </p>
    </sec>
    <sec id="sec-16">
      <title>6. CONCLUSION</title>
      <p>In this thesis, we investigated heterogeneity-aware query
optimization within database systems. We motivated
our direction of query operator placement with a case
study using one operator and multiple CUs. For operator
placement, we investigated several concepts of optimization,
explained possible options, and defined our approach.
Finally, we proposed an ideal system setup by defining an
integration approach and the specific steps of the optimization
stage. Our approach is implemented using existing database
systems and an OpenCL-based extension approach.</p>
    </sec>
    <sec id="sec-17">
      <title>ACKNOWLEDGMENTS</title>
      <p>This work is funded by the German Research Foundation
(DFG) within the Cluster of Excellence "Center for
Advancing Electronics Dresden". Parts of the hardware were
generously provided by the Dresden GPU Center of Excellence.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Breß</surname>
          </string-name>
          .
          <article-title>The Design and Implementation of CoGaDB: A Column-oriented GPU-accelerated DBMS</article-title>
          .
          <source>Datenbank-Spektrum</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.</given-names>
            <surname>Esmaeilzadeh</surname>
          </string-name>
          , E. Blem,
          <string-name>
            <given-names>R.</given-names>
            <surname>St. Amant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Sankaralingam</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Burger</surname>
          </string-name>
          .
          <article-title>Dark silicon and the end of multicore scaling</article-title>
          .
          <source>ISCA</source>
          <year>2011</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. K.</given-names>
            <surname>Govindaraju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Luo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P. V.</given-names>
            <surname>Sander</surname>
          </string-name>
          .
          <article-title>Relational Query Coprocessing on Graphics Processors</article-title>
          .
          <source>ACM Trans. Database Syst.</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Heimel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Saecker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Pirk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Manegold</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Markl</surname>
          </string-name>
          .
          <article-title>Hardware-Oblivious Parallelism for In-Memory Column-Stores</article-title>
          . PVLDB,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Karnagel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Habich</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Lehner</surname>
          </string-name>
          .
          <article-title>Local vs. Global Optimization: Operator Placement Strategies in Heterogeneous Environments</article-title>
          .
          <source>In Proceedings of the Workshops of the EDBT/ICDT</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Karnagel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Habich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Schlegel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Lehner</surname>
          </string-name>
          .
          <article-title>Heterogeneity-aware Operator Placement in Column-Store DBMS</article-title>
          .
          <source>Datenbank-Spektrum</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Karnagel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hille</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ludwig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Habich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Lehner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Heimel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Markl</surname>
          </string-name>
          .
          <article-title>Demonstrating efficient query processing in heterogeneous environments</article-title>
          .
          <source>In Proceedings of the 2014 ACM SIGMOD</source>
          , New York, NY, USA. ACM.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Karnagel</surname>
          </string-name>
          , R. Muller, and
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Lohman</surname>
          </string-name>
          .
          <article-title>Optimizing GPU-accelerated Group-By and Aggregation</article-title>
          . In
          <source>ADMS'15</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <article-title>The Yin and Yang of Processing Data Warehousing Queries on GPU Devices</article-title>
          .
          <source>Proc. VLDB Endow.</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>