<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Self-Tuning Distribution of DB-Operations on Hybrid CPU/GPU Platforms</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sebastian Breß</string-name>
          <email>bress@iti.cs.uni-magdeburg.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Siba Mohammad</string-name>
          <email>siba.mohammad@st.ovgu.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eike Schallehn</string-name>
          <email>eike@iti.cs.uni-magdeburg.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Otto-von-Guericke University</institution>
          ,
          <addr-line>Magdeburg</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2012</year>
      </pub-date>
      <abstract>
        <p>A current research trend focuses on accelerating database operations with the help of GPUs (Graphics Processing Units). Since GPU algorithms are not necessarily faster than their CPU counterparts, it is important to use them only if they outperform their CPU counterparts. In this paper, we address this problem by constructing a decision model for a framework that is able to distribute database operations response time minimal on CPUs and GPUs. Furthermore, we discuss necessary quality measures for evaluating our model.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        In the context of database tuning, there are many di
erent approaches for performance optimization. A new
opportunity for optimization was introduced with General
Purpose Computation on Graphics Processing Units (GPGPU)
[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. This approach allows to speed up applications that are
suited for parallel processing with the help of GPUs. Parallel
computing architectures like compute uni ed device
architecture (CUDA) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] make it possible to program a GPU
almost as simple as a CPU. This technology opens a new
branch in research that focuses on accelerating applications
using GPUs.
      </p>
      <p>
        CPUs are optimized for a low response time, meaning
that they execute one task as fast as possible. The GPU
is optimized for high throughput, meaning they execute as
many tasks as possible in a xed time. This is
accomplished by massively parallel execution of programs by using
multi threading. Furthermore, GPUs specialize on
computeintensive tasks, which is typical for graphics applications.
Additionally, tasks with much control ow decrease
performance on GPUs, but can be handled well by CPUs [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
Consequently, database operations bene t di erently by
using the GPU. Aggregations are most suited for GPU usage,
whereas selections should be outsourced with care. He et al.
observed that selections are 2-3 times slower on the GPU
compared with the CPU [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
1.1
      </p>
    </sec>
    <sec id="sec-2">
      <title>New Research Trend</title>
      <p>
        A new research trend focuses on speeding up database
operations by performing them on GPUs [
        <xref ref-type="bibr" rid="ref13 ref14 ref4 ref6">4, 6, 13, 14</xref>
        ].
      </p>
      <p>
        He et al. present the concept and implementation of
relational joins on GPUs [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ]. Pirk et al. develop an approach
to accelerate indexed foreign key joins with GPUs [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. The
foreign keys are streamed over the PCIe bus while random
lookups are performed on the GPU. Walkowiak et al.
discuss the usability of GPUs for databases [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. They show the
applicability on the basis of an n-gram based text search
engine. Bakkum et al. develop a concept and implementation
of the SQLite command processor on the GPU [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The main
target of their work is the acceleration of a subset of
possible SQL queries. Govindaraju et al. present an approach
to accelerate selections and aggregations with the help of
GPUs [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. From the examples, we can conclude that a GPU
can be an e ective coprocessor for the execution of database
operations.
1.2
      </p>
    </sec>
    <sec id="sec-3">
      <title>Problem Statement</title>
      <p>We assume that an operation is executed through an
algorithm which uses either CPU or GPU. For which operations
and data is it e cient to execute database operations on a
GPU? A GPU algorithm can be faster or slower as its CPU
counterpart. Therefore, the GPU should be used in a
meaningful way to achieve maximal performance. That means,
only if it is to be expected that the GPU algorithm is faster
it should be executed.</p>
      <p>We do not know a priori which processing device (CPU
or GPU) is faster for which datasets in a given hardware
con guration. We have to take di erent requirements into
account to determine the fastest processing device:
the operation to be executed,
the size of the dataset to process,
the processing power of CPU and GPU (number of
processor cores, clock rate), and
the current load condition on CPU and GPU.</p>
      <p>The following contributions are discussed in the following:
1. A response time minimal distribution of database
operations on CPUs and GPUs can achieve shorter
execution times for operations and speedup query
executions. Hence, a basic idea how such a model can be
constructed is depicted in this paper.
2. For the evaluation of our model, we de ne appropriate
quality measures.</p>
      <p>To the best of our knowledge, there is no self-tuning
decision model that can distribute database operations response
time minimal on CPUs and GPUs.</p>
      <p>The remainder of the paper is structured as follows. In
Section 2, we present our decision model. Then, we discuss
model quality metrics needed for evaluation in Section 3.
Section 4 provides background about related work. Finally,
we present our conclusion and future work.</p>
    </sec>
    <sec id="sec-4">
      <title>DECISION MODEL</title>
      <p>In this section, we present our decision model. At rst,
we discuss the basic model. Then, we explain details of the
estimation and the decision components.
2.1</p>
    </sec>
    <sec id="sec-5">
      <title>Basic Model</title>
      <p>We construct a model that is able to choose the response
time minimal algorithm from a set of algorithms for
processing a dataset D. We store observations of past algorithm
executions and use statistical methods to interpolate future
execution times. We choose the algorithm with the smallest
estimated execution time for processing a dataset.</p>
      <p>De nitions: Let O be a database operation and let APO =
fA1; ::; Amg be an algorithm pool for operation O, i.e., a
set of algorithms that is available for the execution of O.
We assume that every algorithm has di erent performance
characteristics. Hence, the execution times of di erent
algorithms needed to process a dataset D are likely to vary.
A data set D provides characteristic features of the input
data. In this paper, we consider only the data size. Other
characteristics, e.g., data distribution or data types will be
considered in future work.</p>
      <p>Let TA(D) be the execution time of an algorithm to
process the dataset D. We do not consider hardware speci c
parameters like clock rate and number of processor cores to
estimate execution times. Instead, we learn the execution
behavior of every algorithm. For this purpose, we assign to
every algorithm A a learning method LA and a
corresponding approximation function FA(D). Let Test(A; D)=FA(D)
be an estimated execution time of algorithm A for a dataset
D. A measured execution time is referred to as Treal(A; D).
Let a measurement pair (MP) be a tuple (D,Treal(A; D)).
Let M P LA be a measurement pair list, containing all
current measurement pairs of algorithm A.</p>
      <p>
        Statistical Methods: We consider the following
statistical methods for the approximation of the execution
behavior of all algorithms. The rst one is the least squares
method and the second is spline interpolation with cubic
splines [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. We use these approaches because we observe in
our experiments minimal overhead and a good accuracy
(relative error &lt; 10%) of the estimated execution times. While
other methods can learn the execution behavior depending
on more than one feature, they often need a large amount of
time for updating approximation functions and computing
estimations, see Section 4. In the case of the least square
method FA(D) is a polynomial of degree n. In the case of
cubic splines, FA(D) is a spline.
      </p>
      <p>Abstraction: The execution time of algorithms is highly
dependent on speci c parameters of the given processing
hardware. In practice, it is problematic to manage all
parameters, so maintenance would become even more costly.
Hence, we do not consider hardware speci c parameters and
let the model learn the execution behavior of algorithms
using statistical method LA and the corresponding
approximation function FA(D).</p>
      <p>Model Components: The model is composed of three
components. The rst is the algorithm pool APO which
contains all available algorithm of an operation O. The
second is the estimation component which computes execution
time estimations for every algorithm of an operation. The
third part is the decision component which chooses the
algorithm that ts best for the speci ed optimization criteria.
Depending on whether the chosen algorithm uses the CPU
or the GPU, the corresponding operation is executed on the
CPU or the GPU. In this paper, we only consider response
time optimization. However, other optimization criteria like
throughput can be added in future work. Figure 1
summarizes the model structure.
2.2</p>
    </sec>
    <sec id="sec-6">
      <title>Estimation Component</title>
      <p>This section describes the functionality of the estimation
component. First, we discuss when the approximation
functions of the algorithms should be updated. Second, we
examine how the model can adapt to load changes.
2.2.1</p>
      <sec id="sec-6-1">
        <title>Updating the Approximation Functions</title>
        <p>For an operation O that will be applied on a dataset D
the estimation component computes for each available
algorithm of O an estimated execution time. For this, we need
an approximation function that can be used to compute
estimations. Since we learn such functions with statistical
methods, we need a number of observations for each algorithm
to compute the corresponding approximation functions.
After we computed an initial function for each algorithm, our
model is used to make decisions. Hence, the model
operation can be divided into two phases. The rst is the initial
training phase. The second is the operational phase.</p>
        <p>Since the load condition can change over time, execution
times of algorithms are likely to change as well. Hence, the
model should provide a mechanism for an adoption to load
changes. Therefore, the model continuously collects
measurement pairs of all executed algorithms of an operation.</p>
        <p>There are two problems to solve:
1. Re-computation Problem: nd a strategy when to
re-compute the approximation function for each
algorithm, so that a good trade-o between accuracy and
overhead is achieved.
2. Cleanup Problem: delete outdated measurements
pairs from a measurement pair list because the
consideration of measurement pairs from past load
conditions is likely to be less bene cial for estimation
accuracy. Furthermore, every measurement pair needs
storage space and results in higher processing time for
the statistical methods.</p>
        <p>
          There are di erent heuristics that deal with the
re-computation of approximation functions. Zhang et al. present
possible approaches in [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. These are:
1. Re-compute the approximation function always after a
new measurement pair was added to the measurement
pair list.
2. Re-compute the approximation function periodically.
3. Re-compute the approximation function, if the
estimation error exceeds a certain threshold.
operation O
        </p>
        <p>A1,</p>
        <p>A2,...,
algorithm pool An
CPU GPU
dataset D
optimization criterion</p>
        <sec id="sec-6-1-1">
          <title>Test(A1,D),</title>
          <p>Test(A2,D),...,
estimation component Test(An,D)</p>
        </sec>
        <sec id="sec-6-1-2">
          <title>MPLA1 ... MPLAi ... MPLAn</title>
          <p>MP=(D,Treal(Ai))
decision component</p>
          <p>Ai</p>
          <p>As Zhang et al. point out, the most aggressive method is
unlikely to achieve good results because the expected
overhead for re-computing the approximation function
counteracts possible performance gains. Hence, we have to choose
between approach 2 and 3. There is no guarantee that one
approach causes less overhead than the other.</p>
          <p>On one hand, periodic re-computation causes a predictable
overhead, but it may re-compute approximation functions
when it is not necessary, e.g., the relative estimation error
is still beneath an error threshold. On the other hand, an
event based approach re-computes the approximation
functions only if it is necessary, but has the disadvantage, that
the quality of estimations has to decrease until the
approximation functions are re-computed. We choose the periodic
approach, because of it's predictable overhead. We will
consider the event based approach in future work.</p>
          <p>Parameters: Now we discuss necessary parameters of
the model. Let RCR be the re-computation rate of an
algorithm. That means that after RCR measurement pairs were
added to the measurement pair list of an algorithm A, the
approximation function FA(D) of A is re-computed.1 Hence,
the used time measure is the number of added measurement
pairs. Let ITL be the initial training length, which is the
number of measurement pairs that have to be collected for
each algorithm, before the model switches into its
operational phase.</p>
          <p>Now we address the Cleanup Problem. We have to limit
the number of measurement pairs in the measurement pair
list to keep space and computation requirements low.
Furthermore, we want to delete old measurement pairs from the
list that do not contribute to our current estimation problem
su ciently. These requirements are ful lled by a ring bu er
data structure. Let RBS be the ring bu er size in number of
measurement pairs. If a new measurement pair is added to
a full ring bu er, it overrides the oldest measurement pair.
Hence, the usage of a ring bu er solves the Cleanup Problem.
RCR and RBS are related because if RCR is greater than
RBS there would be measurement pairs that are never
considered for computing the approximation functions. Hence,
RCR should be smaller than RBS .</p>
          <p>
            Statistical Methods: The used statistical methods have
to be computationally e cient when computing
approximation functions and estimation values. Additionally, they
should provide good estimations (relative estimation error
&lt;10%). Since the times needed to compute estimations
sums up over time, we consider a method computationally
e cient if it can compute one estimation value in less than
50 s. Our experiments with the ALGLIB [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ] show that this
is the case for the least squares and cubic splines.
1For each operation one measurement pair is added to the
list of the selected algorithm.
          </p>
          <p>Self Tuning Cycle: The model performs the following
self tuning cycle during the operational phase:
1. Use approximation functions to compute execution time
estimations for all algorithms in the algorithm pool of
operation O for the dataset D.
2. Select the algorithm with the minimal estimated
response time.
3. Execute the selected algorithm and measure its
execution time. Add the new measurement pair to the
measurement pair list M P LA of the executed algorithm
A.
4. If the new measurement pair is the RCR new pair in
the list, then the approximation function of the
corresponding algorithm will be re-computed using the
assigned statistical method.
2.2.2</p>
        </sec>
      </sec>
      <sec id="sec-6-2">
        <title>Adaption to Load Changes</title>
        <p>This section focuses on the adaption to load changes of the
model. We start with necessary assumptions, then proceed
with the discussion how the model can adapt to new load
conditions.</p>
        <p>Let ACP U be a CPU algorithm and AGP U a GPU
algorithm for the same operation O.</p>
        <p>The basic assumptions are:
1. Every algorithm is executed on a regular basis, even in
overload conditions. That ensures a continuing supply
of new measurement pairs. If this assumption holds
then a gradual adaption of the approximation
functions to a changed load condition is possible.
2. A change in the load condition has to last a certain
amount of time, otherwise the model cannot adapt
and delivers more vague execution time estimations.
If this assumption does not hold, it is not possible to
continuously deliver good estimations.
3. The load condition of CPU and GPU are not related.</p>
        <p>As long as the assumptions are ful lled, the estimated
execution time curves2 approach the real execution time curves
if the load changes. Increase and decrease of the load
condition on the side of the CPU is symmetric compared to an
equivalent change on the side of the GPU. For simplicity, but
without the loss of generality, we discuss the load adaption
mechanism for the CPU side.
2An execution time curve is the graphical representation of
all execution times an algorithm needs to process all datasets
in a workload.
e
m
i
t
n
o
i
t
u
c
e
x
e
e
m
i
t
n
o
i
t
u
c
e
x
e</p>
        <p>Treal(AGPU,D)
Treal(ACPU,D)
Test (ACPU,D)
data size
data size</p>
        <p>Increasing Load Condition: If the load condition
increases on the side of the CPU then the execution times of
CPU algorithms increase and the real execution time curves
are shifted upwards. In contrast, the estimated execution
time curves stay as they are. We illustrate this situation in
Figure 2. The model is adapted to the new load condition
by collecting new measurement pairs and recomputing the
approximation function. This will shift the estimated
execution time curve in the direction of the measured execution
time curve. Hence, the estimations become more precise.
After at maximum RCR newly added measurement pairs
for one algorithm the estimated execution time curve
approaches the real execution time curve. This implies the
assumption that a re-computation of approximation functions
never has a negative impact on the estimation accuracy.</p>
        <p>Decreasing Load Condition: The case of a decreasing
load is mostly symmetric to the case of increased load. A
decreased load on CPU side causes shorter execution time
of CPU algorithms. The consequence is that real execution
time curves are shifted downwards, whereas the estimated
execution time curves stay as they are.</p>
        <p>Limits of the Adaption: There is the possibility that
the described load adaption scheme can break, if the load
on CPU side increases or decreases too much. We consider
the case, where the load increases too much, the other case
is symmetric. In Figure 3, we illustrate the former case.
Hence, the real execution time curve of ACP U lies above the
real execution time curve of AGP U .</p>
        <p>If this state continues to reside a certain amount of time,
the estimated execution time curves will approach the real
execution time curves. If the load condition normalizes
again, then only algorithm AGP U is executed, regardless of
datasets that can be faster processed by algorithm ACP U .
The model is now stuck in an erroneous state since it
cannot make response time minimal decisions for operation O
anymore.</p>
        <p>However, this cannot happen due to the assumption that
every algorithm is executed on a regular basis, even in
overload conditions. Hence, a continuing supply of new
measurement pairs is ensured which allows the adaption of the
model to the current load condition.</p>
        <p>If we optimize the operation execution for response time,
we want to select the algorithm with the minimal
execution time for the dataset D. In choosing the algorithm with
the minimal execution time, we distribute the operation
response time minimal to CPU and GPU. There is always one
decision per operation that considers the size of the dataset
that has to be processed.</p>
        <p>De nition: Let a workload W be a tuple W = (DS; O),
where DS = D1; D2; ; Dn is a set of datasets Di that are
to be processed and O the operation to be executed.3</p>
        <p>Goal: The goal is to choose the fastest algorithm Aj 2
APO for every dataset Di 2 DS for the execution of
operation O.</p>
        <p>min = X Treal(Ak; Di) with Ak 2 APO</p>
        <p>Di2DS</p>
        <p>Usage: The algorithm with the minimal estimated
execution time for the dataset D is chosen for execution. The
function choose Algorithm (choose Alg) chooses an algorithm
according to the optimization criterion.</p>
        <p>choose Alg(D; O) = Aj with</p>
        <p>Test(Aj; D) = min(fTest(Ak; D)j8Ak 2 APOg)</p>
        <p>It is expected that the accuracy of the estimated execution
times has a large impact on the decisions and the model
quality.</p>
        <p>Practical Use: The execution of the fastest algorithm
reduces the response time of the system and results in better
performance of a DBS. These response time optimization is
necessary to accelerate time critical tasks4. Furthermore, it
3For simplicity, we only consider one operation per
workload. However, a set of operations is more realistic and can
be added in future work.
4A task is an operation in execution.
(1)
(2)
is possible to automatically ne-tune algorithm selection on
a speci c hardware con guration.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>MODEL QUALITY CRITERIA</title>
      <p>In this section, we present four model quality measures,
namely average percentage estimation error, hit rate, model
quality, and percentage speed increase.
3.1</p>
      <p>
        Average Percentage Estimation Error
The idea of the average percentage estimation error is to
compute the estimation error for each executed algorithm
and the corresponding estimation value. Then, the
absolute percentage estimation error is computed. Finally, the
average of all computed percentage estimation errors is
computed. This measure is also called relative error and is used,
e.g., in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
3.2
      </p>
    </sec>
    <sec id="sec-8">
      <title>Hit Rate</title>
      <p>The idea of this measure is to compute the ratio of the
number of correct decisions and the total number of
decisions. A decision is correct, if and only if the model decided
for the fastest algorithm. In the ideal case all decisions are
correct, so the hit rate is 100%. In the worst case all
decisions are wrong. Hence, the hit rate is 0%. The bene t
of the hit rate is that it can provide statistical information
about how many decisions are wrong. If the hit rate is X
then every 1/(1-X) decision is incorrect. However, we
cannot quantify the impact of an incorrect decision. This is
addressed by the model quality. Note, that algorithms with
very similar execution time curves can lead to a bad hit rate,
although the overall performance is acceptable.
3.3</p>
    </sec>
    <sec id="sec-9">
      <title>Model Quality</title>
      <p>The idea of the model quality is to compute the ratio of
the resulting execution times that a system using an ideal
model and a real model needs to process a workload W .
In the ideal case, the real model would be as good as the
ideal model. Hence, the model quality is 100%. The worst
case model would always select the slowest algorithm and
provides a lower bound for the model quality.</p>
      <p>Let TDM (W ) be the time a system using a decision model
(DM ) needs to process a workload W . It is computed out of
the sum of all algorithm execution times resulting from the
algorithm selection of the decision model for the workload
added with the overhead caused by DM . Let TOh(DM; W )
be the overhead the decision model DM causes. If a decision
model introduces to much overhead, it can eliminate their
gain. Hence the overhead has to be considered, which is the
sum of the total time needed for computing estimation
values and the total time needed to re-compute approximation
functions. The time needed for all estimation value
computation (EVC) for a workload W is TEVC(W ). The time needed
to re-compute the approximation functions of all algorithms
is TRC (W ). Both measures are highly dependent on DM .
Hence, we get the following formulas:</p>
      <p>TOh(DM; W ) = TEVC(W ) + TRC (W )
(3)
TDM (W ) =</p>
      <p>X Treal(choose AlgDM (D; O); D) (4)</p>
      <p>D2DS</p>
      <p>Let DMideal be the ideal decision model and let DMreal
be the real decision model. Then, the model quality MQ is
de ned as:</p>
      <p>M Q(W; DMreal) = TDMideal (W )</p>
      <p>TDMreal (W )
(5)</p>
      <p>This measure describes to which degree the optimization
goal described in formula 1 is achieved. Note, that the ideal
model is not introducing overhead (TOh(DMideal; W ) = 0).
3.4</p>
    </sec>
    <sec id="sec-10">
      <title>Percentage Speed Increase</title>
      <p>The idea of the percentage speed increase (PSI) is to
quantify the performance gain, if a decision model DMi is
replaced with a decision model DMj .</p>
      <p>P SI(DMi ! DMj ; W ) =</p>
      <p>TDMi (W )</p>
      <p>TDMj (W )
TDMi (W )
(6)
If P SI(DMi ! DMj ; W ) is greater than zero, DMj had a
better performance than DMi and vice versa.
4.</p>
    </sec>
    <sec id="sec-11">
      <title>RELATED WORK</title>
      <p>In this section, we present the related work. We
consider analytical models for estimating execution times of
GPU algorithms. Furthermore, we discuss learning based
approaches to estimate execution times. Finally, we present
existing decision models and provide a comparison to our
approach.</p>
      <p>
        Analytical Models Hong et al. present an analytical
model which estimates execution times of massively parallel
programs [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The basic idea is to estimate the number of
parallel memory requests. The approach uses information
like number of running threads and the memory bandwidth.
Depending on their case study, relative estimation errors
between 5,4% and 13,3% are observed.
      </p>
      <p>
        Zhang et al. develop a model that should help optimizing
GPU programs by arranging qualitative performance
analyses [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. They use a micro benchmark-based approach which
needs hardware speci c parameters, e.g., the number of
processors or the clock rate. The model cannot adapt to load
changes and therefore, it is not considered for use in our
approach. The observed relative estimation error lies between
5% and 15%.
      </p>
      <p>
        Learning based Execution Time Estimation Akdere
et al. investigate modeling techniques for analytical
workloads [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. They estimate the execution behavior of queries
on the basis of di erent granularities. They presented a
modeling approach for the estimation on query level and
operation level. The basic idea of their approach is to
perform a feature extraction on queries and compute execution
time estimations based on them.
      </p>
      <p>
        Matsunaga et al. present a short overview over machine
learning algorithms and their elds of application [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The
goal is the estimation of the resource usage for an
application. One considered resource is the execution time. The
developed method PQR2 needs a few milliseconds for the
computation of estimations. Since we need for n datasets
and m algorithms n m computations the time for a single
computation should be less than 50 s to keep the overhead
at an absolute minimum. Hence, the approach of Matsunaga
et al. is to be investigated, whether it achieves good results
for a response time minimal operation distribution. This
can be done in future work.
      </p>
      <p>
        Zhang et al. present a model for predicting the costs of
complex XML queries [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. They use the statistical
learning method called "transform regression technique". The
approach allows a self-tuning query optimizer that can
dynamically adapt to load condition changes. The approach
of Zhang and our model are similar in basic functionalities.
The di erence to our approach is that Zhang et al. estimate
the execution time of XML queries whereas our approach
estimates execution times of single database operations to
dynamically distribute them on CPU and GPU.
Additionally, we use a di erent learning method that is optimized for
computational e ciency.
      </p>
      <p>
        Decision Models Kerr et al. present a model that
predicts the performance of similar application classes for di
erent processors [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. The approach allows to choose between
CPU and GPU implementation. This choice is made
statically in contrast to our work, where an algorithm for an
operation execution is chosen dynamically at runtime. Kerr
et al. are using only parameters that are statically known
before program execution. Hence, it allows no adaption to
load changes in contrast to our model that allows load
adaption.
      </p>
      <p>
        Iverson et al. develop an approach that estimates
execution times of tasks in the context of distributed systems [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
The approach, similar to our model, does not require
hardware speci c information. They use the learning method
knearest-neighbor, a non-parametric regression method. We
use least squares and cubic splines that are parametric
regression methods which need less time for computing
estimations compared to non parametric regression methods.
The goal of Iverson et al. is an optimal selection of nodes
in a distributed system, where a task is executed. Unlike
this approach, our work has the goal for optimal algorithm
selection. It is possible to apply the approach of Iverson et
al. on hybrid CPU/GPU platform. However, we consider,
the GPU is a coprocessor of the CPU. Hence, a CPU/GPU
platform is not a distributed system from our point of view.
In a way, our model is less general than the model of Iverson
et al.
      </p>
    </sec>
    <sec id="sec-12">
      <title>CONCLUSION</title>
      <p>In this paper, we addressed a current research problem,
namely the optimal distribution of database operations on
hybrid CPU/GPU platforms. Furthermore, we develop a
self-tuning decision model that is able to distribute database
operations on CPUs and GPUs in a response time
minimal manner. We discuss the basic structure of the model
and provide a qualitative argumentation of how the model
works. Additionally, we present suitable model quality
measures, which are required to evaluate our model. Our
experiments show that our model is almost as good as the ideal
model if the model parameters are set appropriately. We
omit the evaluation due to limited space. We conclude that
distributing database operations on CPUs and GPUs has
a large optimization potential. We believe our model is a
further step to address this issue.</p>
      <p>In ongoing research, we use the de ned quality measures
to evaluate our model on suitable use cases. A further
possible extension is using the model in a more general context,
where response-time optimal decisions have to be made, e.g.,
an optimal index usage. Furthermore, an extension of the
model from operations to queries is necessary for better
applicability of our model. This leads to the problem of hybrid
query plan optimization.
6.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Akdere</surname>
          </string-name>
          and
          <string-name>
            <given-names>U.</given-names>
            <surname>Cetintemel</surname>
          </string-name>
          .
          <article-title>Learning-based query performance modeling and prediction</article-title>
          . IEEE,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>ALGLIB</given-names>
            <surname>Project. ALGLIB</surname>
          </string-name>
          . http://www.alglib.net/,
          <year>2012</year>
          . [Online; accessed 05-January-2012].
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P. R. Anthony</given-names>
            <surname>Ralston</surname>
          </string-name>
          .
          <article-title>A rst course in numerical analysis</article-title>
          .
          <source>dover publications</source>
          ,
          <source>second edition</source>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bakkum</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Skadron</surname>
          </string-name>
          .
          <article-title>Accelerating sql database operations on a gpu with cuda</article-title>
          .
          <source>GPGPU '10</source>
          , pages
          <fpage>94</fpage>
          {
          <fpage>103</fpage>
          , New York, NY, USA,
          <year>2010</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>N. K.</given-names>
            <surname>Govindaraju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lloyd</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Manocha</surname>
          </string-name>
          .
          <article-title>Fast computation of database operations using graphics processors</article-title>
          .
          <source>SIGMOD '04</source>
          , pages
          <fpage>215</fpage>
          {
          <fpage>226</fpage>
          , New York, NY, USA,
          <year>2004</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>B.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. K.</given-names>
            <surname>Govindaraju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Luo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P. V.</given-names>
            <surname>Sander</surname>
          </string-name>
          .
          <article-title>Relational query coprocessing on graphics processors</article-title>
          .
          <source>ACM Trans. Database Syst</source>
          .,
          <volume>34</volume>
          :21:1{
          <fpage>21</fpage>
          :
          <fpage>39</fpage>
          ,
          <year>December 2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>B.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Govindaraju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Luo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Sander</surname>
          </string-name>
          .
          <article-title>Relational joins on graphics processors</article-title>
          .
          <source>SIGMOD '08</source>
          , pages
          <fpage>511</fpage>
          {
          <fpage>524</fpage>
          , New York, NY, USA,
          <year>2008</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hong</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Kim</surname>
          </string-name>
          .
          <article-title>An analytical model for a gpu architecture with memory-level and thread-level parallelism awareness</article-title>
          .
          <source>ACM SIGARCH Computer Architecture News</source>
          ,
          <volume>37</volume>
          (
          <issue>3</issue>
          ):
          <fpage>152</fpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Iverson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ozguner</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G. J.</given-names>
            <surname>Follen</surname>
          </string-name>
          .
          <article-title>Run-time statistical estimation of task execution times for heterogeneous distributed computing</article-title>
          .
          <source>HPDC '96</source>
          , pages
          <fpage>263</fpage>
          {
          <fpage>270</fpage>
          , Washington, DC, USA,
          <year>1996</year>
          . IEEE Computer Society.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kerr</surname>
          </string-name>
          , G. Diamos, and
          <string-name>
            <given-names>S.</given-names>
            <surname>Yalamanchili</surname>
          </string-name>
          .
          <article-title>Modeling gpu-cpu workloads and systems</article-title>
          .
          <source>GPGPU '10</source>
          , pages
          <fpage>31</fpage>
          {
          <fpage>42</fpage>
          , New York, NY, USA,
          <year>2010</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Matsunaga</surname>
          </string-name>
          and
          <string-name>
            <given-names>J. A. B.</given-names>
            <surname>Fortes</surname>
          </string-name>
          .
          <article-title>On the use of machine learning to predict the time and resources consumed by applications</article-title>
          .
          <source>CCGRID</source>
          , pages
          <volume>495</volume>
          {
          <fpage>504</fpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>NVIDIA. NVIDIA CUDA C Programming</surname>
          </string-name>
          <article-title>Guide</article-title>
          . http://developer.download.nvidia.com/compute/ DevZone/docs/html/C/doc/CUDA_C_
          <article-title>Programming_ Guide</article-title>
          .pdf,
          <year>2012</year>
          . pages 1
          <issue>{5</issue>
          , Version 4.0, [Online; accessed 1-February-2012].
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>H.</given-names>
            <surname>Pirk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Manegold</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Kersten</surname>
          </string-name>
          .
          <article-title>Accelerating foreign-key joins using asymmetric memory channels</article-title>
          .
          <source>ADMS '11</source>
          , pages
          <fpage>585</fpage>
          {
          <fpage>597</fpage>
          . VLDB Endowment,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Walkowiak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Wawruch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nowotka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ligowski</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Rudnicki</surname>
          </string-name>
          .
          <article-title>Exploring utilisation of gpu for database applications</article-title>
          .
          <source>Procedia Computer Science</source>
          ,
          <volume>1</volume>
          (
          <issue>1</issue>
          ):
          <volume>505</volume>
          {
          <fpage>513</fpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>N.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Haas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Josifovski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Lohman</surname>
          </string-name>
          , and
          <string-name>
            <surname>C. Zhang.</surname>
          </string-name>
          <article-title>Statistical learning techniques for costing xml queries</article-title>
          .
          <source>VLDB '05</source>
          , pages
          <fpage>289</fpage>
          {
          <fpage>300</fpage>
          . VLDB Endowment,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          and
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Owens</surname>
          </string-name>
          .
          <article-title>A quantitative performance analysis model for gpu architectures</article-title>
          .
          <source>Computer Engineering</source>
          , pages
          <volume>382</volume>
          {
          <fpage>393</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>