        Self-Tuning Distribution of DB-Operations on Hybrid
                        CPU/GPU Platforms

Sebastian Breß (bress@iti.cs.uni-magdeburg.de)
Siba Mohammad (siba.mohammad@st.ovgu.de)
Eike Schallehn (eike@iti.cs.uni-magdeburg.de)
Otto-von-Guericke University Magdeburg

24th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 29.05.2012 - 01.06.2012, Lübbenau, Germany. Copyright is held by the author/owner(s).


ABSTRACT

A current research trend focuses on accelerating database operations with the help of GPUs (Graphics Processing Units). Since GPU algorithms are not necessarily faster than their CPU counterparts, it is important to use them only if they outperform their CPU counterparts. In this paper, we address this problem by constructing a decision model for a framework that is able to distribute database operations on CPUs and GPUs in a response time minimal manner. Furthermore, we discuss the quality measures necessary for evaluating our model.

1. INTRODUCTION

In the context of database tuning, there are many different approaches for performance optimization. A new opportunity for optimization was introduced with General Purpose Computation on Graphics Processing Units (GPGPU) [12]. This approach allows applications suited for parallel processing to be sped up with the help of GPUs. Parallel computing architectures like the Compute Unified Device Architecture (CUDA) [12] make it possible to program a GPU almost as simply as a CPU. This technology opens a new branch of research that focuses on accelerating applications using GPUs.

CPUs are optimized for low response time, meaning that they execute one task as fast as possible. GPUs are optimized for high throughput, meaning that they execute as many tasks as possible in a fixed time. This is accomplished by massively parallel execution of programs using multithreading. Furthermore, GPUs specialize in compute-intensive tasks, which are typical for graphics applications. Additionally, tasks with much control flow decrease performance on GPUs, but can be handled well by CPUs [12]. Consequently, database operations benefit to different degrees from using the GPU. Aggregations are well suited for GPU usage, whereas selections should be outsourced with care. He et al. observed that selections are 2-3 times slower on the GPU than on the CPU [6].

1.1 New Research Trend

A new research trend focuses on speeding up database operations by performing them on GPUs [4, 6, 13, 14]. He et al. present the concept and implementation of relational joins on GPUs [6, 7]. Pirk et al. develop an approach to accelerate indexed foreign key joins with GPUs [13]. The foreign keys are streamed over the PCIe bus while random lookups are performed on the GPU. Walkowiak et al. discuss the usability of GPUs for databases [14]. They show the applicability on the basis of an n-gram based text search engine. Bakkum et al. develop a concept and implementation of the SQLite command processor on the GPU [4]. The main target of their work is the acceleration of a subset of possible SQL queries. Govindaraju et al. present an approach to accelerate selections and aggregations with the help of GPUs [5]. From these examples, we conclude that a GPU can be an effective coprocessor for the execution of database operations.

1.2 Problem Statement

We assume that an operation is executed by an algorithm that uses either the CPU or the GPU. For which operations and data is it efficient to execute database operations on a GPU? A GPU algorithm can be faster or slower than its CPU counterpart. Therefore, the GPU should be used in a meaningful way to achieve maximal performance. That means a GPU algorithm should only be executed if it is expected to be faster than its CPU counterpart.

We do not know a priori which processing device (CPU or GPU) is faster for which datasets in a given hardware configuration. We have to take different requirements into account to determine the fastest processing device:

• the operation to be executed,
• the size of the dataset to process,
• the processing power of CPU and GPU (number of processor cores, clock rate), and
• the current load condition on CPU and GPU.

We discuss the following contributions:

1. A response time minimal distribution of database operations on CPUs and GPUs can achieve shorter execution times for operations and speed up query execution. Hence, this paper presents the basic idea of how such a model can be constructed.

2. For the evaluation of our model, we define appropriate quality measures.
To the best of our knowledge, there is no self-tuning decision model that can distribute database operations on CPUs and GPUs in a response time minimal manner.

The remainder of the paper is structured as follows. In Section 2, we present our decision model. Then, we discuss the model quality measures needed for evaluation in Section 3. Section 4 provides background on related work. Finally, we present our conclusion and future work.
2. DECISION MODEL

In this section, we present our decision model. First, we discuss the basic model. Then, we explain the details of the estimation and decision components.

2.1 Basic Model

We construct a model that is able to choose the response time minimal algorithm from a set of algorithms for processing a dataset D. We store observations of past algorithm executions and use statistical methods to interpolate future execution times. We choose the algorithm with the smallest estimated execution time for processing a dataset.

Definitions: Let O be a database operation and let APO = {A1, .., Am} be an algorithm pool for operation O, i.e., a set of algorithms that is available for the execution of O. We assume that every algorithm has different performance characteristics. Hence, the execution times different algorithms need to process a dataset D are likely to vary. A dataset D provides characteristic features of the input data. In this paper, we consider only the data size. Other characteristics, e.g., data distribution or data types, will be considered in future work.

Let TA(D) be the execution time of an algorithm A to process the dataset D. We do not consider hardware specific parameters like clock rate and number of processor cores to estimate execution times. Instead, we learn the execution behavior of every algorithm. For this purpose, we assign to every algorithm A a learning method LA and a corresponding approximation function FA(D). Let Test(A, D) = FA(D) be the estimated execution time of algorithm A for a dataset D. A measured execution time is referred to as Treal(A, D). Let a measurement pair (MP) be a tuple (D, Treal(A, D)). Let MPLA be a measurement pair list containing all current measurement pairs of algorithm A.
Statistical Methods: We consider the following statistical methods for the approximation of the execution behavior of all algorithms: the least squares method and spline interpolation with cubic splines [3]. We use these approaches because we observe in our experiments minimal overhead and good accuracy (relative error < 10%) of the estimated execution times. While other methods can learn the execution behavior depending on more than one feature, they often need a large amount of time for updating approximation functions and computing estimations, see Section 4. In the case of the least squares method, FA(D) is a polynomial of degree n. In the case of cubic splines, FA(D) is a spline.
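To make the preceding definitions concrete, the following Python sketch shows one possible representation of measurement pairs, the measurement pair list MPLA, and a least squares approximation function FA(D). It is only an illustration of the definitions above: the class and function names are our own, the polynomial degree is an assumed tuning choice, and numpy.polyfit stands in for the least squares implementation of ALGLIB [2] used in our experiments.

    import time
    from dataclasses import dataclass, field
    from typing import Callable, List, Optional

    import numpy as np

    @dataclass
    class MeasurementPair:
        """A measurement pair MP = (D, Treal(A, D)); the dataset D is reduced to its size."""
        data_size: int
        real_time: float            # measured execution time Treal(A, D)

    @dataclass
    class Algorithm:
        """One algorithm A of an operation O together with its learned state."""
        name: str                                                   # e.g. a CPU or a GPU variant
        run: Callable[[object], object]                             # executes A on a dataset
        mpl: List[MeasurementPair] = field(default_factory=list)    # measurement pair list MPLA
        approximation: Optional[Callable[[int], float]] = None      # FA(D), learned below

        def estimate(self, data_size: int) -> float:
            """Test(A, D) = FA(D); returns 0.0 while no function has been learned yet."""
            return self.approximation(data_size) if self.approximation else 0.0

        def execute_and_measure(self, dataset) -> MeasurementPair:
            """Runs A, measures Treal(A, D), and stores the new measurement pair."""
            start = time.perf_counter()
            self.run(dataset)
            pair = MeasurementPair(len(dataset), time.perf_counter() - start)
            self.mpl.append(pair)
            return pair

    def learn_approximation(mpl, degree=2):
        """Least squares fit of a polynomial FA(D) over a measurement pair list."""
        sizes = np.array([mp.data_size for mp in mpl], dtype=float)
        times = np.array([mp.real_time for mp in mpl], dtype=float)
        coefficients = np.polyfit(sizes, times, deg=degree)
        return lambda data_size: float(np.polyval(coefficients, data_size))

    # The algorithm pool APO of an operation O is simply a list of Algorithm objects;
    # after the initial training phase: alg.approximation = learn_approximation(alg.mpl)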
Abstraction: The execution time of an algorithm is highly dependent on specific parameters of the given processing hardware. In practice, it is problematic to manage all of these parameters, and maintenance would become even more costly. Hence, we do not consider hardware specific parameters and let the model learn the execution behavior of every algorithm using its statistical method LA and the corresponding approximation function FA(D).

Model Components: The model is composed of three components. The first is the algorithm pool APO, which contains all available algorithms of an operation O. The second is the estimation component, which computes execution time estimations for every algorithm of an operation. The third is the decision component, which chooses the algorithm that best fits the specified optimization criterion. Depending on whether the chosen algorithm uses the CPU or the GPU, the corresponding operation is executed on the CPU or the GPU. In this paper, we only consider response time optimization. However, other optimization criteria like throughput can be added in future work. Figure 1 summarizes the model structure.

2.2 Estimation Component

This section describes the functionality of the estimation component. First, we discuss when the approximation functions of the algorithms should be updated. Second, we examine how the model can adapt to load changes.

2.2.1 Updating the Approximation Functions

For an operation O that will be applied to a dataset D, the estimation component computes an estimated execution time for each available algorithm of O. For this, we need an approximation function that can be used to compute estimations. Since we learn such functions with statistical methods, we need a number of observations for each algorithm to compute the corresponding approximation functions. After we have computed an initial function for each algorithm, our model is used to make decisions. Hence, the model operation can be divided into two phases: the initial training phase and the operational phase.

Since the load condition can change over time, the execution times of algorithms are likely to change as well. Hence, the model should provide a mechanism for adaption to load changes. Therefore, the model continuously collects measurement pairs for all executed algorithms of an operation.

There are two problems to solve:

1. Re-computation Problem: find a strategy for when to re-compute the approximation function of each algorithm, so that a good trade-off between accuracy and overhead is achieved.

2. Cleanup Problem: delete outdated measurement pairs from a measurement pair list, because the consideration of measurement pairs from past load conditions is likely to be less beneficial for estimation accuracy. Furthermore, every measurement pair needs storage space and results in higher processing times for the statistical methods.

There are different heuristics that deal with the re-computation of approximation functions. Zhang et al. present possible approaches in [15]; a small sketch of these trigger conditions follows after the list. These are:

1. Re-compute the approximation function every time a new measurement pair is added to the measurement pair list.

2. Re-compute the approximation function periodically.

3. Re-compute the approximation function if the estimation error exceeds a certain threshold.
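The three trigger strategies can be captured in a small helper. The sketch below is our own illustration; the period and the error threshold are assumed placeholder parameters, not values taken from [15].

    def should_recompute(strategy, pairs_since_update, relative_error,
                         period=20, error_threshold=0.10):
        """Decides whether FA(D) should be re-computed after a new measurement pair.

        strategy: "always" (1), "periodic" (2), or "error" (3).
        """
        if strategy == "always":
            return True
        if strategy == "periodic":
            return pairs_since_update >= period
        if strategy == "error":
            return relative_error > error_threshold
        raise ValueError("unknown strategy: " + strategy)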
[Figure 1: Information flow in the model. The operation O, the dataset D, and the optimization criterion are passed to the estimation component, which uses the algorithm pool {A1, A2, ..., An} of CPU and GPU algorithms and the measurement pair lists MPLA1, ..., MPLAn to compute the estimations Test(A1, D), ..., Test(An, D) for the decision component. The decision component selects an algorithm Ai; after its execution, the measurement pair MP = (D, Treal(Ai, D)) is fed back to the estimation component.]


As Zhang et al. point out, the most aggressive method (approach 1) is unlikely to achieve good results because the expected overhead for re-computing the approximation function counteracts possible performance gains. Hence, we have to choose between approaches 2 and 3. There is no guarantee that one approach causes less overhead than the other. On the one hand, periodic re-computation causes a predictable overhead, but it may re-compute approximation functions when it is not necessary, e.g., while the relative estimation error is still beneath an error threshold. On the other hand, an event based approach re-computes the approximation functions only if it is necessary, but has the disadvantage that the quality of estimations has to degrade before the approximation functions are re-computed. We choose the periodic approach because of its predictable overhead. We will consider the event based approach in future work.
Parameters: We now discuss the necessary parameters of the model. Let RCR be the re-computation rate of an algorithm. That means that after RCR measurement pairs have been added to the measurement pair list of an algorithm A, the approximation function FA(D) of A is re-computed. (For each operation, one measurement pair is added to the list of the selected algorithm.) Hence, the time measure used is the number of added measurement pairs. Let ITL be the initial training length, which is the number of measurement pairs that have to be collected for each algorithm before the model switches into its operational phase.

We now address the Cleanup Problem. We have to limit the number of measurement pairs in the measurement pair list to keep space and computation requirements low. Furthermore, we want to delete old measurement pairs from the list that no longer contribute sufficiently to our current estimation problem. These requirements are fulfilled by a ring buffer data structure. Let RBS be the ring buffer size in number of measurement pairs. If a new measurement pair is added to a full ring buffer, it overwrites the oldest measurement pair. Hence, the usage of a ring buffer solves the Cleanup Problem. RCR and RBS are related: if RCR were greater than RBS, there would be measurement pairs that are never considered for computing the approximation functions. Hence, RCR should be smaller than RBS.
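A minimal sketch of how RCR and RBS could interact, assuming the illustrative Algorithm and learn_approximation helpers from the sketch in Section 2.1; collections.deque with maxlen acts as the ring buffer of size RBS (and takes the place of the unbounded measurement pair list used there), and the concrete parameter values are placeholders.

    from collections import deque

    class MeasurementRingBuffer:
        """Ring buffer for MPLA: a full buffer overwrites its oldest measurement pair."""

        def __init__(self, rbs=100, rcr=20):      # placeholder sizes, with RCR < RBS
            self.pairs = deque(maxlen=rbs)        # deque drops the oldest entry when full
            self.rcr = rcr
            self._added_since_recompute = 0

        def add(self, measurement_pair):
            """Adds a pair and reports whether the periodic re-computation is due."""
            self.pairs.append(measurement_pair)
            self._added_since_recompute += 1
            if self._added_since_recompute >= self.rcr:
                self._added_since_recompute = 0
                return True
            return False

    # Used inside the self-tuning cycle (see below):
    #   if buffer.add(pair):
    #       algorithm.approximation = learn_approximation(list(buffer.pairs))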
Statistical Methods: The statistical methods used have to be computationally efficient when computing approximation functions and estimation values. Additionally, they should provide good estimations (relative estimation error < 10%). Since the times needed to compute estimations sum up over time, we consider a method computationally efficient if it can compute one estimation value in less than 50µs. Our experiments with ALGLIB [2] show that this is the case for the least squares method and cubic splines.

Self Tuning Cycle: The model performs the following self-tuning cycle during the operational phase:

1. Use the approximation functions to compute execution time estimations for all algorithms in the algorithm pool of operation O for the dataset D.

2. Select the algorithm with the minimal estimated response time.

3. Execute the selected algorithm and measure its execution time. Add the new measurement pair to the measurement pair list MPLA of the executed algorithm A.

4. If the new measurement pair is the RCR-th new pair in the list, then the approximation function of the corresponding algorithm is re-computed using the assigned statistical method.
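The following sketch ties the previous pieces together into one pass of the self-tuning cycle. It reuses the illustrative Algorithm, learn_approximation, and MeasurementRingBuffer helpers from the earlier sketches and additionally assumes that every algorithm carries such a ring buffer in a buffer attribute; it is not the paper's implementation.

    def self_tuning_step(algorithm_pool, dataset):
        """One pass of the self-tuning cycle for an operation O and a dataset D."""
        data_size = len(dataset)

        # 1. Compute Test(A, D) for every algorithm in the pool APO.
        estimates = {alg.name: alg.estimate(data_size) for alg in algorithm_pool}

        # 2. Select the algorithm with the minimal estimated response time.
        chosen = min(algorithm_pool, key=lambda alg: estimates[alg.name])

        # 3. Execute it, measure Treal(A, D), and store the new measurement pair.
        pair = chosen.execute_and_measure(dataset)

        # 4. Re-compute FA(D) after every RCR-th newly added measurement pair.
        if chosen.buffer.add(pair):   # buffer: assumed MeasurementRingBuffer per algorithm
            chosen.approximation = learn_approximation(list(chosen.buffer.pairs))

        return chosen.name, pair.real_time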
2.2.2 Adaption to Load Changes

This section focuses on the adaption of the model to load changes. We start with the necessary assumptions and then discuss how the model can adapt to new load conditions.

Let ACPU be a CPU algorithm and AGPU a GPU algorithm for the same operation O. The basic assumptions are:

1. Every algorithm is executed on a regular basis, even in overload conditions. This ensures a continuing supply of new measurement pairs. If this assumption holds, a gradual adaption of the approximation functions to a changed load condition is possible.

2. A change in the load condition has to last a certain amount of time; otherwise the model cannot adapt and delivers vaguer execution time estimations. If this assumption does not hold, it is not possible to continuously deliver good estimations.

3. The load conditions of CPU and GPU are not related.

As long as these assumptions are fulfilled, the estimated execution time curves approach the real execution time curves if the load changes. (An execution time curve is the graphical representation of all execution times an algorithm needs to process all datasets in a workload.) An increase or decrease of the load condition on the CPU side is symmetric to an equivalent change on the GPU side. For simplicity, but without loss of generality, we discuss the load adaption mechanism for the CPU side.
[Figure 2: Example of an increased load condition on the CPU side; execution time over data size for the curves Treal(AGPU, D), Treal(ACPU, D), and Test(ACPU, D).]

[Figure 3: Example of a strong increase of load on the CPU side; same curves as in Figure 2.]


Increasing Load Condition: If the load condition increases on the CPU side, then the execution times of CPU algorithms increase and the real execution time curves are shifted upwards. In contrast, the estimated execution time curves stay as they are. We illustrate this situation in Figure 2. The model is adapted to the new load condition by collecting new measurement pairs and re-computing the approximation function. This shifts the estimated execution time curve in the direction of the measured execution time curve. Hence, the estimations become more precise. After at most RCR newly added measurement pairs for one algorithm, the estimated execution time curve approaches the real execution time curve. This implies the assumption that a re-computation of approximation functions never has a negative impact on the estimation accuracy.

Decreasing Load Condition: The case of a decreasing load is mostly symmetric to the case of an increasing load. A decreased load on the CPU side causes shorter execution times of CPU algorithms. The consequence is that the real execution time curves are shifted downwards, whereas the estimated execution time curves stay as they are.

Limits of the Adaption: The described load adaption scheme can break if the load on the CPU side increases or decreases too much. We consider the case where the load increases too much; the other case is symmetric. We illustrate this case in Figure 3: the real execution time curve of ACPU lies above the real execution time curve of AGPU. If this state persists for a certain amount of time, the estimated execution time curves will approach the real execution time curves. If the load condition then normalizes again, only algorithm AGPU is executed, regardless of datasets that could be processed faster by algorithm ACPU. The model would then be stuck in an erroneous state, since it could no longer make response time minimal decisions for operation O.
However, this cannot happen due to the assumption that every algorithm is executed on a regular basis, even in overload conditions. Hence, a continuing supply of new measurement pairs is ensured, which allows the adaption of the model to the current load condition.

2.3 Decision Component

In this section, we describe the decision component of our model. Due to limited space, we introduce only one optimization criterion, namely response time, but other criteria like throughput are possible.

2.3.1 Response Time Optimization

If we optimize the operation execution for response time, we want to select the algorithm with the minimal execution time for the dataset D. By choosing the algorithm with the minimal execution time, we distribute the operations to CPU and GPU in a response time minimal manner. There is one decision per operation, and it considers the size of the dataset that has to be processed.

Definition: Let a workload W be a tuple W = (DS, O), where DS = {D1, D2, ..., Dn} is a set of datasets Di that are to be processed and O is the operation to be executed. (For simplicity, we only consider one operation per workload. However, a set of operations is more realistic and can be added in future work.)

Goal: The goal is to choose the fastest algorithm Aj ∈ APO for every dataset Di ∈ DS for the execution of operation O:

  min Σ_{Di ∈ DS} Treal(Ak, Di)   with Ak ∈ APO                            (1)

Usage: The algorithm with the minimal estimated execution time for the dataset D is chosen for execution. The function choose_Algorithm (choose_Alg) chooses an algorithm according to the optimization criterion:

  choose_Alg(D, O) = Aj  with  Test(Aj, D) = min({Test(Ak, D) | Ak ∈ APO})  (2)

It is expected that the accuracy of the estimated execution times has a large impact on the decisions and on the model quality.

Practical Use: The execution of the fastest algorithm reduces the response time of the system and results in better performance of a DBS. This response time optimization is necessary to accelerate time critical tasks. (A task is an operation in execution.) Furthermore, it is possible to automatically fine-tune the algorithm selection on a specific hardware configuration.
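A compact sketch of formula (2), assuming the illustrative Algorithm objects and estimate() helper from the earlier sketches; choose_alg is our own name for the choose_Alg function defined above.

    def choose_alg(dataset, algorithm_pool):
        """Formula (2): return the algorithm Aj with the minimal Test(Aj, D)."""
        data_size = len(dataset)
        return min(algorithm_pool, key=lambda alg: alg.estimate(data_size))

    # The chosen algorithm implicitly decides whether the operation runs on the
    # CPU or on the GPU, because each algorithm targets exactly one device:
    #   result = choose_alg(dataset, pool).run(dataset)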
3. MODEL QUALITY CRITERIA

In this section, we present four model quality measures, namely the average percentage estimation error, the hit rate, the model quality, and the percentage speed increase.

3.1 Average Percentage Estimation Error

The idea of the average percentage estimation error is to compute the estimation error for each executed algorithm and the corresponding estimation value. Then, the absolute percentage estimation error is computed. Finally, the average of all computed percentage estimation errors is computed. This measure is also called the relative error and is used, e.g., in [1].

3.2 Hit Rate

The idea of this measure is to compute the ratio of the number of correct decisions to the total number of decisions. A decision is correct if and only if the model decided for the fastest algorithm. In the ideal case all decisions are correct, so the hit rate is 100%. In the worst case all decisions are wrong, so the hit rate is 0%. The benefit of the hit rate is that it provides statistical information about how many decisions are wrong: if the hit rate is X, then on average every 1/(1-X)-th decision is incorrect. However, we cannot quantify the impact of an incorrect decision. This is addressed by the model quality. Note that algorithms with very similar execution time curves can lead to a bad hit rate, although the overall performance is acceptable.
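Both measures can be computed directly from logged decisions. The sketch below assumes a simple log of tuples (Test of the chosen algorithm, Treal of the chosen algorithm, fastest Treal among all algorithms for that dataset); this log format is our own construction, not part of the paper.

    def average_percentage_estimation_error(log):
        """Relative error: mean of |Treal - Test| / Treal, in percent."""
        errors = [abs(real - est) / real for est, real, _ in log]
        return 100.0 * sum(errors) / len(errors)

    def hit_rate(log):
        """Share of decisions for which the chosen algorithm was indeed the fastest."""
        hits = sum(1 for _, real, best_real in log if real <= best_real)
        return 100.0 * hits / len(log)

    # log entries: (Test(chosen, D), Treal(chosen, D), min over all Ak of Treal(Ak, D))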
3.3 Model Quality

The idea of the model quality is to compute the ratio of the execution times that a system using an ideal model and a system using the real model need to process a workload W. In the ideal case, the real model is as good as the ideal model; hence, the model quality is 100%. The worst case model would always select the slowest algorithm and provides a lower bound for the model quality.

Let TDM(W) be the time a system using a decision model (DM) needs to process a workload W. It is computed as the sum of all algorithm execution times resulting from the algorithm selections of the decision model for the workload, plus the overhead caused by DM. Let TOh(DM, W) be the overhead the decision model DM causes. If a decision model introduces too much overhead, it can eliminate its own gain. Hence, the overhead has to be considered; it is the sum of the total time needed for computing estimation values and the total time needed to re-compute approximation functions. The time needed for all estimation value computations (EVC) for a workload W is TEVC(W), and the time needed to re-compute the approximation functions of all algorithms is TRC(W). Both measures are highly dependent on DM. Hence, we get the following formulas:

  TOh(DM, W) = TEVC(W) + TRC(W)                                    (3)

  TDM(W) = Σ_{D ∈ DS} Treal(choose_AlgDM(D, O), D) + TOh(DM, W)    (4)

Let DMideal be the ideal decision model and let DMreal be the real decision model. Then, the model quality MQ is defined as:

  MQ(W, DMreal) = TDMideal(W) / TDMreal(W)                         (5)

This measure describes to which degree the optimization goal described in formula (1) is achieved. Note that the ideal model does not introduce overhead (TOh(DMideal, W) = 0).

3.4 Percentage Speed Increase

The idea of the percentage speed increase (PSI) is to quantify the performance gain if a decision model DMi is replaced with a decision model DMj:

  PSI(DMi → DMj, W) = (TDMi(W) - TDMj(W)) / TDMi(W)                (6)

If PSI(DMi → DMj, W) is greater than zero, DMj performed better than DMi, and vice versa.
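To show how formulas (3) to (6) fit together, the following sketch computes TDM(W), the model quality, and the percentage speed increase from per-decision timings; the list-based inputs are our own simplification of the definitions above.

    def total_time(real_times, overhead=0.0):
        """Formula (4): TDM(W) = sum of the chosen algorithms' real times + TOh(DM, W)."""
        return sum(real_times) + overhead

    def model_quality(t_ideal, t_real):
        """Formula (5): MQ(W, DMreal) = TDMideal(W) / TDMreal(W); 1.0 means ideal."""
        return t_ideal / t_real

    def percentage_speed_increase(t_dm_i, t_dm_j):
        """Formula (6): greater than zero means DMj processed the workload W faster."""
        return (t_dm_i - t_dm_j) / t_dm_i

    # Example: the ideal model picks the per-dataset minimum and causes no overhead.
    #   t_ideal = total_time(best_real_times)                            # TOh(DMideal, W) = 0
    #   t_real  = total_time(chosen_real_times, overhead=t_evc + t_rc)   # formula (3)
    #   print(model_quality(t_ideal, t_real), percentage_speed_increase(t_real, t_ideal))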
4. RELATED WORK

In this section, we present the related work. We consider analytical models for estimating the execution times of GPU algorithms. Furthermore, we discuss learning based approaches for estimating execution times. Finally, we present existing decision models and provide a comparison to our approach.

Analytical Models: Hong et al. present an analytical model that estimates the execution times of massively parallel programs [8]. The basic idea is to estimate the number of parallel memory requests. The approach uses information like the number of running threads and the memory bandwidth. Depending on the case study, relative estimation errors between 5.4% and 13.3% are observed.

Zhang et al. develop a model that is intended to help optimize GPU programs by means of qualitative performance analyses [16]. They use a micro benchmark-based approach that needs hardware specific parameters, e.g., the number of processors or the clock rate. The model cannot adapt to load changes and is therefore not considered for use in our approach. The observed relative estimation error lies between 5% and 15%.

Learning based Execution Time Estimation: Akdere et al. investigate modeling techniques for analytical workloads [1]. They estimate the execution behavior of queries at different granularities and present modeling approaches for estimation at the query level and at the operation level. The basic idea of their approach is to perform a feature extraction on queries and to compute execution time estimations based on these features.

Matsunaga et al. present a short overview of machine learning algorithms and their fields of application [11]. Their goal is the estimation of the resource usage of an application; one considered resource is the execution time. The developed method PQR2 needs a few milliseconds to compute an estimation. Since we need n·m computations for n datasets and m algorithms, the time for a single computation should be less than 50µs to keep the overhead at an absolute minimum. Hence, it remains to be investigated whether the approach of Matsunaga et al. achieves good results for a response time minimal operation distribution. This can be done in future work.
Zhang et al. present a model for predicting the costs of complex XML queries [15]. They use a statistical learning method called the "transform regression technique". The approach allows a self-tuning query optimizer that can dynamically adapt to load condition changes. The approach of Zhang et al. and our model are similar in their basic functionality. The difference to our approach is that Zhang et al. estimate the execution times of XML queries, whereas our approach estimates the execution times of single database operations in order to dynamically distribute them on CPU and GPU. Additionally, we use a different learning method that is optimized for computational efficiency.

Decision Models: Kerr et al. present a model that predicts the performance of similar application classes for different processors [10]. The approach allows choosing between a CPU and a GPU implementation. This choice is made statically, in contrast to our work, where an algorithm for an operation execution is chosen dynamically at runtime. Kerr et al. use only parameters that are statically known before program execution. Hence, their approach allows no adaption to load changes, in contrast to our model.

Iverson et al. develop an approach that estimates the execution times of tasks in the context of distributed systems [9]. The approach, similar to our model, does not require hardware specific information. They use the learning method k-nearest-neighbor, a non-parametric regression method. We use least squares and cubic splines, which are parametric regression methods and need less time for computing estimations than non-parametric regression methods. The goal of Iverson et al. is an optimal selection of the nodes in a distributed system on which tasks are executed. Unlike this approach, our work aims at optimal algorithm selection. It is possible to apply the approach of Iverson et al. to a hybrid CPU/GPU platform. However, we consider the GPU a coprocessor of the CPU. Hence, a CPU/GPU platform is not a distributed system from our point of view. In this sense, our model is less general than the model of Iverson et al.

5. CONCLUSION

In this paper, we addressed a current research problem, namely the optimal distribution of database operations on hybrid CPU/GPU platforms. We developed a self-tuning decision model that is able to distribute database operations on CPUs and GPUs in a response time minimal manner. We discussed the basic structure of the model and provided a qualitative argumentation of how the model works. Additionally, we presented suitable model quality measures, which are required to evaluate our model. Our experiments show that our model is almost as good as the ideal model if the model parameters are set appropriately. We omit the evaluation due to limited space. We conclude that distributing database operations on CPUs and GPUs has a large optimization potential. We believe our model is a further step towards addressing this issue.

In ongoing research, we use the defined quality measures to evaluate our model on suitable use cases. A further possible extension is using the model in a more general context where response time optimal decisions have to be made, e.g., optimal index usage. Furthermore, an extension of the model from operations to queries is necessary for better applicability of our model. This leads to the problem of hybrid query plan optimization.

6. REFERENCES

[1] M. Akdere and U. Çetintemel. Learning-based query performance modeling and prediction. IEEE, 2012.
[2] ALGLIB Project. ALGLIB. http://www.alglib.net/, 2012. [Online; accessed 05-January-2012].
[3] A. Ralston and P. Rabinowitz. A First Course in Numerical Analysis. Dover Publications, second edition, 2001.
[4] P. Bakkum and K. Skadron. Accelerating SQL database operations on a GPU with CUDA. GPGPU '10, pages 94-103, New York, NY, USA, 2010. ACM.
[5] N. K. Govindaraju, B. Lloyd, W. Wang, M. Lin, and D. Manocha. Fast computation of database operations using graphics processors. SIGMOD '04, pages 215-226, New York, NY, USA, 2004. ACM.
[6] B. He, M. Lu, K. Yang, R. Fang, N. K. Govindaraju, Q. Luo, and P. V. Sander. Relational query coprocessing on graphics processors. ACM Trans. Database Syst., 34:21:1-21:39, December 2009.
[7] B. He, K. Yang, R. Fang, M. Lu, N. Govindaraju, Q. Luo, and P. Sander. Relational joins on graphics processors. SIGMOD '08, pages 511-524, New York, NY, USA, 2008. ACM.
[8] S. Hong and H. Kim. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. ACM SIGARCH Computer Architecture News, 37(3):152, 2009.
[9] M. A. Iverson, F. Ozguner, and G. J. Follen. Run-time statistical estimation of task execution times for heterogeneous distributed computing. HPDC '96, pages 263-270, Washington, DC, USA, 1996. IEEE Computer Society.
[10] A. Kerr, G. Diamos, and S. Yalamanchili. Modeling GPU-CPU workloads and systems. GPGPU '10, pages 31-42, New York, NY, USA, 2010. ACM.
[11] A. Matsunaga and J. A. B. Fortes. On the use of machine learning to predict the time and resources consumed by applications. CCGRID, pages 495-504, 2010.
[12] NVIDIA. NVIDIA CUDA C Programming Guide. http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf, Version 4.0, 2012. Pages 1-5. [Online; accessed 1-February-2012].
[13] H. Pirk, S. Manegold, and M. Kersten. Accelerating foreign-key joins using asymmetric memory channels. ADMS '11, pages 585-597. VLDB Endowment, 2011.
[14] S. Walkowiak, K. Wawruch, M. Nowotka, L. Ligowski, and W. Rudnicki. Exploring utilisation of GPU for database applications. Procedia Computer Science, 1(1):505-513, 2010.
[15] N. Zhang, P. J. Haas, V. Josifovski, G. M. Lohman, and C. Zhang. Statistical learning techniques for costing XML queries. VLDB '05, pages 289-300. VLDB Endowment, 2005.
[16] Y. Zhang and J. D. Owens. A quantitative performance analysis model for GPU architectures. Computer Engineering, pages 382-393, 2011.