        Self-Tuning Distribution of DB-Operations on Hybrid
                        CPU/GPU Platforms

Sebastian Breß (bress@iti.cs.uni-magdeburg.de)
Siba Mohammad (siba.mohammad@st.ovgu.de)
Eike Schallehn (eike@iti.cs.uni-magdeburg.de)
Otto-von-Guericke University Magdeburg

24th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 29.05.2012 - 01.06.2012, Lübbenau, Germany. Copyright is held by the author/owner(s).


ABSTRACT

A current research trend focuses on accelerating database operations with the help of GPUs (Graphics Processing Units). Since GPU algorithms are not necessarily faster than their CPU counterparts, it is important to use them only if they outperform their CPU counterparts. In this paper, we address this problem by constructing a decision model for a framework that is able to distribute database operations on CPUs and GPUs in a response time minimal manner. Furthermore, we discuss the quality measures necessary for evaluating our model.

1. INTRODUCTION

In the context of database tuning, there are many different approaches for performance optimization. A new opportunity for optimization was introduced with General Purpose Computation on Graphics Processing Units (GPGPU) [12]. This approach allows applications suited for parallel processing to be sped up with the help of GPUs. Parallel computing architectures like the Compute Unified Device Architecture (CUDA) [12] make it possible to program a GPU almost as simply as a CPU. This technology opens a new branch of research that focuses on accelerating applications using GPUs.

CPUs are optimized for low response time, meaning that they execute one task as fast as possible. GPUs are optimized for high throughput, meaning that they execute as many tasks as possible in a fixed time. This is accomplished by massively parallel execution of programs using multithreading. Furthermore, GPUs specialize in compute-intensive tasks, which are typical for graphics applications. Additionally, tasks with much control flow decrease performance on GPUs, but can be handled well by CPUs [12]. Consequently, database operations benefit to different degrees from using the GPU. Aggregations are well suited for GPU usage, whereas selections should be outsourced with care. He et al. observed that selections are 2-3 times slower on the GPU than on the CPU [6].

1.1 New Research Trend

A new research trend focuses on speeding up database operations by performing them on GPUs [4, 6, 13, 14]. He et al. present the concept and implementation of relational joins on GPUs [6, 7]. Pirk et al. develop an approach to accelerate indexed foreign key joins with GPUs [13]. The foreign keys are streamed over the PCIe bus while random lookups are performed on the GPU. Walkowiak et al. discuss the usability of GPUs for databases [14]. They show the applicability on the basis of an n-gram based text search engine. Bakkum et al. develop a concept and implementation of the SQLite command processor on the GPU [4]. The main target of their work is the acceleration of a subset of possible SQL queries. Govindaraju et al. present an approach to accelerate selections and aggregations with the help of GPUs [5]. From these examples, we conclude that a GPU can be an effective coprocessor for the execution of database operations.

1.2 Problem Statement

We assume that an operation is executed by an algorithm that uses either the CPU or the GPU. For which operations and data is it efficient to execute database operations on a GPU? A GPU algorithm can be faster or slower than its CPU counterpart. Therefore, the GPU should be used in a meaningful way to achieve maximal performance. That means a GPU algorithm should only be executed if it is expected to be faster than its CPU counterpart.

We do not know a priori which processing device (CPU or GPU) is faster for which datasets in a given hardware configuration. We have to take different requirements into account to determine the fastest processing device:

• the operation to be executed,
• the size of the dataset to process,
• the processing power of CPU and GPU (number of processor cores, clock rate), and
• the current load condition on CPU and GPU.

We discuss the following contributions:

1. A response time minimal distribution of database operations on CPUs and GPUs can achieve shorter execution times for operations and speed up query execution. Hence, this paper presents the basic idea of how such a model can be constructed.

2. For the evaluation of our model, we define appropriate quality measures.
To the best of our knowledge, there is no self-tuning decision model that can distribute database operations on CPUs and GPUs in a response time minimal manner.

The remainder of the paper is structured as follows. In Section 2, we present our decision model. Then, we discuss the model quality measures needed for evaluation in Section 3. Section 4 provides background on related work. Finally, we present our conclusion and future work.
2. DECISION MODEL

In this section, we present our decision model. First, we discuss the basic model. Then, we explain the details of the estimation and decision components.

2.1 Basic Model

We construct a model that is able to choose the response time minimal algorithm from a set of algorithms for processing a dataset D. We store observations of past algorithm executions and use statistical methods to interpolate future execution times. We choose the algorithm with the smallest estimated execution time for processing a dataset.

Definitions: Let O be a database operation and let APO = {A1, .., Am} be an algorithm pool for operation O, i.e., a set of algorithms that is available for the execution of O. We assume that every algorithm has different performance characteristics. Hence, the execution times different algorithms need to process a dataset D are likely to vary. A dataset D provides characteristic features of the input data. In this paper, we consider only the data size. Other characteristics, e.g., data distribution or data types, will be considered in future work.

Let TA(D) be the execution time of an algorithm A to process the dataset D. We do not consider hardware specific parameters like clock rate and number of processor cores to estimate execution times. Instead, we learn the execution behavior of every algorithm. For this purpose, we assign to every algorithm A a learning method LA and a corresponding approximation function FA(D). Let Test(A, D) = FA(D) be the estimated execution time of algorithm A for a dataset D. A measured execution time is referred to as Treal(A, D). Let a measurement pair (MP) be a tuple (D, Treal(A, D)). Let MPLA be a measurement pair list containing all current measurement pairs of algorithm A.
Statistical Methods: We consider the following statistical methods for the approximation of the execution behavior of all algorithms: the least squares method and spline interpolation with cubic splines [3]. We use these approaches because we observe in our experiments minimal overhead and good accuracy (relative error < 10%) of the estimated execution times. While other methods can learn the execution behavior depending on more than one feature, they often need a large amount of time for updating approximation functions and computing estimations, see Section 4. In the case of the least squares method, FA(D) is a polynomial of degree n. In the case of cubic splines, FA(D) is a spline.
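To make the preceding definitions concrete, the following Python sketch shows one possible representation of measurement pairs, the measurement pair list MPLA, and a least squares approximation function FA(D). It is only an illustration of the definitions above: the class and function names are our own, the polynomial degree is an assumed tuning choice, and numpy.polyfit stands in for the least squares implementation of ALGLIB [2] used in our experiments.

    import time
    from dataclasses import dataclass, field
    from typing import Callable, List, Optional

    import numpy as np

    @dataclass
    class MeasurementPair:
        """A measurement pair MP = (D, Treal(A, D)); the dataset D is reduced to its size."""
        data_size: int
        real_time: float            # measured execution time Treal(A, D)

    @dataclass
    class Algorithm:
        """One algorithm A of an operation O together with its learned state."""
        name: str                                                   # e.g. a CPU or a GPU variant
        run: Callable[[object], object]                             # executes A on a dataset
        mpl: List[MeasurementPair] = field(default_factory=list)    # measurement pair list MPLA
        approximation: Optional[Callable[[int], float]] = None      # FA(D), learned below

        def estimate(self, data_size: int) -> float:
            """Test(A, D) = FA(D); returns 0.0 while no function has been learned yet."""
            return self.approximation(data_size) if self.approximation else 0.0

        def execute_and_measure(self, dataset) -> MeasurementPair:
            """Runs A, measures Treal(A, D), and stores the new measurement pair."""
            start = time.perf_counter()
            self.run(dataset)
            pair = MeasurementPair(len(dataset), time.perf_counter() - start)
            self.mpl.append(pair)
            return pair

    def learn_approximation(mpl, degree=2):
        """Least squares fit of a polynomial FA(D) over a measurement pair list."""
        sizes = np.array([mp.data_size for mp in mpl], dtype=float)
        times = np.array([mp.real_time for mp in mpl], dtype=float)
        coefficients = np.polyfit(sizes, times, deg=degree)
        return lambda data_size: float(np.polyval(coefficients, data_size))

    # The algorithm pool APO of an operation O is simply a list of Algorithm objects;
    # after the initial training phase: alg.approximation = learn_approximation(alg.mpl)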
Abstraction: The execution time of an algorithm is highly dependent on specific parameters of the given processing hardware. In practice, it is problematic to manage all of these parameters, and maintenance would become even more costly. Hence, we do not consider hardware specific parameters and let the model learn the execution behavior of every algorithm using its statistical method LA and the corresponding approximation function FA(D).

Model Components: The model is composed of three components. The first is the algorithm pool APO, which contains all available algorithms of an operation O. The second is the estimation component, which computes execution time estimations for every algorithm of an operation. The third is the decision component, which chooses the algorithm that best fits the specified optimization criterion. Depending on whether the chosen algorithm uses the CPU or the GPU, the corresponding operation is executed on the CPU or the GPU. In this paper, we only consider response time optimization. However, other optimization criteria like throughput can be added in future work. Figure 1 summarizes the model structure.

2.2 Estimation Component

This section describes the functionality of the estimation component. First, we discuss when the approximation functions of the algorithms should be updated. Second, we examine how the model can adapt to load changes.

2.2.1 Updating the Approximation Functions

For an operation O that will be applied to a dataset D, the estimation component computes an estimated execution time for each available algorithm of O. For this, we need an approximation function that can be used to compute estimations. Since we learn such functions with statistical methods, we need a number of observations for each algorithm to compute the corresponding approximation functions. After we have computed an initial function for each algorithm, our model is used to make decisions. Hence, the model operation can be divided into two phases: the initial training phase and the operational phase.

Since the load condition can change over time, the execution times of algorithms are likely to change as well. Hence, the model should provide a mechanism for adaption to load changes. Therefore, the model continuously collects measurement pairs for all executed algorithms of an operation.

There are two problems to solve:

1. Re-computation Problem: find a strategy for when to re-compute the approximation function of each algorithm, so that a good trade-off between accuracy and overhead is achieved.

2. Cleanup Problem: delete outdated measurement pairs from a measurement pair list, because the consideration of measurement pairs from past load conditions is likely to be less beneficial for estimation accuracy. Furthermore, every measurement pair needs storage space and results in higher processing times for the statistical methods.

There are different heuristics that deal with the re-computation of approximation functions. Zhang et al. present possible approaches in [15]; a small sketch of these trigger conditions follows after the list. These are:

1. Re-compute the approximation function every time a new measurement pair is added to the measurement pair list.

2. Re-compute the approximation function periodically.

3. Re-compute the approximation function if the estimation error exceeds a certain threshold.
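The three trigger strategies can be captured in a small helper. The sketch below is our own illustration; the period and the error threshold are assumed placeholder parameters, not values taken from [15].

    def should_recompute(strategy, pairs_since_update, relative_error,
                         period=20, error_threshold=0.10):
        """Decides whether FA(D) should be re-computed after a new measurement pair.

        strategy: "always" (1), "periodic" (2), or "error" (3).
        """
        if strategy == "always":
            return True
        if strategy == "periodic":
            return pairs_since_update >= period
        if strategy == "error":
            return relative_error > error_threshold
        raise ValueError("unknown strategy: " + strategy)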
[Figure 1: Information flow in the model. The operation O, the dataset D, and the optimization criterion are passed to the estimation component, which uses the algorithm pool {A1, A2, ..., An} of CPU and GPU algorithms and the measurement pair lists MPLA1, ..., MPLAn to compute the estimations Test(A1, D), ..., Test(An, D) for the decision component. The decision component selects an algorithm Ai; after its execution, the measurement pair MP = (D, Treal(Ai, D)) is fed back to the estimation component.]


As Zhang et al. point out, the most aggressive method (approach 1) is unlikely to achieve good results because the expected overhead for re-computing the approximation function counteracts possible performance gains. Hence, we have to choose between approaches 2 and 3. There is no guarantee that one approach causes less overhead than the other. On the one hand, periodic re-computation causes a predictable overhead, but it may re-compute approximation functions when it is not necessary, e.g., while the relative estimation error is still beneath an error threshold. On the other hand, an event based approach re-computes the approximation functions only if it is necessary, but has the disadvantage that the quality of estimations has to degrade before the approximation functions are re-computed. We choose the periodic approach because of its predictable overhead. We will consider the event based approach in future work.
Parameters: We now discuss the necessary parameters of the model. Let RCR be the re-computation rate of an algorithm. That means that after RCR measurement pairs have been added to the measurement pair list of an algorithm A, the approximation function FA(D) of A is re-computed. (For each operation, one measurement pair is added to the list of the selected algorithm.) Hence, the time measure used is the number of added measurement pairs. Let ITL be the initial training length, which is the number of measurement pairs that have to be collected for each algorithm before the model switches into its operational phase.

We now address the Cleanup Problem. We have to limit the number of measurement pairs in the measurement pair list to keep space and computation requirements low. Furthermore, we want to delete old measurement pairs from the list that no longer contribute sufficiently to our current estimation problem. These requirements are fulfilled by a ring buffer data structure. Let RBS be the ring buffer size in number of measurement pairs. If a new measurement pair is added to a full ring buffer, it overwrites the oldest measurement pair. Hence, the usage of a ring buffer solves the Cleanup Problem. RCR and RBS are related: if RCR were greater than RBS, there would be measurement pairs that are never considered for computing the approximation functions. Hence, RCR should be smaller than RBS.
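A minimal sketch of how RCR and RBS could interact, assuming the illustrative Algorithm and learn_approximation helpers from the sketch in Section 2.1; collections.deque with maxlen acts as the ring buffer of size RBS (and takes the place of the unbounded measurement pair list used there), and the concrete parameter values are placeholders.

    from collections import deque

    class MeasurementRingBuffer:
        """Ring buffer for MPLA: a full buffer overwrites its oldest measurement pair."""

        def __init__(self, rbs=100, rcr=20):      # placeholder sizes, with RCR < RBS
            self.pairs = deque(maxlen=rbs)        # deque drops the oldest entry when full
            self.rcr = rcr
            self._added_since_recompute = 0

        def add(self, measurement_pair):
            """Adds a pair and reports whether the periodic re-computation is due."""
            self.pairs.append(measurement_pair)
            self._added_since_recompute += 1
            if self._added_since_recompute >= self.rcr:
                self._added_since_recompute = 0
                return True
            return False

    # Used inside the self-tuning cycle (see below):
    #   if buffer.add(pair):
    #       algorithm.approximation = learn_approximation(list(buffer.pairs))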
Statistical Methods: The statistical methods used have to be computationally efficient when computing approximation functions and estimation values. Additionally, they should provide good estimations (relative estimation error < 10%). Since the times needed to compute estimations sum up over time, we consider a method computationally efficient if it can compute one estimation value in less than 50µs. Our experiments with ALGLIB [2] show that this is the case for the least squares method and cubic splines.

Self Tuning Cycle: The model performs the following self-tuning cycle during the operational phase:

1. Use the approximation functions to compute execution time estimations for all algorithms in the algorithm pool of operation O for the dataset D.

2. Select the algorithm with the minimal estimated response time.

3. Execute the selected algorithm and measure its execution time. Add the new measurement pair to the measurement pair list MPLA of the executed algorithm A.

4. If the new measurement pair is the RCR-th new pair in the list, then the approximation function of the corresponding algorithm is re-computed using the assigned statistical method.
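The following sketch ties the previous pieces together into one pass of the self-tuning cycle. It reuses the illustrative Algorithm, learn_approximation, and MeasurementRingBuffer helpers from the earlier sketches and additionally assumes that every algorithm carries such a ring buffer in a buffer attribute; it is not the paper's implementation.

    def self_tuning_step(algorithm_pool, dataset):
        """One pass of the self-tuning cycle for an operation O and a dataset D."""
        data_size = len(dataset)

        # 1. Compute Test(A, D) for every algorithm in the pool APO.
        estimates = {alg.name: alg.estimate(data_size) for alg in algorithm_pool}

        # 2. Select the algorithm with the minimal estimated response time.
        chosen = min(algorithm_pool, key=lambda alg: estimates[alg.name])

        # 3. Execute it, measure Treal(A, D), and store the new measurement pair.
        pair = chosen.execute_and_measure(dataset)

        # 4. Re-compute FA(D) after every RCR-th newly added measurement pair.
        if chosen.buffer.add(pair):   # buffer: assumed MeasurementRingBuffer per algorithm
            chosen.approximation = learn_approximation(list(chosen.buffer.pairs))

        return chosen.name, pair.real_time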
2.2.2 Adaption to Load Changes

This section focuses on the adaption of the model to load changes. We start with the necessary assumptions and then discuss how the model can adapt to new load conditions.

Let ACPU be a CPU algorithm and AGPU a GPU algorithm for the same operation O. The basic assumptions are:

1. Every algorithm is executed on a regular basis, even in overload conditions. This ensures a continuing supply of new measurement pairs. If this assumption holds, a gradual adaption of the approximation functions to a changed load condition is possible.

2. A change in the load condition has to last a certain amount of time; otherwise the model cannot adapt and delivers vaguer execution time estimations. If this assumption does not hold, it is not possible to continuously deliver good estimations.

3. The load conditions of CPU and GPU are not related.

As long as these assumptions are fulfilled, the estimated execution time curves approach the real execution time curves if the load changes. (An execution time curve is the graphical representation of all execution times an algorithm needs to process all datasets in a workload.) An increase or decrease of the load condition on the CPU side is symmetric to an equivalent change on the GPU side. For simplicity, but without loss of generality, we discuss the load adaption mechanism for the CPU side.
[Figure 2: Example of an increased load condition on the CPU side; execution time over data size for the curves Treal(AGPU, D), Treal(ACPU, D), and Test(ACPU, D).]

[Figure 3: Example of a strong increase of load on the CPU side; same curves as in Figure 2.]


Increasing Load Condition: If the load condition increases on the CPU side, then the execution times of CPU algorithms increase and the real execution time curves are shifted upwards. In contrast, the estimated execution time curves stay as they are. We illustrate this situation in Figure 2. The model is adapted to the new load condition by collecting new measurement pairs and re-computing the approximation function. This shifts the estimated execution time curve in the direction of the measured execution time curve. Hence, the estimations become more precise. After at most RCR newly added measurement pairs for one algorithm, the estimated execution time curve approaches the real execution time curve. This implies the assumption that a re-computation of approximation functions never has a negative impact on the estimation accuracy.

Decreasing Load Condition: The case of a decreasing load is mostly symmetric to the case of an increasing load. A decreased load on the CPU side causes shorter execution times of CPU algorithms. The consequence is that the real execution time curves are shifted downwards, whereas the estimated execution time curves stay as they are.

Limits of the Adaption: The described load adaption scheme can break if the load on the CPU side increases or decreases too much. We consider the case where the load increases too much; the other case is symmetric. We illustrate this case in Figure 3: the real execution time curve of ACPU lies above the real execution time curve of AGPU. If this state persists for a certain amount of time, the estimated execution time curves will approach the real execution time curves. If the load condition then normalizes again, only algorithm AGPU is executed, regardless of datasets that could be processed faster by algorithm ACPU. The model would then be stuck in an erroneous state, since it could no longer make response time minimal decisions for operation O.
However, this cannot happen due to the assumption that every algorithm is executed on a regular basis, even in overload conditions. Hence, a continuing supply of new measurement pairs is ensured, which allows the adaption of the model to the current load condition.

2.3 Decision Component

In this section, we describe the decision component of our model. Due to limited space, we introduce only one optimization criterion, namely response time, but other criteria like throughput are possible.

2.3.1 Response Time Optimization

If we optimize the operation execution for response time, we want to select the algorithm with the minimal execution time for the dataset D. By choosing the algorithm with the minimal execution time, we distribute the operations to CPU and GPU in a response time minimal manner. There is one decision per operation, and it considers the size of the dataset that has to be processed.

Definition: Let a workload W be a tuple W = (DS, O), where DS = {D1, D2, ..., Dn} is a set of datasets Di that are to be processed and O is the operation to be executed. (For simplicity, we only consider one operation per workload. However, a set of operations is more realistic and can be added in future work.)

Goal: The goal is to choose the fastest algorithm Aj ∈ APO for every dataset Di ∈ DS for the execution of operation O:

  min Σ_{Di ∈ DS} Treal(Ak, Di)   with Ak ∈ APO                            (1)

Usage: The algorithm with the minimal estimated execution time for the dataset D is chosen for execution. The function choose_Algorithm (choose_Alg) chooses an algorithm according to the optimization criterion:

  choose_Alg(D, O) = Aj  with  Test(Aj, D) = min({Test(Ak, D) | Ak ∈ APO})  (2)

It is expected that the accuracy of the estimated execution times has a large impact on the decisions and on the model quality.

Practical Use: The execution of the fastest algorithm reduces the response time of the system and results in better performance of a DBS. This response time optimization is necessary to accelerate time critical tasks. (A task is an operation in execution.) Furthermore, it is possible to automatically fine-tune the algorithm selection on a specific hardware configuration.
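A compact sketch of formula (2), assuming the illustrative Algorithm objects and estimate() helper from the earlier sketches; choose_alg is our own name for the choose_Alg function defined above.

    def choose_alg(dataset, algorithm_pool):
        """Formula (2): return the algorithm Aj with the minimal Test(Aj, D)."""
        data_size = len(dataset)
        return min(algorithm_pool, key=lambda alg: alg.estimate(data_size))

    # The chosen algorithm implicitly decides whether the operation runs on the
    # CPU or on the GPU, because each algorithm targets exactly one device:
    #   result = choose_alg(dataset, pool).run(dataset)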
3. MODEL QUALITY CRITERIA

In this section, we present four model quality measures, namely the average percentage estimation error, the hit rate, the model quality, and the percentage speed increase.

3.1 Average Percentage Estimation Error

The idea of the average percentage estimation error is to compute the estimation error for each executed algorithm and the corresponding estimation value. Then, the absolute percentage estimation error is computed. Finally, the average of all computed percentage estimation errors is computed. This measure is also called the relative error and is used, e.g., in [1].

3.2 Hit Rate

The idea of this measure is to compute the ratio of the number of correct decisions to the total number of decisions. A decision is correct if and only if the model decided for the fastest algorithm. In the ideal case all decisions are correct, so the hit rate is 100%. In the worst case all decisions are wrong, so the hit rate is 0%. The benefit of the hit rate is that it provides statistical information about how many decisions are wrong: if the hit rate is X, then on average every 1/(1-X)-th decision is incorrect. However, we cannot quantify the impact of an incorrect decision. This is addressed by the model quality. Note that algorithms with very similar execution time curves can lead to a bad hit rate, although the overall performance is acceptable.
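Both measures can be computed directly from logged decisions. The sketch below assumes a simple log of tuples (Test of the chosen algorithm, Treal of the chosen algorithm, fastest Treal among all algorithms for that dataset); this log format is our own construction, not part of the paper.

    def average_percentage_estimation_error(log):
        """Relative error: mean of |Treal - Test| / Treal, in percent."""
        errors = [abs(real - est) / real for est, real, _ in log]
        return 100.0 * sum(errors) / len(errors)

    def hit_rate(log):
        """Share of decisions for which the chosen algorithm was indeed the fastest."""
        hits = sum(1 for _, real, best_real in log if real <= best_real)
        return 100.0 * hits / len(log)

    # log entries: (Test(chosen, D), Treal(chosen, D), min over all Ak of Treal(Ak, D))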
3.3 Model Quality

The idea of the model quality is to compute the ratio of the execution times that a system using an ideal model and a system using the real model need to process a workload W. In the ideal case, the real model is as good as the ideal model; hence, the model quality is 100%. The worst case model would always select the slowest algorithm and provides a lower bound for the model quality.

Let TDM(W) be the time a system using a decision model (DM) needs to process a workload W. It is computed as the sum of all algorithm execution times resulting from the algorithm selections of the decision model for the workload, plus the overhead caused by DM. Let TOh(DM, W) be the overhead the decision model DM causes. If a decision model introduces too much overhead, it can eliminate its own gain. Hence, the overhead has to be considered; it is the sum of the total time needed for computing estimation values and the total time needed to re-compute approximation functions. The time needed for all estimation value computations (EVC) for a workload W is TEVC(W), and the time needed to re-compute the approximation functions of all algorithms is TRC(W). Both measures are highly dependent on DM. Hence, we get the following formulas:

  TOh(DM, W) = TEVC(W) + TRC(W)                                    (3)

  TDM(W) = Σ_{D ∈ DS} Treal(choose_AlgDM(D, O), D) + TOh(DM, W)    (4)

Let DMideal be the ideal decision model and let DMreal be the real decision model. Then, the model quality MQ is defined as:

  MQ(W, DMreal) = TDMideal(W) / TDMreal(W)                         (5)

This measure describes to which degree the optimization goal described in formula (1) is achieved. Note that the ideal model does not introduce overhead (TOh(DMideal, W) = 0).

3.4 Percentage Speed Increase

The idea of the percentage speed increase (PSI) is to quantify the performance gain if a decision model DMi is replaced with a decision model DMj:

  PSI(DMi → DMj, W) = (TDMi(W) - TDMj(W)) / TDMi(W)                (6)

If PSI(DMi → DMj, W) is greater than zero, DMj performed better than DMi, and vice versa.
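To show how formulas (3) to (6) fit together, the following sketch computes TDM(W), the model quality, and the percentage speed increase from per-decision timings; the list-based inputs are our own simplification of the definitions above.

    def total_time(real_times, overhead=0.0):
        """Formula (4): TDM(W) = sum of the chosen algorithms' real times + TOh(DM, W)."""
        return sum(real_times) + overhead

    def model_quality(t_ideal, t_real):
        """Formula (5): MQ(W, DMreal) = TDMideal(W) / TDMreal(W); 1.0 means ideal."""
        return t_ideal / t_real

    def percentage_speed_increase(t_dm_i, t_dm_j):
        """Formula (6): greater than zero means DMj processed the workload W faster."""
        return (t_dm_i - t_dm_j) / t_dm_i

    # Example: the ideal model picks the per-dataset minimum and causes no overhead.
    #   t_ideal = total_time(best_real_times)                            # TOh(DMideal, W) = 0
    #   t_real  = total_time(chosen_real_times, overhead=t_evc + t_rc)   # formula (3)
    #   print(model_quality(t_ideal, t_real), percentage_speed_increase(t_real, t_ideal))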
4. RELATED WORK

In this section, we present the related work. We consider analytical models for estimating the execution times of GPU algorithms. Furthermore, we discuss learning based approaches for estimating execution times. Finally, we present existing decision models and provide a comparison to our approach.

Analytical Models: Hong et al. present an analytical model that estimates the execution times of massively parallel programs [8]. The basic idea is to estimate the number of parallel memory requests. The approach uses information like the number of running threads and the memory bandwidth. Depending on the case study, relative estimation errors between 5.4% and 13.3% are observed.

Zhang et al. develop a model that is intended to help optimize GPU programs by means of qualitative performance analyses [16]. They use a micro benchmark-based approach that needs hardware specific parameters, e.g., the number of processors or the clock rate. The model cannot adapt to load changes and is therefore not considered for use in our approach. The observed relative estimation error lies between 5% and 15%.

Learning based Execution Time Estimation: Akdere et al. investigate modeling techniques for analytical workloads [1]. They estimate the execution behavior of queries at different granularities and present modeling approaches for estimation at the query level and at the operation level. The basic idea of their approach is to perform a feature extraction on queries and to compute execution time estimations based on these features.

Matsunaga et al. present a short overview of machine learning algorithms and their fields of application [11]. Their goal is the estimation of the resource usage of an application; one considered resource is the execution time. The developed method PQR2 needs a few milliseconds to compute an estimation. Since we need n·m computations for n datasets and m algorithms, the time for a single computation should be less than 50µs to keep the overhead at an absolute minimum. Hence, it remains to be investigated whether the approach of Matsunaga et al. achieves good results for a response time minimal operation distribution. This can be done in future work.
Zhang et al. present a model for predicting the costs of complex XML queries [15]. They use a statistical learning method called the "transform regression technique". The approach allows a self-tuning query optimizer that can dynamically adapt to load condition changes. The approach of Zhang et al. and our model are similar in their basic functionality. The difference to our approach is that Zhang et al. estimate the execution times of XML queries, whereas our approach estimates the execution times of single database operations in order to dynamically distribute them on CPU and GPU. Additionally, we use a different learning method that is optimized for computational efficiency.

Decision Models: Kerr et al. present a model that predicts the performance of similar application classes for different processors [10]. The approach allows choosing between a CPU and a GPU implementation. This choice is made statically, in contrast to our work, where an algorithm for an operation execution is chosen dynamically at runtime. Kerr et al. use only parameters that are statically known before program execution. Hence, their approach allows no adaption to load changes, in contrast to our model.

Iverson et al. develop an approach that estimates the execution times of tasks in the context of distributed systems [9]. The approach, similar to our model, does not require hardware specific information. They use the learning method k-nearest-neighbor, a non-parametric regression method. We use least squares and cubic splines, which are parametric regression methods and need less time for computing estimations than non-parametric regression methods. The goal of Iverson et al. is an optimal selection of the nodes in a distributed system on which tasks are executed. Unlike this approach, our work aims at optimal algorithm selection. It is possible to apply the approach of Iverson et al. to a hybrid CPU/GPU platform. However, we consider the GPU a coprocessor of the CPU. Hence, a CPU/GPU platform is not a distributed system from our point of view. In this sense, our model is less general than the model of Iverson et al.

5. CONCLUSION

In this paper, we addressed a current research problem, namely the optimal distribution of database operations on hybrid CPU/GPU platforms. We developed a self-tuning decision model that is able to distribute database operations on CPUs and GPUs in a response time minimal manner. We discussed the basic structure of the model and provided a qualitative argumentation of how the model works. Additionally, we presented suitable model quality measures, which are required to evaluate our model. Our experiments show that our model is almost as good as the ideal model if the model parameters are set appropriately. We omit the evaluation due to limited space. We conclude that distributing database operations on CPUs and GPUs has a large optimization potential. We believe our model is a further step towards addressing this issue.

In ongoing research, we use the defined quality measures to evaluate our model on suitable use cases. A further possible extension is using the model in a more general context where response time optimal decisions have to be made, e.g., optimal index usage. Furthermore, an extension of the model from operations to queries is necessary for better applicability of our model. This leads to the problem of hybrid query plan optimization.

6. REFERENCES

[1] M. Akdere and U. Çetintemel. Learning-based query performance modeling and prediction. IEEE, 2012.
[2] ALGLIB Project. ALGLIB. http://www.alglib.net/, 2012. [Online; accessed 05-January-2012].
[3] A. Ralston and P. Rabinowitz. A First Course in Numerical Analysis. Dover Publications, second edition, 2001.
[4] P. Bakkum and K. Skadron. Accelerating SQL database operations on a GPU with CUDA. GPGPU '10, pages 94-103, New York, NY, USA, 2010. ACM.
[5] N. K. Govindaraju, B. Lloyd, W. Wang, M. Lin, and D. Manocha. Fast computation of database operations using graphics processors. SIGMOD '04, pages 215-226, New York, NY, USA, 2004. ACM.
[6] B. He, M. Lu, K. Yang, R. Fang, N. K. Govindaraju, Q. Luo, and P. V. Sander. Relational query coprocessing on graphics processors. ACM Trans. Database Syst., 34:21:1-21:39, December 2009.
[7] B. He, K. Yang, R. Fang, M. Lu, N. Govindaraju, Q. Luo, and P. Sander. Relational joins on graphics processors. SIGMOD '08, pages 511-524, New York, NY, USA, 2008. ACM.
[8] S. Hong and H. Kim. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. ACM SIGARCH Computer Architecture News, 37(3):152, 2009.
[9] M. A. Iverson, F. Ozguner, and G. J. Follen. Run-time statistical estimation of task execution times for heterogeneous distributed computing. HPDC '96, pages 263-270, Washington, DC, USA, 1996. IEEE Computer Society.
[10] A. Kerr, G. Diamos, and S. Yalamanchili. Modeling GPU-CPU workloads and systems. GPGPU '10, pages 31-42, New York, NY, USA, 2010. ACM.
[11] A. Matsunaga and J. A. B. Fortes. On the use of machine learning to predict the time and resources consumed by applications. CCGRID, pages 495-504, 2010.
[12] NVIDIA. NVIDIA CUDA C Programming Guide. http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf, Version 4.0, 2012. Pages 1-5. [Online; accessed 1-February-2012].
[13] H. Pirk, S. Manegold, and M. Kersten. Accelerating foreign-key joins using asymmetric memory channels. ADMS '11, pages 585-597. VLDB Endowment, 2011.
[14] S. Walkowiak, K. Wawruch, M. Nowotka, L. Ligowski, and W. Rudnicki. Exploring utilisation of GPU for database applications. Procedia Computer Science, 1(1):505-513, 2010.
[15] N. Zhang, P. J. Haas, V. Josifovski, G. M. Lohman, and C. Zhang. Statistical learning techniques for costing XML queries. VLDB '05, pages 289-300. VLDB Endowment, 2005.
[16] Y. Zhang and J. D. Owens. A quantitative performance analysis model for GPU architectures. Computer Engineering, pages 382-393, 2011.