=Paper=
{{Paper
|id=Vol-2126/paper1
|storemode=property
|title=Adaptive Data Processing in Heterogeneous Hardware Systems
|pdfUrl=https://ceur-ws.org/Vol-2126/paper1.pdf
|volume=Vol-2126
|authors=Bala Gurumurthy,Tobias Drewes,David Broneske,Gunter Saake,Thilo Pionteck
|dblpUrl=https://dblp.org/rec/conf/gvd/GurumurthyDBSP18
}}
==Adaptive Data Processing in Heterogeneous Hardware Systems==
Bala Gurumurthy, Tobias Drewes, David Broneske, Gunter Saake, Thilo Pionteck
Otto-von-Guericke-Universität
bala.gurumurthy@ovgu.de, tobias.drewes@ovgu.de, david.broneske@ovgu.de, gunter.saake@ovgu.de, thilo.pionteck@ovgu.de
ABSTRACT

In recent years, Database Management Systems (DBMS) have seen advancements in two major sectors, namely functional and hardware support. Before, a traditional DBMS was sufficient for performing a given operation, whereas a current DBMS is required to perform complex analytical tasks like graph analysis or OLAP. These operations require additional functions to be added to the system for processing. A similar evolution is seen in the underlying hardware. This advancement in both the functional and the hardware domain of a DBMS requires modification of its overall architecture. Hence, it is evident that an adaptable DBMS is necessary for supporting this highly volatile environment. In this work, we list the challenges present for an adaptable DBMS and propose a conceptual model for such a system that provides interfaces to easily adapt to software and hardware changes.

1. INTRODUCTION

In recent years, a traditional Database Management System (DBMS) is required to also perform various complex operations that are not directly supported by it. To perform these special analytical operations, several tailor-made functions are integrated into the DBMS [12]. Further, the efficiency of a DBMS depends on the throughput of the underlying hardware. DBMS operations are ported to different specialized hardware to achieve better throughput. This heterogeneity of functionalities and hardware systems requires modifications of the existing DBMS structure, incurring additional complexities and challenges.

In the current context, various analytical tasks such as graph processing or data mining are executed in DBMSs [8, 9]. These functionalities are ported to a DBMS with the overhead of altering the overall architecture of the system. However, modification of the complete system structure is time consuming, and also not all components are tuned for efficiency [12].

Further, the throughput of any software system depends on its underlying hardware. Many specialized compute devices are fabricated to perform certain specific tasks efficiently. These devices trade off generality for higher efficiency in certain tasks. They are used as additional co-processors along with the CPU for better throughput. One of the commonly used co-processors is the GPU, which is mainly used for enhancing graphical processing in a system. The parallelism of GPUs has already been exploited extensively for several DBMS operations [3, 1]. Similarly, other devices are available, such as MICs (Many Integrated Cores), APUs (Accelerated Processing Units), FPGAs (Field Programmable Gate Arrays), etc., that could be exploited for efficiently executing DBMS operations. The major challenge in integrating this hardware is the execution of the device-specific variant of the same DBMS operation, optimized for the given hardware.

Thus, the availability of different hardware enables a new level of parallelism that we call cross-device parallelism. At this level, a user-given query is executed in parallel among different devices for concurrent execution. Along with cross-device parallelism, we also have the traditional pipeline and data-parallel execution of functions to increase efficiency. These dimensions of parallelism incur the additional complexity of efficiently traversing them to determine the optimal execution path for a given query.

Along with optimization, the heterogeneity of hardware requires concepts for reducing the data transfer cost among different devices. In a main-memory DBMS, the transfer bottleneck exists between main memory and the co-processing devices. Hence, it is also crucial to minimize the data transfer time to improve the efficiency of DBMS processing.

Hence, the heterogeneity of functions and hardware poses multiple challenges to be addressed and requires a system that is adaptable to these changes. In this work, we provide our insights on the challenges present in developing an adaptive database system and the techniques for overcoming these challenges. As there could be more functionalities and hardware available in the future to be integrated into a DBMS, we focus on a plug'n'play architecture that enables the addition of these newer functions and hardware with considerably less overhead than upgrading the complete architecture. This architecture provides interfaces for integrating different functionalities and hardware into the DBMS with less effort.

30th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 22.05.2018 - 25.05.2018, Wuppertal, Germany.
Copyright is held by the author/owner(s).
The main contributions of our work are:

• The existing challenges for an adaptive DBMS in the context of hardware and software heterogeneity.

• The concepts for developing an adaptable DBMS with plug'n'play capabilities.

The subsequent paper is structured as follows. In Section 2, we provide an overview of the different devices used for DBMS and list the challenges in using them. In Section 3, we discuss the various challenges present due to functional and hardware heterogeneity in DBMS, and in Section 4, we provide our concepts for developing an adaptable DBMS that addresses these challenges. The conceptual discussions in this paper have already been explored in different works, which we detail in Section 7. Finally, we provide the summary of the paper in Section 8.

2. DBMS IN HETEROGENEOUS HARDWARE ENVIRONMENT

Relying on CPUs as the workhorse is approaching the limits of their efficiency [2]. Multiple works have been conducted to port the existing database operations to different hardware. We discuss these devices and their DBMS support below.

GPU

While a single GPU core has a lower clock frequency than a CPU core, a GPU features several hundreds of them compared to the several tens of cores that current CPUs offer. However, memory accesses in particular have a high latency that needs to be hidden by processing. To this end, GPUs spawn multiple threads for a given function and perform context switching to hide the latency. This massive parallelism in GPUs is useful for performing data-intensive DBMS operations. Some of the DBMSs using GPUs are CoGaDB, GPUDB, etc. [3, 7].

The major open challenges in using GPUs for DBMS are:

1. Cost models for determining the operator to execute on the GPU at runtime

2. Combined query compilation and execution strategies for CPU and GPU.

FPGA

Another hardware that has gained much attention in recent years is the FPGA (Field Programmable Gate Array). FPGAs are programmed either using RTL (Register Transfer Level) languages (VHDL, Verilog) or via HLS (High-Level Synthesis), where the circuits are extracted, for example, from C or OpenCL code. This provides a platform that can be tuned to perfection for any given domain-specific operation, providing higher throughput.

The open challenges in using FPGAs are:

1. Selection and placement of operators for partially reconfigurable implementations

2. Efficient pipelining between different operations at runtime

Other Devices

Other hardware used for DBMS are MICs (Many Integrated Cores) and APUs (Accelerated Processing Units). In case of the MIC, there are multiple CPU cores available for processing, connected with each other using an on-chip bus system. These processors are capable of performing complex computations. APUs, in contrast, have both a CPU and a GPU on a single die. Here, both CPU and GPU have access to the same memory space (i.e., main memory).

3. CHALLENGES IN HETEROGENEOUS ENVIRONMENTS

To have a DBMS adaptable to both changing hardware and software, the following challenges have to be addressed.

3.1 Device Features

Adding a DBMS operation to a new processing device requires novel ways to exploit the device without compromising the overall system design. Hence, one of the major challenges is to reorganize the processing functions based on the hardware features available and also to adapt the underlying functions for efficient execution on the device.

3.2 Abstraction Hierarchy

It has been shown that speedups for any particular database operation can be achieved by performing device-specific parameter tuning of the given operation [5]. Removing these device-specific parameters aids adaptability but makes it hard to tune for optimal execution, as each device has its own advantages. Due to this polarity between abstraction and specialization of functions and devices, we must find a good abstraction level for the operations that provides both an interface to write new functions and also exploits the hardware for optimal efficiency.

3.3 Parallelism Complexity

The growth of DBMS at both the functional and the hardware level provides various parallelization opportunities. The presence of multiple devices creates an additional paradigm: cross-device parallelization. Using this type of parallelism, the given query is divided into granular parts based on the level of abstraction selected, and these functional primitives are distributed among the different processing devices for parallel processing. We detail the different types of parallelization below.

Functional Parallelism

In multiple instances, the incoming queries have various sub-operations that run independently of each other. One common example is the availability of multiple selection predicates combined using logical operations. These predicates can be executed in parallel among the different devices, and the results are combined in the next steps. Thus, identifying and dissecting these parallel operations provides additional capabilities for simultaneous execution in the form of functional parallelism. The major challenge in this parallelism is the intermediate step of materializing the results to be processed by the next operator in the pipeline. There is also a synchronization overhead present in this parallelism due to the differences in the execution time of the different processing devices.
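The predicate-level functional parallelism described above can be sketched in a few lines. The snippet below is an illustrative analogue only, not the paper's OpenCL implementation: two selection predicates (with invented example data) are evaluated concurrently by worker threads standing in for two devices, and their bitmaps are combined with a logical AND, which is exactly the synchronization point mentioned above.

```python
from concurrent.futures import ThreadPoolExecutor

def selection(column, predicate):
    """Selection primitive: emit a bitmap, one bit per value."""
    return [1 if predicate(v) else 0 for v in column]

def logical_and(bitmap_a, bitmap_b):
    """Logical primitive: combine bitmaps of independent predicates."""
    return [a & b for a, b in zip(bitmap_a, bitmap_b)]

# Invented toy columns.
quantity = [10, 30, 45, 5, 60]
discount = [0.05, 0.06, 0.02, 0.07, 0.06]

# Run the two independent selections in parallel (functional parallelism);
# each future could, in principle, be dispatched to a different device.
with ThreadPoolExecutor(max_workers=2) as pool:
    f1 = pool.submit(selection, quantity, lambda q: q < 40)
    f2 = pool.submit(selection, discount, lambda d: d >= 0.05)
    bitmap = logical_and(f1.result(), f2.result())  # synchronization point

print(bitmap)  # -> [1, 1, 0, 1, 0]
```

The `.result()` calls make the cost visible: the combining step blocks until the slower of the two predicates finishes, which is the synchronization overhead the text describes.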
Data Parallelism

In contrast to functional parallelism, data parallelism does not split an operation into different functions but executes the same operation on different partitions of the data concurrently. This method has a similar synchronization overhead of waiting for all the devices to finish processing. The major disadvantage of this parallelism is the additional step of merging the results from the different devices.
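A minimal sketch of this data-parallel execution (illustrative only; device dispatch and the OpenCL kernels are replaced by threads and invented data): the same selection runs on disjoint partitions of a column, and a final merge step concatenates the partial bitmaps, mirroring the merge overhead noted above.

```python
from concurrent.futures import ThreadPoolExecutor

def selection(partition, predicate):
    """The same operation, applied to one partition of the data."""
    return [1 if predicate(v) else 0 for v in partition]

def partition(column, n):
    """Split a column into chunks of roughly equal size."""
    size = (len(column) + n - 1) // n
    return [column[i:i + size] for i in range(0, len(column), size)]

column = [12, 55, 7, 80, 33, 41, 9, 66]
parts = partition(column, 4)  # e.g. one chunk per device

with ThreadPoolExecutor(max_workers=4) as pool:
    partial = list(pool.map(lambda p: selection(p, lambda v: v < 40), parts))

# Merge step: all workers must finish before the result is usable.
bitmap = [bit for chunk in partial for bit in chunk]
print(bitmap)  # -> [1, 0, 1, 0, 1, 0, 1, 0]
```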
Cross-Device Parallelism

The above-mentioned functional and data-level parallelism are decided after the selection of processing devices. As mentioned earlier, each device has its own perks, which must be utilized to the maximum extent. Hence, it is necessary to decide on the implementation details for the given device that exploit the hardware for efficient execution. Moreover, the above-mentioned parallelization strategies can also be realized at the device level. In terms of device-level functional parallelism, this could be multiple operators running in parallel on different devices or in a pipeline with communication between the devices. Similarly, data parallelism could also be realized via suitable cost functions for operations on devices.
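One simple way such cost functions can drive the device choice is a greedy assignment: estimate the runtime of each primitive on each device and place it on the cheapest one. The cost numbers below are entirely made up for illustration and are not the paper's cost model; in a real system they would come from calibration or runtime monitoring.

```python
# Hypothetical per-device cost estimates (ms) for each primitive.
COSTS = {
    "selection":   {"CPU": 4.0, "GPU": 1.5, "FPGA": 2.0},
    "materialize": {"CPU": 2.0, "GPU": 3.5, "FPGA": 2.5},
    "reduce":      {"CPU": 3.0, "GPU": 1.0, "FPGA": 1.2},
}

def assign_devices(primitives, costs):
    """Greedy placement: pick the cheapest estimated device per primitive."""
    return {p: min(costs[p], key=costs[p].get) for p in primitives}

plan = assign_devices(["selection", "materialize", "reduce"], COSTS)
print(plan)  # -> {'selection': 'GPU', 'materialize': 'CPU', 'reduce': 'GPU'}
```

A greedy per-primitive choice ignores transfer costs between devices, which is precisely why the later sections argue for a global optimizer.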
3.4 Optimization Strategies

The different levels of parallelism for executing a query provide additional opportunities for fine-tuning the operations, but introduce the complexity of selecting the optimal execution path. As the decision at the top level of parallelism influences the subsequent levels, selecting the right execution path for a given query is critical. However, the important drawback of this multi-level parallelism model is the search space explosion. There are various options available at any given level, and thereby multiple combinations in total to select from. This search space of parallelism has to be traversed to find the optimal execution path. Deciding the optimal path of a single operation in a query can already be complex (e.g., join order optimization), and the new dimension of multiple devices increases the complexity further. Hence, newer methods for exploring the various optimization opportunities have to be determined.

4. ADAPTABLE DBMS

The mentioned challenges require a DBMS architecture that efficiently handles the diversity in both functionality and hardware. Based on the challenges, we have identified areas to be explored for designing an adaptive DBMS.

For a better explanation of the challenges, we use TPC-H query 6 as our motivating example. The query selects data from multiple columns, multiplies the results, and outputs the aggregate. These three operations are in turn executed using multiple granular primitive functions. The different primitives used for processing the given query are:

• Selection selects the values from the given column. Bitmaps are used as the output format to reduce the data transfer size, as each bit carries the selection information of a single value.

• Logical Operation performs logical functions on the bitmaps produced by the different selections.

• Materialize extracts the selected column values using the given bitmap input.

• Arithmetic performs arithmetic operations over column values.

• Reduce performs aggregation of values.

The flow of execution of query 6 using these primitives is given in Figure 1. The figure also illustrates the different optimizations that can be done even for a simple query like this one. We discuss the various optimization strategies in the subsequent sections.

Figure 1: Optimization strategies for TPC-H Q6

4.1 Granularity of Operation

One of the main challenges in the proposed adaptable system is the level of granularity required for optimized processing. Based on the capabilities of the devices, we could either run a few complex operations or split them into more granular sub-operations and then execute those in parallel among multiple devices.

At the top level, each database operation acts as a set of primitives connected together to provide a final result. The more granularly a function is split, the more hardware sensitivity comes into play. For example, the access patterns in CPU and GPU differ for efficient processing. Further, database operations are data centric, where every operation is applied to a massive amount of data. To aid parallel data processing, we propose the use of explicitly data-parallel primitives to be combined into complete DBMS operations. There are many works on primitive-based DBMS query processing. He et al. propose multiple primitives such as Split, Filter, etc., for GPUs [7]. Other primitives such as prefix-sum and its variants, scatter, and gather are also proposed for efficient data-parallel execution [6]. This approach provides a fail-safe: when a newer device is added, the primitives could still run on it with minor changes to the functionality. The availability of different granular levels provides the additional benefit of enabling developers to replace inefficient fine-granular primitives with custom coarse-granular ones.

Also at an abstract level, the characteristics of the primitive itself can affect system throughput. The choice of output format and the number of intermediate steps are some of the characteristics that influence the overall system. For example, using bitmap results from selections on external devices will generally be more efficient than transferring the complete column.
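To make the primitive decomposition concrete, here is a toy, pure-Python rendition of the five Q6 primitives chained together. This is a simplified stand-in for the data-parallel OpenCL kernels, with invented example data; discounts are kept as whole percent values so the arithmetic stays exact.

```python
def selection(column, predicate):
    """Selection: bitmap output, one bit per value."""
    return [1 if predicate(v) else 0 for v in column]

def logical_and(*bitmaps):
    """Logical: AND the bitmaps of the individual predicates."""
    return [int(all(bits)) for bits in zip(*bitmaps)]

def materialize(column, bitmap):
    """Materialize: extract the column values selected by the bitmap."""
    return [v for v, bit in zip(column, bitmap) if bit]

def arithmetic(col_a, col_b):
    """Arithmetic: element-wise multiplication (price x discount)."""
    return [a * b for a, b in zip(col_a, col_b)]

def reduce_sum(column):
    """Reduce: aggregate the materialized products."""
    return sum(column)

# Invented toy lineitem data, mirroring Q6's predicate structure.
extendedprice = [100, 200, 300, 400]
discount_pct  = [6, 1, 7, 6]   # discount in whole percent
quantity      = [10, 5, 30, 2]

bitmap = logical_and(
    selection(discount_pct, lambda d: 5 <= d <= 7),
    selection(quantity, lambda q: q < 24),
)
revenue = reduce_sum(
    arithmetic(materialize(extendedprice, bitmap),
               materialize(discount_pct, bitmap)))
print(bitmap, revenue)  # -> [1, 0, 0, 1] 3000
```

Each function is independently replaceable, which illustrates the fail-safe argument above: a coarse-granular fused kernel could substitute for any sub-chain without changing the surrounding composition.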
4.2 Code Fusion

Implementing primitives at multiple granular levels becomes time consuming. Hence, code could be generated at runtime for the given granularity level of the operation. The code for execution on an individual device is generated by combining the primitives for the corresponding device into a single execution process. This reduces the overhead of materializing data from intermediate steps.

Figure 2: Code synthesis - selection

For example, the three selection predicates shown in Figure 2 can either be run on different devices (left), with the results combined using logical operations, or all be combined into a single execution (right).

4.3 In-Device Cache

The current data-transfer bottleneck is between the main memory and the processing devices themselves. The CPU has faster access than other devices, as it is directly linked to the main memory, whereas in the case of the co-processors, data must be transported via connections with higher latency and possibly more limited bandwidth, such as PCI Express. Thus, even highly efficient GPUs can have sub-optimal performance compared to CPUs due to limited access capabilities to main memory. Hence, using the device memory as a data cache is crucial for high compute throughput. However, these external devices have limited memory, so it is not always possible to store all the necessary data on the device itself. Thus, the host system must determine the hot set of data to be stored in the device memory using the execution plan for the given query and by monitoring the data transfer to the device.

4.4 Execution Variants

Each primitive selected for executing a given query can have different characteristics to choose from based on the executing device. For example, complex branching statements are handled efficiently by CPUs, whereas GPUs are capable of massive thread-level parallelism with less control flow. In addition, the data access pattern must be selected according to the memory architecture of the given device. For example, coalesced data access provides efficient memory access on the GPU. Finally, hardware-specific vectorization of DBMS operations (SIMD) is also an important parameter in database processing to exploit the hardware capabilities.

4.5 Device-Related Parameter Tuning

Finally, once we have decided on the device and its corresponding function to execute, certain device-related parameters like global and local work group sizes have to be tuned for further improvement of the overall efficiency. These device-related parameters are tuned for efficiency by monitoring the performance of the execution. A feedback loop from the devices provides execution-specific information used for tuning the primitive for higher efficiency.

Other than the above-mentioned challenges, one of the major challenges is to formulate an order for applying these strategies to extract an efficient execution plan. Since all the strategies mentioned above are inter-dependent, the selection of one depends on the others. In order to have a standardized execution flow, we propose an architecture that has all the necessary components for applying the above strategies.

5. CONCEPTUAL ARCHITECTURE

As mentioned earlier, the overall efficiency of processing a query in a heterogeneous environment requires all the mentioned optimization strategies to be applied to a given query. To aid this, we propose a DBMS architecture that provides a structure to handle the optimization from the global abstraction level down to the local device-specific levels. The structure is shown in Figure 3.

Figure 3: Conceptual Architecture

The given query is first subjected to the general logical optimization and access path selection steps. Global optimization is done over the resultant query from the logical and access path selection steps. This step determines the level of granularity for the given query. Once selected, these granular operations are provided to their respective devices based on the decision made using the hardware supervisor. Finally, local optimization is done on the granular operations to tune them for their respective devices, and the kernels that always work together are combined. The components used for these optimizations are discussed below.
Global Optimizer: The global optimizer processes the complete query and provides the best execution plan for the whole query. It decides on the level of granularity to be used as primitives. In addition to the different granularities, the parallelism strategies (i.e., pipeline and data) are also selected here. The different schemas available for executing a single query lead to a search space explosion. Traversing the whole design space might be time consuming, and hence a machine learning based cost estimation algorithm is used [4].

Hardware Supervisor: The hardware supervisor provides statistical information about the underlying devices. This helps in improving the decisions made by the global optimizer. It combines the characteristics of the individual devices into a single integrated system view. It also communicates with the devices and supervises the execution of operations on the individual devices.

Storage Manager: The storage manager provides information about the location and availability of the data to be processed. This aids in determining the transfer costs in order to aid selecting the execution device. Also, it is evident that not all devices have direct access to the main memory. Hence, it is the task of the storage manager to partition and transfer data to the respective devices.

Device Manager: Each device manager has two sub-components: a monitor and a local optimizer. Monitors provide device-specific information, and the local optimizer holds information about the primitives implemented on the corresponding device and also about the current workload. It uses this information to perform device-specific optimizations to further increase the processing efficiency of a given operation.

6. PRELIMINARY EVALUATION

To evaluate the efficiency of the different parallelism mechanisms, we executed TPC-H query 6 by combining five different primitives, namely Bitmap, Logical, Materialize, Arithmetic and Reduce. All these primitives are data parallel and are implemented using OpenCL. The execution path for the query is shown in Figure 1. For our evaluation, we considered four different execution models as explained below.

Baseline linear execution: In the baseline version, we execute the linear compiled Q6 query without parallelism or primitives. The result of this execution is used as a benchmark to compare with the other parallel implementations.

Single Device Primitives (SDP): In the single device primitive version, the parallel primitives mentioned above are executed on a single device. The results for the complete execution of the parallel primitives on both CPU and GPU are recorded for analysis.

Multiple Device Pipelined (MDP): In the multiple device pipelined variant, we split the query into two phases, selection and aggregation, and execute them in a pipeline. We perform selection on the CPU and aggregation on the GPU (MDP - CPU + GPU) and vice versa (MDP - GPU + CPU), recording their results.

Cross-Device Functional Parallel (CDFP): Finally, the given query is split into functional units, and the independent units are executed concurrently on the devices.

All these models are executed on a machine running Ubuntu OS version 16.04 and gcc version 5.4.0 with an Intel Core i5 CPU and an Nvidia Geforce 1050 Ti GPU.

Figure 4: Execution model variants - results

From the results, we see that the single device execution model on the CPU has the lowest efficiency for processing Q6 and is even slower than the linear execution variant. This is due to the additional materialization step to be performed. In case of single device execution of the query on the GPU, the system is nearly 2.5x faster than the CPU variant and 2x faster than the scalar version.

For the multi device pipelined model, we see that the CPU selection with GPU reduce variant is 2x slower than its counterpart. The selection phase on the CPU takes considerable time for processing the select, logical and materialize primitives, whereas GPU selection has a higher execution time only for materializing the values.

Finally, we see that the cross-device functional parallelism model has the highest efficiency in processing the query. This is mainly due to the multiple selection predicates available in the query. The latency of execution is reduced by executing the selection and materialization steps in parallel. The detailed information on the execution of the individual primitives in this variant is shown in Figure 5.

Figure 5: Cross-device parallelism - results

From the chart, we see that the devices wait at multiple instances for the other to finish before continuing to process the query. In case of selection and materialization, the GPU waits until the CPU has processed its values before executing the next results. Also, the CPU is idle while the GPU is computing the results of the arithmetic, logical and aggregation operations.

From these results, we infer that using functional parallelism enhances the efficiency of query processing. The advantage of functional parallelism comes with the disadvantage of synchronization overheads due to differences in processing speed among the different devices.
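The two-phase MDP variant can be sketched as a producer/consumer pipeline. The snippet below is an illustrative Python analogue of the CPU/GPU pipeline (threads stand in for devices; batches and the filter are invented), showing how the aggregation phase consumes batches as the selection phase emits them.

```python
import queue
import threading

def selection_stage(batches, out_q):
    """Phase 1 (e.g. on the CPU): filter each batch, emit survivors."""
    for batch in batches:
        out_q.put([v for v in batch if v < 40])
    out_q.put(None)  # end-of-stream marker

def aggregation_stage(in_q, result):
    """Phase 2 (e.g. on the GPU): aggregate batches as they arrive."""
    total = 0
    while (batch := in_q.get()) is not None:
        total += sum(batch)
    result.append(total)

batches = [[10, 50, 30], [70, 20], [5, 90, 15]]
q, result = queue.Queue(), []

t1 = threading.Thread(target=selection_stage, args=(batches, q))
t2 = threading.Thread(target=aggregation_stage, args=(q, result))
t1.start(); t2.start(); t1.join(); t2.join()

print(result[0])  # -> 80
```

The blocking `in_q.get()` makes the pipeline's synchronization overhead visible: whenever one stage is slower, the other idles, which matches the wait times observed in Figure 5.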
7. RELATED WORK

Karnagel et al. have explored adaptivity in DBMS using primitives for executing a query [10]. They group a subset of primitives to be executed on a single device into an execution island and process them. They also use device-level caching to reduce transfer overhead. Once the intermediate result for an island is computed, an intermediate estimation step is done to select the subsequent devices. In our method, the execution path is given by an optimizer and is executed by the devices.

In terms of granularity of operators, He et al. have given a comprehensive set of data-parallel primitives that can be ported to various hardware [7]. Our research complements theirs by adding new primitives and additional functionalities to the already defined ones. Similarly, Pirk et al. have also given an abstracted set of primitives that could be used on various platforms [11].

8. CONCLUSION

We detailed in this paper the need for an adaptive architecture for DBMS that can be easily modified based on the underlying hardware and the software functionalities. In such an adaptable DBMS, the executable operations must be generalized for high interoperability, whereas device-specific operations are needed for higher efficiency. Along with the challenge of selecting the right abstraction level, there are multiple further challenges for an adaptable DBMS in a heterogeneous environment. Our main contribution in this work is the framework for overcoming these challenges with the concepts listed below:

• Granular levels for DBMS operations

• Device-specific code generation

• In-device data caching techniques

• Device and functional variants of operators

• Hardware- and functionality-based tunable parameters

The interfacing of the different components of a DBMS is a challenging task in itself. A plug'n'play architecture in DBMS removes these overheads by providing interfaces for supporting additional hardware and software. Also, an adaptable DBMS could additionally help in optimizing a new functionality that is formed by combining the given set of granular primitives, as the primitives are themselves tuned for efficiency. Finally, this adaptive architecture of DBMS de-couples the functional and device-based execution layers, thereby providing independence between an operation and its corresponding execution unit.

9. ACKNOWLEDGMENTS

This work was partially funded by the DFG (grant no.: SA 465/51-1 and PI 447/9).

10. REFERENCES

[1] P. Bakkum and K. Skadron. Accelerating SQL database operations on a GPU with CUDA. Proceedings of the Workshop on General-Purpose Computation on Graphics Processing Units, pages 94–103, 2010.
[2] S. Borkar and A. A. Chien. The Future of Microprocessors. Communications of the ACM, pages 67–77, 2011.
[3] S. Breß, M. Heimel, N. Siegmund, L. Bellatreche, and G. Saake. GPU-accelerated database systems: Survey and open challenges. Transactions on Large-Scale Data- and Knowledge-Centered Systems, pages 1–35, 2014.
[4] S. Breß and G. Saake. Why it is time for a HyPE. Proceedings of the International Conference on Very Large Databases, 6(12):1398–1403, 2013.
[5] D. Broneske, S. Breß, M. Heimel, and G. Saake. Toward hardware-sensitive database operations. In Proceedings of the International Conference on Extending Database Technology, pages 229–234, 2014.
[6] Y. Dotsenko, N. K. Govindaraju, P.-P. Sloan, C. Boyd, and J. Manferdelli. Fast Scan Algorithms on Graphics Processors. In Proceedings of the Annual International Conference on Supercomputing, pages 205–213, 2008.
[7] B. He, M. Lu, K. Yang, R. Fang, N. K. Govindaraju, Q. Luo, and P. V. Sander. Relational Query Coprocessing on Graphics Processors. ACM Transactions on Database Systems, pages 21:1–21:39, 2009.
[8] J. Hellerstein, C. Ré, F. Schoppmann, D. Z. Wang, E. Fratkin, A. Gorajek, K. S. Ng, C. Welton, X. Feng, K. Li, and A. Kumar. The MADlib Analytics Library or MAD Skills, the SQL. Proceedings of the International Conference on Very Large Databases, pages 1700–1711, 2012.
[9] C. Ji, Y. Li, W. Qiu, U. Awada, and K. Li. Big data processing in cloud computing environments. Proceedings of the International Symposium on Pervasive Systems, Algorithms and Networks, pages 17–23, 2012.
[10] T. Karnagel, D. Habich, and W. Lehner. Adaptive Work Placement for Query Processing on Heterogeneous Computing Resources. Proceedings of the International Conference on Very Large Databases, pages 733–744, 2017.
[11] H. Pirk, O. Moll, M. Zaharia, and S. Madden. Voodoo - A vector algebra for portable database performance on modern hardware. Proceedings of the International Conference on Very Large Databases, pages 1707–1718, 2016.
[12] K.-U. Sattler and O. Dunemann. SQL database primitives for decision tree classifiers. In Proceedings of the International Conference on Information and Knowledge Management, pages 379–386, 2001.