Heterogeneity-Aware Query Optimization Tomas Karnagel Supervised by Wolfgang Lehner Database Technology Group Technische Universität Dresden, Germany tomas.karnagel@tu-dresden.de ABSTRACT and define optimization concepts before proposing an ideal The hardware landscape is changing from homogeneous sys- system setup. We are currently in the implementation and tems towards multiple heterogeneous computing units within evaluation phase where we apply our optimizations within one system. For database systems, this is an opportunity to multiple open-source database systems. accelerate query processing if the heterogeneous resources can be utilized efficiently. For this goal, we investigate novel 2. MOTIVATION AND DIRECTION query optimization concepts for heterogeneous resources like As a starting point, we would like to present a single op- placement granularity, execution estimation, optimization erator case-study to motivate the direction of our research. granularity, and data handling. In the end, we combine these concepts in a specialized optimization stage during 2.1 Case Study: Group-By Operator query optimization together with a unique way of evaluat- For our case study, we use a hash-based group-by opera- ing our optimizations in existing database systems. tor on different CUs, implemented in OpenCL. The applied hash-table uses FNV1a as hash function and a fill factor of 1. INTRODUCTION 0.5, assuming the amount of groups is known from the opti- In the recent past, the database system’s performance has mizer. We implement the operator to scan only one column mainly been bound by disk accesses. With increasing main while storing the group name and a count value, as it would memory sizes, the bottleneck shifts towards computation as be used for the following SQL query: more and more data can be kept close to the processor. SELECT num, count(*) FROM numbers GROUP BY num; To increase the computational performance in homogeneous environments, parallel execution on multiple cores has been The input values (64 MB, 16.7 mio values) are in a range of studied. However, recent systems are becoming more and [0, #group) while being randomly distributed within the in- more heterogeneous, including different types of computing put column. We store the input column in the system’s main units (CUs) to improve efficiency and energy consumption, memory (RAM) and evaluate the full execution runtime in- ideally preventing dark silicon [2]. cluding zero-copy accesses, where the data is streamed to The main challenge for database systems is to adapt to the CU on demand. When executing the operator, we see the new heterogeneous hardware environment with its differ- several effects leading to partly severe performance issues ences in computing unit architectures, memory hierarchies, (Figure 1). In previous work [8], we explained the effects for and connections to the main memory. a Nvidia GPU in detail: Previous research has been mainly about porting oper- 1. The spikes are created by high hashing contention that ators to new hardware platforms like GPUs and FPGAs. mainly occurs using FNV1a with certain hash-table While this is important, single operators do not represent sizes and data distributions. full database systems with complex architectures and a vari- 2. For #groups <100, we see problems with atomic ac- ety of workloads. In recent work, full database systems with cesses because many threads try to update a small GPU support have been proposed [1, 3, 4, 9]. These sys- number of hash-table buckets simultaneously. tems allow detailed evaluation of heterogeneous execution, 3. For hash-tables >1.5 MB, the hash-table does not fit however, most of them do not understand the underlying in the GPU’s L2 cache for fast execution. hardware but merely execute a query on a predefined CU . 4. For hash-tables >2 GB, the execution experiences a In our work, we want to investigate dedicated query op- great slow-down through TLB cache problems. timization for heterogeneous computing resources. For this, Out of all effects, the spikes are the only ones that can the system needs to, first, understand the underlying hard- be seen on all CUs, since they are software-based issues. ware environment and to, second, utilize it automatically in For all CUs, they are occurring repeatably at exactly the the best possible way during query processing. We motivate same positions, however, the height of the spikes depend on our direction of research with a single-operator case study the CU . The other effects and the overall performance dif- fer greatly, which is caused by different cache sizes, different connection to the system (e.g., PCIe2 or 3), or entirely dif- Proceedings of the VLDB 2016 PhD Workshop, September 9, 2016. New ferent architectures. Comparing all 3 executions, no single Delhi, India. Copyright (c) 2016 for this paper by its authors. Copying permitted for CU is superior to the others. For our experiment, we tested private and academic purposes. more than 7000 different group sizes, where the Nvidia GPU (a) Nvidia K80 GPU (PCIe3) (b) AMD HD7950 GPU (PCIe2) (c) Intel Xeon Phi 7120 (PCIe2) Figure 1: Group-by operator on three different CUs. was the fastest in 71.4% of all cases, followed by the AMD to performance differences, however, they are not affecting GPU with 22.5%, and the Intel Xeon Phi with 6.1%. The the design of the heterogeneity-aware query optimizer. Xeon Phi will become more important for larger hash-tables Memory heterogeneity. At this point, we are not look- (>2GB) since the runtime is scaling much better. ing at different memory types such as non-volatile memory vs. volatile memory or SSD vs. HDD. Memory types are im- 2.2 Implementation Approaches for portant for persistence and recovery consideration, however, Heterogeneity-aware Database Systems we are focusing our research on compute heterogeneity. Based on the performance differences and the effects in our Distributed systems and network heterogeneity. case study, we see two directions to implement a database At the moment we are looking at single node systems with system using heterogeneous computing resources. a scale up approach by adding more CUs. However, our The first approach would be, to choose a single CU , e.g., findings can be easily reused in a distributed environment, the Nvidia GPU, and optimize the operator for ideal ex- where we can map transfer costs between CUs to transfer ecution on this particular CU . Previously, we did this for costs between nodes and a node can consist of multiple CUs. the group-by operator [8] by adjusting execution parame- ters and implementing algorithmic changes together with 3. OPTIMIZATION CONCEPTS an integrated optimizer to define the ideal configuration. The main part of this thesis is identifying and investi- For a full system approach, these adjustments need to be gating optimizer design choices to make database systems done for every operator in consideration of data sizes and heterogeneity-aware. As starting point, we assume a column data distribution, resulting in a high number of fine-grained based database system with a column-at-a-time approach optimizations. Once this huge effort is made, it probably re- since we mainly want to focus on large OLAP queries. In sults in the best possible performance for the supported CU , the following, we want to present multiple design choices and however, it is not portable. To support a different hardware brief discussions on the most promising directions. Please setup, the optimization effort for each database operator has refer to the cited papers for more details. to be revisited, adjusted, and fine-tuned. This can only be done by large development teams, while limiting the support 3.1 Placement Granularity to only a few selected CUs. As a main idea, we want to place parts of a database query The second approach, which is explored in this work, is on CUs, where they show the best execution time in con- more adaptive. Instead of understanding and optimizing ev- sideration of data transfer costs. However, the granularity ery single effect of each operator on each CU , we propose to of work, which is actually placed, needs to be defined. In support as many CUs as possible, while dynamically defin- query processing, we see three possible granularities. ing the execution location (operator placement) depending Query Granularity. One single placement decision is on the best runtime. Having multiple CUs to choose from made for a whole query, which is then executed on one gives us the opportunity to execute on the ideal CU for a CU . This can be beneficial when there are many concur- given operator and workload. For the few CUs supported by rent queries that need to be executed, so that all CUs can the first approach, the performance will be lower, because be used concurrently. the operator implementations are less optimized. However, Database Operator Granularity. One placement de- it will provide the best possible performance for any given cision is made for each database operator, leading to a het- setup of operators and CUs, without the huge effort of fine- erogeneous execution within a single query. tuning. Additionally, it is highly portable since there are no Sub-Operator Granularity. Sub-operators are reusable hard-coded hardware-specific optimizations. execution functions of an operator, e.g., a hash join may consist of a hash-table creation and a hash-table probe, and 2.3 Distinction therefore it has two sub-operators. The same hash-table cre- Following the adaptive approach, we focus on query op- ation step can be part of a hash based group-by implementa- timization for heterogeneous computing resources, instead tion. This granularity allows a fine-grained match between of building an entirely new database system. Furthermore, execution behavior and CU . there are several related topics that we specifically exclude We choose the sub-operator granularity as the most promis- at this point of time: ing approach with its fine grained decisions. In the remain- Specific operator implementations. Operator imple- der of this paper, we will use the term operator, for the mentations are important but have been researched exten- placement object to show the general applicability of the sively over the past 10 years. Different implementations lead proposed approaches. 3.2 Estimation Model an ideal system setup with heterogeneous resource optimiza- Before optimizing query processing on heterogeneous com- tion to utilize these resources in the best possible way. puting resources, the database system needs to know the execution time of operators. Traditionally, cardinality esti- 4.1 System Integration mation was used in order to find the best query plan. With The first question is the integration aspect of heteroge- different heterogeneous CUs, additional runtime-based esti- neous resource optimization within traditional query opti- mation is needed, because even same cardinalities can lead mizations. to different runtimes on different CUs. For this estimation, Execution Engine. The presented optimizations can we proposed the Heterogeneity-aware Operator Placement be implemented in the database’s execution engine, being Model (HOP)[6], which is based on unassisted learning of applied directly before an operator’s execution. We im- execution time, using interpolation between known execu- plemented and evaluated such a system [7]. However, for tions. Additionally, data transfers and scenarios with yet this approach, global optimization is not possible due to the unknown execution times are considered. missing global view. Integrated. The optimizations could be deeply inte- 3.3 Optimization Granularity grated within the database optimizer. The optimizer has all The optimization granularity defines how much knowledge the global information for hardware optimization but it also is needed for the optimization. has a sophisticated optimization framework and strategies, A local strategy would decide the placement solely for where adding heterogeneous resource optimizations would one operator at a time. The chosen placement combines the increase the optimization complexity significantly. best combination of input data transfers and actual execu- Separate Optimization Stage. We propose a middle tion. For example, assuming the data lies in main memory, path: an additional stage of query optimization. As it is a GPU is only used if data transfer and execution is faster usually the case, the database system first optimizes the than the execution on the CPU, where data does not need query plan logically using query rewriting techniques. Then to be transfered. the physical query operators are defined in the physical op- A global strategy would look beyond one operator at the timization. Afterwards, the physically optimized plan is fur- whole query plan. There, transfer costs between different ther optimized for the heterogeneous resources in a separate operators can be included in the optimization, leading to stage. The main motivation for this approach is the sep- globally optimized executions and transfers, while the local aration of concerns, that each stage can optimize indepen- strategy does not optimize beyond one operator execution. dently, allowing simpler architectures, better maintenance, To apply global optimization, the system has to consider all and reduced search spaces. operators of a query (#op) and all CUs of the system (#cu), leading to a search space of #cu#op (for example 14 mio. 4.2 Heterogeneous Resource Optimization different placements for 15 operators and 3 CUs). To cope Within the separated optimization stage for heterogeneous with this huge search space, we developed ways to reduce the resources, we are applying our concepts in several steps. We number of considered operators together with a light-weight assume to get a fully logically and physically optimized plan greedy algorithm for efficient placement optimization [5]. from the prior optimization stages. Then, we apply the fol- We implemented both strategies in an OpenCL-based data- lowing steps, which are illustrated in Figure 2: base system and compared the performance [5]. While the 1. Split up the database operators into sub-operators (as placements of both strategies are different, the execution explained in Section 3.1). times do not differ much, because long-running influential 2. Apply data access information (as in Sec. 3.4). Multi- operators are placed on the same CUs for both strategies. ple sub-operators accessing the same data can choose However, we showed that global optimization is more ro- between replicas to potentially avoid data transfers bust for inconclusive decisions where multiple operators can and read-only operators can be executed independently, benefit from each others’ placement. therefore dependencies can be reordered (Fig. 2 (2)). A writer has to wait until previous readers have fin- 3.4 Data Handling ished before updating one replica and deleting others. Normally, data handling involves transferring data to the 3. Estimate both the possible execution time for each CU where it is needed if the data is not there already. sub-operator on each CU and the transfer costs be- To enhance this naive approach, we propose to improve tween CUs. These estimations are done locally for the data movement dependent on an operator’s data access one sub-operator or transfer at a time using our model type. This can be achieved by allowing replicas of memory presented in Section 3.2. For Example, Figure 2 (3) objects on different CUs, as long as data is only read. Then, shows only the sub-operators’ execution times. different operators can access replicas of data on different 4. Finally, having all the estimated runtimes and transfer CUs, allowing parallel execution and more freedom to find times, we apply global optimization (as in Section 3.3) the ideal operator placement without being limited by high to find the placement with the overall best runtime. transfer costs. However, when an operator is updating a After applying these four steps, the heterogeneous optimizer memory object, every replica, that is not updated, has to can pass an enhanced sub-operator-based query plan with be deleted to remain consistent. assigned placement decisions to the execution engine for het- erogeneous execution. 4. IDEAL SYSTEM SETUP 4.3 Evaluation (current progress) Having investigated the possible optimization concepts for To evaluate our optimization approach, we thought about heterogeneous computing resources, we now want to define rewriting the database optimizer of heterogeneity-supporting Figure 2: Query Optimization Steps during Heterogeneous Resource Optimization DBMS like Ocelot [4] and gpuDB [9]. However, this would tion. Placement granularity was discussed for gpuQP [3], only be an isolated system-specific analysis. To broaden the where placement is done on primitives, which then build scope of our evaluation, we decided to reuse the basic tech- larger query operators. This approach is similar to our sub- nology many of these DBMS use to support heterogeneous operator granularity. Ocelot [4] and gpuDB [9] work on hardware: OpenCL. We can intercept the OpenCL commu- query-granularity, where the CU is set manually for each nication of these systems to the OpenCL driver, optimize query. We do not have any detailed information about the the given query, and execute the work heterogeneously, de- optimization granularity or the exact integration level of op- pending on the available CUs. Technically, we do this by timization for these database systems. implementing our own OpenCL driver that is loaded by the database system. Using the driver approach, the database 6. CONCLUSION code does not need to be adjusted to support our optimiza- In this thesis, we investigated heterogeneity-aware query tions. However, implementing our optimization stage into optimization within database systems. We strongly moti- an industry-size database system is left for future work. vate our direction of query operator placement with a case study using one operator and multiple CUs. For operator placement, we investigated several concepts of optimization, 5. CONTRIBUTIONS explained possible options, and defined our approach. Fi- In this section, we would like to highlight the contributions nally, we propose an ideal system setup by defining an inte- of this thesis and differentiate them from related work. gration approach and the specific steps of the optimization We base our work on many previous publications includ- stage. Our approach is implemented using existing database ing full system approaches like Ocelot [4] and gpuDB [9]. systems and an OpenCL based extension approach. These systems currently rely on a manually-specified input to define the CU , on which the whole query is executed. With our optimizer approach, we can make these systems 7. ACKNOWLEDGMENTS heterogeneity-aware and of better performance without the This work is funded by the German Research Foundation need of manual inputs. Our contributions are in detail: (DFG) within the Cluster of Excellence “Center for Advanc- 1. Providing an overall investigation for hetero- ing Electronics Dresden”. Parts of the hardware were gen- geneity-aware query optimization. Related work in- erously provided by Dresden GPU Center of Excellence. cludes the heterogeneity-aware database systems CoGaDB[1] and gpuQP [3]. Both are no explicit query optimizers but 8. REFERENCES [1] S. Breß. The Design and Implementation of CoGaDB: A actual database systems. Both systems define the placement Column-oriented GPU-accelerated DBMS. Datenbank-Spektrum, of database operators, where the focus is more on the sys- 2014. tem design and the runtime estimation model, than on the [2] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, actual query optimization. and D. Burger. Dark silicon and the end of multicore scaling. ISCA 2011. ACM. 2. Proposing a novel decision model for runtime [3] B. He, M. Lu, K. Yang, R. Fang, N. K. Govindaraju, Q. Luo, based cost estimation. gpuQP [3] uses a cost per tuple and P. V. Sander. Relational Query Coprocessing on Graphics computation, which is fine tuned in a startup phase by micro Processors. ACM Trans. Database Syst., 2009. benchmarks. CoGaDB is using a learning-based approach [4] M. Heimel, M. Saecker, H. Pirk, S. Manegold, and V. Markl. Hardware-Oblivious Parallelism for In-Memory Column-Stores. with spline interpolation to compute runtime estimations. PVLDB, 2013. However, only our model, using learning-based estimation [5] T. Karnagel, D. Habich, and W. Lehner. Local vs. Global on learned data points, is able to represent fine-grained be- Optimization: Operator Placement Strategies in Heterogeneous Environments. In Proceedings of the Workshops of the havior as we have seen in Section 2.1. EDBT/ICDT, 2015. 3. Investigating global optimization together with [6] T. Karnagel, D. Habich, B. Schlegel, and W. Lehner. proposing a search space reduction approach and a Heterogeneity-aware Operator Placement in Column-Store well performing greedy algorithm. To our knowledge, DBMS. Datenbank-Spektrum, 2014. [7] T. Karnagel, M. Hille, M. Ludwig, D. Habich, W. Lehner, there is no related work on global query optimization for het- M. Heimel, and V. Markl. Demonstrating efficient query erogeneous computing units. The problem does not apply processing in heterogeneous environments. In Proceedings of the for well-known query optimizations, because every operator 2014 ACM SIGMOD, New York, NY, USA. ACM. can be placed independently without allowing any pruning [8] T. Karnagel, R. Müller, and G. M. Lohman. Optimizing GPU-accelerated Group-By and Aggregation. In ADMS’15. of possible solutions. [9] Y. Yuan, R. Lee, and X. Zhang. The Yin and Yang of Processing 4. Discussing approaches for placement granu- Data Warehousing Queries on GPU Devices. Proc. VLDB larity, optimization granularity, and system integra- Endow., 2013.