Benchmarking inter- and intra-operator parallelism on contemporary desktop hardware

© Kirill Smirnov, George Chernishev
Saint-Petersburg University
Kirill.k.smirnov@math.spbu.ru, Chernishev@gmail.com

Proceedings of the Spring Researcher's Colloquium on Database and Information Systems, Moscow, Russia, 2011

Abstract

In this paper we explore the effects of threading on inter- and intra-operator parallelism in a distributed database system. We review several well-known join techniques and evaluate them in a multithreaded environment using our prototype of a distributed database system and a variety of workloads. The motivation for this study of the classical algorithms is the emergence of new, readily available equipment, in particular multicore processors capable of running several threads concurrently.

1 Introduction

Distributed database systems (DDBS) appeared more than three decades ago, and since that time there has been constant demand for DDBS technologies. The driving force of this demand is the ever-growing volume of data which requires processing. The scalability of distributed databases promises to handle this mass of data. The present day demands new effective algorithms, models and architectural solutions, so the field remains active and relevant.

In the early 90s a substantial portion of research effort was dedicated to ascertaining how the work (more precisely, a query execution plan or a part of it) should be divided between processing elements. Extensive experiments were conducted regarding query plan execution techniques, and a pool of knowledge was formed.

Despite these experiments, the same questions have nowadays become relevant again, for the following reasons:
1) A shortage of research concerning the utilization of threads and processes. Research papers usually take one of the following distinct attitudes: use one process/thread per processing element, allocate a much higher number of threads than processing elements [5], or not touch this issue in detail.
2) The advancement of hardware: a few generations of CPUs have been developed. The improvements were not just clock speed and cache size growth, but also more sophisticated ones; notable examples are multithreading and multicore processing.
3) The advancement of software, mainly the thread/process scheduling software involved.

Items two and three alone can alter or invalidate the principles of thread usage in a database system. The goal of this study is to determine how we should utilize threads when designing a distributed database which runs on modern hardware. We re-evaluate a few classical approaches to processing joins in a distributed database, taking the threading aspect into consideration. There are two ideas being considered, intra- and inter-operator parallelism, and their combinations. The first concerns how we can divide the work inside one operator, treating our cores as separate processing elements. The latter covers executing several operators in different threads.

This paper is organized as follows: in section 2 we briefly review some crucial aspects of DDBS related to our study; in section 3 we present our prototype DDBS and provide some insights into implementation details. Section 4 reviews several possible ways of using threads to parallelize the join operation in our system. Sections 5 and 6 describe our experiments and interpret their results, respectively.

2 Related work and DDBS technology overview

2.1 Overview of important technology concepts of DDBS

Extensive research in this field has been conducted for decades, and a lot of knowledge has been accumulated. In this section we review some part of it (at least by giving definitions), which will be referred to in this study.

The first aspect to consider when designing a distributed database is its architecture. Here we mean the hardware architecture: how the data is distributed, stored and processed on the lowest level.
The most popular architectures are shared-nothing, shared-disk and shared-memory [5]. This choice impacts all further decisions regarding all constituent parts of the design.

The second aspect is the system's processing model. There are two basic alternatives and a lot of different combinations. The choice is between pipelined execution and materialized execution. The former is the policy in which a tuple passes through all operators of the query tree without "stops". The latter approach means that some "stops" are allowed. A stop is essentially a blocking operator [1], i.e. an operator which requires all tuples in order to carry out its task. The sort operator is blocking, while a predicate filtering operator is non-blocking. Pipelined execution has a huge benefit over the materialization-based one, namely that all operations can be done in memory, whereas materialization often requires extensive disk operations, caused by an intermediate result being too large to fit into memory.

Another important aspect to consider is the tuple routing mechanism. There are a few approaches: either each tuple passes the operators in the same order, or not. The latter is known as adaptive execution [1]. The next major choice is between data-driven (also called producer-consumer) and demand-driven (also known as workflow) execution. Demand-driven execution works by "pumping" tuples from the leaves of the tree to the root, through the operators involved. In this scheme, each operator of a higher rank sends a notification request to its children to get tuples; consequently, this request propagates to the leaf nodes, which usually deal with disk-based operations. The data-driven approach is the reverse: the leaf nodes generate an intermediate result, notify their parents and submit the result. The data-driven approach is the more appealing one in the case of a distributed database system, because it has great potential for parallelization. Also, it offers more freedom and convenience in implementing leaf operators. However, this approach requires great care in programming, due to synchronization and work distribution issues.

Yet another pair of aspects to consider when designing a distributed database system is the operator parallelization methods. Considering a single query optimization task, one can name the following ways: inter- and intra-operator parallelism. Intra-operator parallelism refers to the parallel execution of a single operator, and inter-operator parallelism means executing multiple operators in parallel. The first relies heavily on data partitioning techniques, while the latter is more architecture-dependent.

2.2 Threads and processes in databases

Threading and processing aspects are of critical importance and should be taken into consideration when building a high-performance database engine. However, there is a shortage of studies related to threading and its impact on operator parallelism. With the increasing popularity of multicore processors, the demand for a reevaluation of these threading techniques has risen greatly. Let's briefly review threading in databases.

One of the first and most prominent database systems to implement threading was Volcano [4]. The following conclusions were presented [4]: it is beneficial to use multiple threads running concurrently on the same processor. However, running each operator in its own dedicated thread is unfeasible: synchronization and thread switching will eat up all the benefits. The same results are presented in [2], where another important question is also mentioned: how much memory we should provide to a thread. This question affects the number of threads which can be run effectively. In [1] adaptive query processing is considered, and it is shown that each adaptive symmetric hash join operator can work effectively in a separate thread.

3 Our system

The experiments described in this work were conducted using our prototype of a distributed query engine, developed by the authors. This system is devised for simple SPJ conjunctive queries [1] and is capable of serving a substantial number of clients simultaneously. The architecture of the system is depicted in Pic. 1; it is essentially a shared-nothing design with a number of processing nodes distributed over a network. We partition nodes into two types: master and slave. There is only one master node per setup, and it is responsible for handling incoming queries. For the purposes of this paper's experiments we ran the slave processes on the same computer.

Our system is based on the assumption that data is pre-distributed and resides on the slave nodes. The data is a collection of relations, where each attribute may be either an integer value or a string. Each relation is partitioned horizontally and may be replicated.

The overall design of the system resembles the classical Volcano [2], where a QEP (query execution plan) is defined as a tree of various operators, and candidate tuples have to pass through them in order for a result to be produced. In our system there are the following operators: selection (actually consisting of both selection and projection), join and cross-product. Details regarding our implementation of the join operator can be found in part 4, but it is essential to note that the overall architecture was designed for non-blocking operations, so the join operator should be non-blocking too.

Picture 1: architecture

The QEP generation procedure is also very unpretentious and is based entirely on heuristics, which resemble the classical ones described in [3]. The plan generation doesn't affect our benchmarks much, because we use very specific loads (described in part 5) to assess the performance.

Some details regarding the implementation can be found in [6].

To conclude, we should mention that despite its simplicity, our system is sufficient for use in our benchmarks to reach our goal.

4 Considered approaches

We considered the following approaches to incorporating threading into join processing.

Our initial algorithm was essentially the nested loop join algorithm [2], modified to cope with networking issues. The main difference is the bulk processing technique, which provides bufferization to lessen the communication burden (as opposed to per-record processing). When both sets of tuples are received, we sort one of them and use the second as the probing one. Also, we employ in-operation caching to lessen the network stress further. This caching works as follows: if one of the participating relations is small enough to fit in memory, it is put into an inner operation cache. This baseline method works inside a single thread (all joins and cross-products work inside one thread), Pic. 2.

Picture 2: initial join design

To introduce parallelism we consider a technique which employs a number of threads for data processing. The core idea is to partition the data supplied by one of the child nodes, while leaving the other one intact, and then use different threads to process it (shown in Pic. 3).

Also, we consider a hash join operator. In this case, the processing of a join operation is separated into two phases: a build phase and a probe phase. During the build phase, the table chosen as the build one is used to build a hash table. The next phase is then essentially a traversal of the probe table to locate the matches. The evident drawback of this technique is a temporary "freeze" of the system. This is caused by the build stage, i.e. the dependent operators have to wait until the build phase is finished. This drawback is a critical issue, because it breaks [1] pipelined execution; in other words, this join operator becomes a blocking operator. If we are dealing with a QEP of sufficient depth, the blocking operators can become the source of a major slowdown in the processing of the whole query. Also, this blocking especially adversely affects inter-operator parallelism (threads will wait for tuples to arrive). Thus, blocking operations should be taken into consideration.

To parallelize this join we employ a similar technique: we partition the build relation into a few fragments, and each fragment is processed inside its own dedicated thread.

Picture 3: join parallelization (S is scattered among threads 1…N, each computing R JOIN Si, and the results are gathered: R JOIN S1, R JOIN S2, …, R JOIN SN, where S1 ∪ S2 ∪ … ∪ SN = S)

To overcome the blocking nature of the previous approach we consider a symmetric hash join operator [1]. The core idea is to have two join nodes which have different build relations. The processing happens as follows: given a tuple from relation R, we probe it against the build relation of the S join operator and at the same time use it in the build relation of the R join operator. Likewise, given a tuple from relation S, we use it to probe in the R operator and build in the S operator. This join is correct due to the fact that if a tuple arrives too early (i.e. no relevant tuples from the other relation have been built yet), it will be probed later, because it is built in the other join operator. This type of hash join operator is non-blocking. However, this advantage comes at a price: additional memory and time constraints (twice as much as a simple hash join) and more sophisticated programming techniques are required.

Picture 4: symmetric hash join operator design (RJOIN: S probed, R put; SJOIN: R probed, S put)

To implement data partitioning in our system we introduced two new operators (much like the exchange operator in the Volcano system), for distribution and collection of the results. These operators are inserted into the QEP around the join operator which is going to be parallelized. We call helper threads the threads which hold the additional join nodes that appear as the result of our parallelization.

5 Experiments

5.1 Parameters used in experiments

We implemented the proposed approaches and examined them using our system on a variety of workloads. First we characterize the data used in our experiments:
1) The size of the tables is 150Mb (first type of workload)
2) The selectivity of the join predicate is 0.1%
3) Each table contains three attributes: a primary key, an integer attribute used for joining, and a predicate attribute which defines selectivity. We used only 4-byte integer data in our experiments, and it is uniformly distributed.

The processing details:
1) Number of helper threads per processing (QEP) node: 1, 2, 5, 10
2) Methods (node types) of join:
   a. Hash join
   b. Block nested loop (sort)
   c. Block multithreaded nested loop (sort)
   d. Block multithreaded hash join
   e. Symmetric hash join
3) Operation cache (also the size of the bulk which is transferred from one QEP node to another): 2, 5, 10 MB
4) Join operators use the results of a sequential scan over the table (no index is used)

Query type details:
1) Query to evaluate intra-operator parallelism: a single join of two tables
2) Query to evaluate inter-operator parallelism: a QEP which contains 2 joins and 3 tables

5.2 Hardware and software

The following setup was used: an Intel(R) Pentium(R) D CPU 3.00GHz with 3Gb RAM, running GNU/Linux 2.6.35.10 x86; we used gcc 4.5.2. We implemented our threads using standard POSIX threads (NPTL); threads communicate via a chunk of shared memory. Notifications are implemented using POSIX pipes along with the select call. For hash joins we used the C++0x unordered_map STL container.

5.3 Experiments performed

Each of the considered experiments was repeated several times; a high spread of the measured values was observed, and average values were used. This is especially important for the multithreaded algorithms, due to erratic thread scheduler behaviour.

We evaluated our algorithms on two distinct groups of workloads, namely workloads which evaluate intra-operator parallelism and inter-operator parallelism. The first group evaluates our algorithms using a QEP which contains only one join operation, fueled by two scan operations. Table 1 contains the summarized results over all runs in this group of experiments (here we measured execution time in milliseconds). Detailed results are presented in histograms 1-3, where each histogram shows the results for a fixed cache size.

In this experiment we varied the following parameters: operation cache and number of helper threads. The last six methods in the table are parallelized join methods (for thread counts of 2, 5 and 10, respectively) of two distinct kinds: sort and hash. The sort method is a plain block nested loop, where one relation's block is sorted before being probed by the other.

The Block NL Hash MT version of the operator is constructed in such a way that the hash relation is always hashed and fully resides in memory, so it can be considered a true hash join.

In the multithreaded tests we observed several modes, all but one insignificant. This phenomenon can be explained by system thread scheduler fluctuations. Table 1 presents the main mode value.

Method / Cache size     2Mb      5Mb      10Mb
Hash Join               50,698   50,773   56,600
Block NL Sort           57,502   56,567   57,452
Symmetric HJ            67,608   66,982   68,165
Block NL Hash MT 2      43,969   42,185   41,615
Block NL Sort MT 2      44,800   43,558   42,437
Block NL Hash MT 5      44,688   42,613   41,685
Block NL Sort MT 5      46,646   44,446   47,910
Block NL Hash MT 10     44,281   42,235   42,385
Block NL Sort MT 10     45,844   43,563   43,132

Table 1: Intra-operation parallelism (execution time, ms)

Histogram 1: 2 Mb cache
Histogram 2: 5 Mb cache
Histogram 3: 10 Mb cache

We can draw the following conclusions:
1) Parallelization does help to speed up join processing on modern hardware, under several conditions. Our initial experiments were conducted with a bulk size of 1Mb, and the parallelized joins showed worse results than the single-threaded ones.
2) Increasing the bulk size does increase performance, up to some point. Our assumption is the following: there is a tradeoff between the number of data transfers and the degree of thread utilization, i.e. by increasing the amount of work per transfer one may ensure that communication does not happen too often, but we have to communicate frequently enough in order to have the benefits of parallelism and pipelining.
3) Increasing the number of threads further does not improve performance on a dual-core CPU.
4) Examining the single-threaded join methods (the first three operators mentioned) showed us the following:
   a. Hash join is better than nested loop
   b. Symmetric hash join is actually worse than nested loop
5) Other results (not shown here) state the following: hash-based parallel algorithms are a bit worse than sort-based ones on a workload which converges to a cross-product, and vice versa (a join which has both tables filtered on the same field and value). Our hypothesis is that this behavior could be attributed to our implementation of the sort-based algorithm (we implemented a quicksort algorithm), which is tightly coupled with pointers for efficiency, and to the large number of collisions which arise under this workload.

Another group of experiments we considered tested a QEP containing 2 join nodes. The goal was to run each join in a separate thread. We restricted ourselves to having both join operators of the same type. The results of these experiments were the following:
1) There are not enough intermediate results from the first join to load the second one in such a way as to see different results, so it is unfeasible to use the original dataset to evaluate these approaches. To cope with this problem we raised the scan selectivity 5 times.
2) Under the new conditions we obtained the following results:
   a. The difference between the non-threaded hash join and the symmetric hash join is about 5% (hash join is better). The overall performance of the QEP increased over the same case of the first group (8% loss). We presume that this happens due to the enabling of pipelining.
   b. The threaded versions are 2.5 times worse than the single-threaded hash join at the smallest buffer size (2MB). Unfortunately, it is now unclear whether our programming technique was sufficient to implement this threading correctly, or whether hardware processing specifics guarantee these bad results (different operators cause constant cache resets). We can say for sure that it is not a thread scheduler problem, because adding more threads does not have any effect.
   c. Bulk size has a huge impact on performance in this case also; it becomes even more evident here. Increasing the bulk size increases the performance of threaded joins almost 2 times (single-threaded ones gain approximately 10%), but threaded joins still lose to single-threaded ones.

6 Conclusions and future work

We presented the results of an evaluation of a set of join algorithms. The results imply that thread usage is beneficial for join processing and can give up to a 50% increase in performance on simple queries. However, these approaches require extra programming effort and extensive parameter tuning, and they rely heavily on the thread scheduler. Symmetric hash join holds promise for deep QEPs over threaded joins on such hardware.

This is a relatively small piece of work for this vast and evolving topic. A lot of aspects require our attention, and they haven't been touched at all in this paper.
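As an illustration of the two-phase hash join discussed in section 4, the build and probe phases can be sketched in C++ as follows. This is a minimal in-memory sketch under our own illustrative naming (Tuple, Relation, hash_join are not the system's actual types); the actual operator additionally buffers tuples in bulks and caches small relations.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// A tuple is (join key, payload); a relation is a plain vector of tuples.
using Tuple = std::pair<int, std::string>;
using Relation = std::vector<Tuple>;

// Classic two-phase hash join: build a multimap over the build relation,
// then probe it with every tuple of the probe relation. The build phase
// blocks: no result can be emitted before it finishes.
std::vector<std::pair<Tuple, Tuple>> hash_join(const Relation& build,
                                               const Relation& probe) {
    std::unordered_multimap<int, Tuple> table;      // build phase
    for (const Tuple& t : build) table.emplace(t.first, t);

    std::vector<std::pair<Tuple, Tuple>> result;    // probe phase
    for (const Tuple& t : probe) {
        auto range = table.equal_range(t.first);
        for (auto it = range.first; it != range.second; ++it)
            result.emplace_back(it->second, t);
    }
    return result;
}
```

The "freeze" described in the text corresponds to the first loop: every dependent operator must wait until the whole build relation has been consumed.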
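The symmetric hash join of section 4 can be sketched in the same style: each arriving tuple is first probed against the other input's hash table and then inserted into its own, so no build phase is needed. This is again our own minimal sketch (struct and method names are illustrative), not the prototype's implementation.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using Tuple = std::pair<int, std::string>;  // (join key, payload)

// Symmetric hash join: two hash tables, one per input. A tuple that
// arrives "too early" still joins later, when its match probes the
// table it was inserted into. Hence the operator is non-blocking, at
// the cost of keeping both tables in memory.
struct SymmetricHashJoin {
    std::unordered_multimap<int, Tuple> r_table, s_table;
    std::vector<std::pair<Tuple, Tuple>> output;

    void on_r(const Tuple& t) {             // tuple arrives from R
        auto range = s_table.equal_range(t.first);
        for (auto it = range.first; it != range.second; ++it)
            output.emplace_back(t, it->second);
        r_table.emplace(t.first, t);
    }
    void on_s(const Tuple& t) {             // tuple arrives from S
        auto range = r_table.equal_range(t.first);
        for (auto it = range.first; it != range.second; ++it)
            output.emplace_back(it->second, t);
        s_table.emplace(t.first, t);
    }
};
```

Results can be emitted as soon as any matching pair has been seen, regardless of the arrival order, which is exactly what makes this operator suitable for pipelined, inter-operator-parallel plans.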
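Finally, the scatter/gather parallelization of Pic. 3 (partition S into N fragments, join R with each fragment in its own helper thread, and union the results) can be sketched as follows. For brevity this sketch uses std::thread instead of the POSIX threads used in our prototype, and a plain nested loop per fragment; all names are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <utility>
#include <vector>

using Row = std::pair<int, int>;            // (join key, payload)
using Rel = std::vector<Row>;

// One helper thread's work: a plain nested-loop join of R with one
// fragment of S, written into a thread-private output buffer.
static void join_fragment(const Rel& r, const Rel& s_frag,
                          std::vector<std::pair<Row, Row>>& out) {
    for (const Row& a : r)
        for (const Row& b : s_frag)
            if (a.first == b.first) out.emplace_back(a, b);
}

// Scatter/gather: R JOIN S = (R JOIN S1) U ... U (R JOIN SN),
// where S1 U ... U SN = S.
std::vector<std::pair<Row, Row>> scatter_gather_join(const Rel& r,
                                                     const Rel& s,
                                                     unsigned n_threads) {
    std::vector<Rel> frags(n_threads);      // scatter S round-robin
    for (std::size_t i = 0; i < s.size(); ++i)
        frags[i % n_threads].push_back(s[i]);

    std::vector<std::vector<std::pair<Row, Row>>> partial(n_threads);
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < n_threads; ++i)  // one helper thread per fragment
        workers.emplace_back(join_fragment, std::cref(r),
                             std::cref(frags[i]), std::ref(partial[i]));
    for (auto& w : workers) w.join();

    std::vector<std::pair<Row, Row>> result;  // gather
    for (auto& p : partial)
        result.insert(result.end(), p.begin(), p.end());
    return result;
}
```

Thread-private output buffers keep the helper threads synchronization-free until the final gather, mirroring the role of the scatter and gather operators inserted around the join node in our QEPs.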
Future work will include parallelization of the symmetric hash join algorithm, deeper join trees for a more thorough analysis of the effects of threading on pipelining, a proper evaluation of scale-up, and possibly an evaluation on more capable equipment.

References

[1] Amol Deshpande, Zachary Ives, and Vijayshankar Raman. 2007. Adaptive query processing. Foundations and Trends in Databases 1, 1 (January 2007), 1-140.
[2] Goetz Graefe. 1993. Query evaluation techniques for large databases. ACM Comput. Surv. 25, 2 (June 1993), 73-169.
[3] P. Griffiths Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. 1979. Access path selection in a relational database management system. In Proceedings of the 1979 ACM SIGMOD International Conference on Management of Data (SIGMOD '79). ACM, New York, NY, USA, 23-34.
[4] Donald Kossmann. 2000. The state of the art in distributed query processing. ACM Comput. Surv. 32, 4 (December 2000), 422-469.
[5] M. Tamer Özsu and Patrick Valduriez. 1999. Principles of Distributed Database Systems (2nd Ed.). Prentice-Hall, Inc., Upper Saddle River, NJ, USA.
[6] Kirill Smirnov and George Chernishev. 2011. Networking and multithreading architectural aspects of distributed DBMS (in Russian). Software and Systems 1(93) (March 2011), 164-168.