Benchmarking inter- and intra-operator parallelism on contemporary desktop hardware

© Kirill Smirnov, George Chernishev
Saint-Petersburg University
Kirill.k.smirnov@math.spbu.ru, Chernishev@gmail.com

Proceedings of the Spring Researcher's Colloquium on Database and Information Systems, Moscow, Russia, 2011

Abstract

In this paper we explore the effects of threading on inter- and intra-operator parallelism in a distributed database system. We review several well-known join techniques and evaluate them in a multithreaded environment using our prototype of a distributed database system and a variety of workloads. The motivation for this study of the classical algorithms is the emergence of new, readily available equipment, in particular multicore processors capable of running several threads concurrently.

1 Introduction

Distributed database systems (DDBS) appeared more than three decades ago, and since that time there has been constant demand for DDBS technologies. The driving force of this demand is the ever-growing volume of data which requires processing. The scalability of distributed databases promises to handle this mass of data. The present day demands new effective algorithms, models and architectural solutions, so the field remains active and relevant.

In the early 90s a substantial portion of research effort was dedicated to ascertaining how the work (more precisely, a query execution plan or a part of it) should be divided between processing elements. Extensive experiments were conducted regarding query plan execution techniques, and a pool of knowledge was formed.

Despite these experiments, the same questions have nowadays become relevant again, for the following reasons:
1) A shortage of research concerning the utilization of threads and processes. Research papers usually take one of the following distinct attitudes: use one process/thread per processing element, allocate a much higher number of threads than processing elements [5], or not touch this issue in detail.
2) The advancement of hardware: a few generations of CPUs have been developed. The improvements were not just clock speed and cache size growth, but also more sophisticated ones; notable examples are multithreading and multicore processing.
3) The advancement of software, mainly the thread/process scheduling software involved.

Items two and three alone can alter or invalidate the principles of thread usage in a database system. The goal of this study is to determine how we should utilize threads when designing a distributed database which runs on modern hardware. We re-evaluate a few classical approaches to processing joins in a distributed database, taking the threading aspect into consideration. There are two ideas being considered, intra- and inter-operator parallelism, and their combinations. The first concerns how we can divide the work inside one operator, treating our cores as separate processing elements. The latter covers executing several operators in different threads.

This paper is organized as follows: in section 2 we briefly review some crucial aspects of DDBS related to our study; in section 3 we present our prototype DDBS and provide some insights into implementation details. Section 4 reviews several possible ways of using threads to parallelize the join operation in our system. Sections 5 and 6 describe our experiments and interpret their results, respectively.

2 Related work and DDBS technology overview

2.1 Overview of important technology concepts of DDBS

Extensive research in this field has been conducted for decades, and a lot of knowledge has been accumulated. In this section we review some part of it (at least by giving definitions), which will be referred to in this study.

The first aspect to consider when designing a distributed database is its architecture. Here we mean the hardware architecture: how the data is distributed, stored and processed on the lowest level.
The most popular architectures are shared-nothing, shared-disk and shared-memory [5]. This choice impacts all further decisions regarding all constituent parts of the design.

The second aspect is the system's processing model. There are two basic alternatives and a lot of different combinations. The choice is between pipelined execution and materialized execution. The former is the policy in which a tuple passes through all operators of the query tree without "stops". The latter approach means that some "stops" are allowed. A stop is essentially a blocking operator [1], i.e. an operator which requires all tuples in order to carry out its task. The sort operator is blocking, while a predicate filtering operator is non-blocking. Pipelined execution has a huge benefit over the materialization-based one, namely that all operations can be done in memory, whereas materialization often requires extensive disk operations, caused by an intermediate result being too large to fit into memory.

Another important aspect to consider is the tuple routing mechanism. There are a few approaches: either each tuple passes the operators in the same order, or not. The latter is known as adaptive execution [1]. The next major choice is between data-driven (also called producer-consumer) and demand-driven (also known as workflow) execution. Demand-driven execution works by "pumping" tuples from the leaves of the tree to the root, through the operators involved. In this scheme, each operator of a higher rank sends a notification request to its children to get tuples; consequently, this request propagates to the leaf nodes, which usually deal with disk-based operations. The data-driven approach is the reverse: the leaf nodes generate an intermediate result, notify their parents and submit the result. The data-driven approach is the more appealing one in the case of a distributed database system, because it has great potential for parallelization. Also, it offers more freedom and convenience in implementing leaf operators. However, this approach requires great care in programming, due to synchronization and work distribution issues.

Yet another pair of aspects to consider when designing a distributed database system is the operator parallelization methods. Considering a single query optimization task, one can name the following ways: inter- and intra-operator parallelism. Intra-operator parallelism refers to the parallel execution of a single operator, and inter-operator parallelism means executing multiple operators in parallel. The first relies heavily on data partitioning techniques, while the latter is more architecture-dependent.

2.2 Threads and processes in databases

Threading and processing aspects are of critical importance and should be taken into consideration when building a high-performance database engine. However, there is a shortage of studies related to threading and its impact on operator parallelism. With the increasing popularity of multicore processors, the demand for a reevaluation of these threading techniques has risen greatly. Let's briefly review threading in databases.

One of the first and most prominent database systems to implement threading was Volcano [4]. The following conclusions were presented [4]: it is beneficial to use multiple threads running concurrently on the same processor. However, running each operator in its own dedicated thread is unfeasible: synchronization and thread switching will eat up all the benefits. The same results are presented in [2], where another important question is also mentioned: how much memory we should provide to a thread. This question affects the number of threads which can be run effectively. In [1] adaptive query processing is considered, and it is shown that each adaptive symmetric hash join operator can work effectively in a separate thread.

3 Our system

The experiments described in this work were conducted using our prototype of a distributed query engine, developed by the authors. This system is devised for simple SPJ conjunctive queries [1] and is capable of serving a substantial number of clients simultaneously. The architecture of the system is depicted in Pic. 1; it is essentially a shared-nothing design with a number of processing nodes distributed over a network. We partition nodes into two types: master and slave. There is only one master node per setup, and it is responsible for handling incoming queries. For the purposes of this paper's experiments we ran the slave processes on the same computer.

Our system is based on the assumption that data is pre-distributed and resides on the slave nodes. The data is a collection of relations, where each attribute may be either an integer value or a string. Each relation is partitioned horizontally and may be replicated.

The overall design of the system resembles the classical Volcano [2], where a QEP (query execution plan) is defined as a tree of various operators, and candidate tuples have to pass through them in order for a result to be produced. In our system there are the following operators: selection (actually consisting of both selection and projection), join and cross-product. Details regarding our implementation of the join operator can be found in part 4, but it is essential to note that the overall architecture was designed for non-blocking operations, so the join operator should be non-blocking too.

Picture 1: architecture

The QEP generation procedure is also very unpretentious and is based entirely on heuristics, which resemble the classical ones described in [3]. The plan generation doesn't affect our benchmarks much, because we use very specific loads (described in part 5) to assess the performance.

Some details regarding the implementation can be found in [6].

To conclude, we should mention that despite its simplicity, our system is sufficient for use in our benchmarks to reach our goal.

4 Considered approaches

We considered the following approaches to incorporating threading into join processing.

Our initial algorithm was essentially the nested loop join algorithm [2], modified to cope with networking issues. The main difference is the bulk processing technique, which provides bufferization to lessen the communication burden (as opposed to per-record processing). When both sets of tuples are received, we sort one of them and use the second as the probing one. Also, we employ in-operation caching to lessen the network stress further. This caching works as follows: if one of the participating relations is small enough to fit in memory, it is put into an inner operation cache. This baseline method works inside a single thread (all joins and cross-products work inside one thread), Pic. 2.

Picture 2: initial join design

To introduce parallelism we consider a technique which employs a number of threads for data processing. The core idea is to partition the data supplied by one of the child nodes, while leaving the other one intact, and then use different threads to process it (shown in Pic. 3).

Also, we consider a hash join operator. In this case, the processing of a join operation is separated into two phases: a build phase and a probe phase. During the build phase, the table chosen as the build one is used to build a hash table. The next phase is then essentially a traversal of the probe table to locate the matches. The evident drawback of this technique is a temporary "freeze" of the system. This is caused by the build stage, i.e. the dependent operators have to wait until the build phase is finished. This drawback is a critical issue, because it breaks [1] pipelined execution; in other words, this join operator becomes a blocking operator. If we are dealing with a QEP of sufficient depth, the blocking operators can become the source of a major slowdown in the processing of the whole query. Also, this blocking especially adversely affects inter-operator parallelism (threads will wait for tuples to arrive). Thus, blocking operations should be taken into consideration.

To parallelize this join we employ a similar technique: we partition the build relation into a few fragments, and each fragment is processed inside its own dedicated thread.

Picture 3: join parallelization (S is scattered among threads 1…N, each computing R JOIN Si, and the results are gathered: R JOIN S1, R JOIN S2, …, R JOIN SN, where S1 ∪ S2 ∪ … ∪ SN = S)

To overcome the blocking nature of the previous approach we consider a symmetric hash join operator [1]. The core idea is to have two join nodes which have different build relations. The processing happens as follows: given a tuple from relation R, we probe it against the build relation of the S join operator and at the same time use it in the build relation of the R join operator. Likewise, given a tuple from relation S, we use it to probe in the R operator and build in the S operator. This join is correct due to the fact that if a tuple arrives too early (i.e. no relevant tuples from the other relation have been built yet), it will be probed later, because it is built in the other join operator. This type of hash join operator is non-blocking. However, this advantage comes at a price: additional memory and time constraints (twice as much as a simple hash join) and more sophisticated programming techniques are required.

Picture 4: symmetric hash join operator design (RJOIN: S probed, R put; SJOIN: R probed, S put)

To implement data partitioning in our system we introduced two new operators (much like the exchange operator in the Volcano system), for distribution and collection of the results. These operators are inserted into the QEP around the join operator which is going to be parallelized. We call helper threads the threads which hold the additional join nodes that appear as the result of our parallelization.

5 Experiments

5.1 Parameters used in experiments

We implemented the proposed approaches and examined them using our system on a variety of workloads. First we characterize the data used in our experiments:
1) The size of the tables is 150Mb (first type of workload)
2) The selectivity of the join predicate is 0.1%
3) Each table contains three attributes: a primary key, an integer attribute used for joining, and a predicate attribute which defines selectivity. We used only 4-byte integer data in our experiments, and it is uniformly distributed.

The processing details:
1) Number of helper threads per processing (QEP) node: 1, 2, 5, 10
2) Methods (node types) of join:
   a. Hash join
   b. Block nested loop (sort)
   c. Block multithreaded nested loop (sort)
   d. Block multithreaded hash join
   e. Symmetric hash join
3) Operation cache (also the size of the bulk which is transferred from one QEP node to another): 2, 5, 10 MB
4) Join operators use the results of a sequential scan over the table (no index is used)

Query type details:
1) Query to evaluate intra-operator parallelism: a single join of two tables
2) Query to evaluate inter-operator parallelism: a QEP which contains 2 joins and 3 tables

5.2 Hardware and software

The following setup was used: an Intel(R) Pentium(R) D CPU 3.00GHz with 3Gb RAM, running GNU/Linux 2.6.35.10 x86; we used gcc 4.5.2. We implemented our threads using standard POSIX threads (NPTL); threads communicate via a chunk of shared memory. Notifications are implemented using POSIX pipes along with the select call. For hash joins we used the C++0x unordered_map STL container.

5.3 Experiments performed

Each of the considered experiments was repeated several times; a high spread of the measured values was observed, and average values were used. This is especially important for the multithreaded algorithms, due to erratic thread scheduler behaviour.

We evaluated our algorithms on two distinct groups of workloads, namely workloads which evaluate intra-operator parallelism and inter-operator parallelism. The first group evaluates our algorithms using a QEP which contains only one join operation, fueled by two scan operations. Table 1 contains the summarized results over all runs in this group of experiments (here we measured execution time in milliseconds). Detailed results are presented in histograms 1-3, where each histogram shows the results for a fixed cache size.

In this experiment we varied the following parameters: operation cache and number of helper threads. The last six methods in the table are parallelized join methods (for thread counts of 2, 5 and 10, respectively) of two distinct kinds: sort and hash. The sort method is a plain block nested loop, where one relation's block is sorted before being probed by the other.

The Block NL Hash MT version of the operator is constructed in such a way that the hash relation is always hashed and fully resides in memory, so it can be considered a true hash join.

In the multithreaded tests we observed several modes, all but one insignificant. This phenomenon can be explained by system thread scheduler fluctuations. Table 1 presents the main mode value.

Method / Cache size     2Mb      5Mb      10Mb
Hash Join               50,698   50,773   56,600
Block NL Sort           57,502   56,567   57,452
Symmetric HJ            67,608   66,982   68,165
Block NL Hash MT 2      43,969   42,185   41,615
Block NL Sort MT 2      44,800   43,558   42,437
Block NL Hash MT 5      44,688   42,613   41,685
Block NL Sort MT 5      46,646   44,446   47,910
Block NL Hash MT 10     44,281   42,235   42,385
Block NL Sort MT 10     45,844   43,563   43,132

Table 1: Intra-operation parallelism (execution time, ms)

Histogram 1: 2 Mb cache
Histogram 2: 5 Mb cache
Histogram 3: 10 Mb cache

We can draw the following conclusions:
1) Parallelization does help to speed up join processing on modern hardware, under several conditions. Our initial experiments were conducted with a bulk size of 1Mb, and the parallelized joins showed worse results than the single-threaded ones.
2) Increasing the bulk size does increase performance, up to some point. Our assumption is the following: there is a tradeoff between the number of data transfers and the degree of thread utilization, i.e. by increasing the amount of work per transfer one may ensure that communication does not happen too often, but we have to communicate frequently enough in order to have the benefits of parallelism and pipelining.
3) Increasing the number of threads further does not improve performance on a dual-core CPU.
4) Examining the single-threaded join methods (the first three operators mentioned) showed us the following:
   a. Hash join is better than nested loop
   b. Symmetric hash join is actually worse than nested loop
5) Other results (not shown here) state the following: hash-based parallel algorithms are a bit worse than sort-based ones on a workload which converges to a cross-product, and vice versa (a join which has both tables filtered on the same field and value). Our hypothesis is that this behavior could be attributed to our implementation of the sort-based algorithm (we implemented a quicksort algorithm), which is tightly coupled with pointers for efficiency, and to the large number of collisions which arise under this workload.

Another group of experiments we considered tested a QEP containing 2 join nodes. The goal was to run each join in a separate thread. We restricted ourselves to having both join operators of the same type. The results of these experiments were the following:
1) There are not enough intermediate results from the first join to load the second one in such a way as to see different results, so it is unfeasible to use the original dataset to evaluate these approaches. To cope with this problem we raised the scan selectivity 5 times.
2) Under the new conditions we obtained the following results:
   a. The difference between the non-threaded hash join and the symmetric hash join is about 5% (hash join is better). The overall performance of the QEP increased over the same case of the first group (8% loss). We presume that this happens due to the enabling of pipelining.
   b. The threaded versions are 2.5 times worse than the single-threaded hash join at the smallest buffer size (2MB). Unfortunately, it is now unclear whether our programming technique was sufficient to implement this threading correctly, or whether hardware processing specifics guarantee these bad results (different operators cause constant cache resets). We can say for sure that it is not a thread scheduler problem, because adding more threads does not have any effect.
   c. Bulk size has a huge impact on performance in this case also; it becomes even more evident here. Increasing the bulk size increases the performance of threaded joins almost 2 times (single-threaded ones gain approximately 10%), but threaded joins still lose to single-threaded ones.

6 Conclusions and future work

We presented the results of an evaluation of a set of join algorithms. The results imply that thread usage is beneficial for join processing and can give up to a 50% increase in performance on simple queries. However, these approaches require extra programming effort and extensive parameter tuning, and they rely heavily on the thread scheduler. Symmetric hash join holds promise for deep QEPs over threaded joins on such hardware.

This is a relatively small piece of work for this vast and evolving topic. A lot of aspects require our attention, and they haven't been touched at all in this paper.
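As an illustration of the two-phase hash join discussed in section 4, the build and probe phases can be sketched in C++ as follows. This is a minimal in-memory sketch under our own illustrative naming (Tuple, Relation, hash_join are not the system's actual types); the actual operator additionally buffers tuples in bulks and caches small relations.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// A tuple is (join key, payload); a relation is a plain vector of tuples.
using Tuple = std::pair<int, std::string>;
using Relation = std::vector<Tuple>;

// Classic two-phase hash join: build a multimap over the build relation,
// then probe it with every tuple of the probe relation. The build phase
// blocks: no result can be emitted before it finishes.
std::vector<std::pair<Tuple, Tuple>> hash_join(const Relation& build,
                                               const Relation& probe) {
    std::unordered_multimap<int, Tuple> table;      // build phase
    for (const Tuple& t : build) table.emplace(t.first, t);

    std::vector<std::pair<Tuple, Tuple>> result;    // probe phase
    for (const Tuple& t : probe) {
        auto range = table.equal_range(t.first);
        for (auto it = range.first; it != range.second; ++it)
            result.emplace_back(it->second, t);
    }
    return result;
}
```

The "freeze" described in the text corresponds to the first loop: every dependent operator must wait until the whole build relation has been consumed.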
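The symmetric hash join of section 4 can be sketched in the same style: each arriving tuple is first probed against the other input's hash table and then inserted into its own, so no build phase is needed. This is again our own minimal sketch (struct and method names are illustrative), not the prototype's implementation.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using Tuple = std::pair<int, std::string>;  // (join key, payload)

// Symmetric hash join: two hash tables, one per input. A tuple that
// arrives "too early" still joins later, when its match probes the
// table it was inserted into. Hence the operator is non-blocking, at
// the cost of keeping both tables in memory.
struct SymmetricHashJoin {
    std::unordered_multimap<int, Tuple> r_table, s_table;
    std::vector<std::pair<Tuple, Tuple>> output;

    void on_r(const Tuple& t) {             // tuple arrives from R
        auto range = s_table.equal_range(t.first);
        for (auto it = range.first; it != range.second; ++it)
            output.emplace_back(t, it->second);
        r_table.emplace(t.first, t);
    }
    void on_s(const Tuple& t) {             // tuple arrives from S
        auto range = r_table.equal_range(t.first);
        for (auto it = range.first; it != range.second; ++it)
            output.emplace_back(it->second, t);
        s_table.emplace(t.first, t);
    }
};
```

Results can be emitted as soon as any matching pair has been seen, regardless of the arrival order, which is exactly what makes this operator suitable for pipelined, inter-operator-parallel plans.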
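Finally, the scatter/gather parallelization of Pic. 3 (partition S into N fragments, join R with each fragment in its own helper thread, and union the results) can be sketched as follows. For brevity this sketch uses std::thread instead of the POSIX threads used in our prototype, and a plain nested loop per fragment; all names are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <utility>
#include <vector>

using Row = std::pair<int, int>;            // (join key, payload)
using Rel = std::vector<Row>;

// One helper thread's work: a plain nested-loop join of R with one
// fragment of S, written into a thread-private output buffer.
static void join_fragment(const Rel& r, const Rel& s_frag,
                          std::vector<std::pair<Row, Row>>& out) {
    for (const Row& a : r)
        for (const Row& b : s_frag)
            if (a.first == b.first) out.emplace_back(a, b);
}

// Scatter/gather: R JOIN S = (R JOIN S1) U ... U (R JOIN SN),
// where S1 U ... U SN = S.
std::vector<std::pair<Row, Row>> scatter_gather_join(const Rel& r,
                                                     const Rel& s,
                                                     unsigned n_threads) {
    std::vector<Rel> frags(n_threads);      // scatter S round-robin
    for (std::size_t i = 0; i < s.size(); ++i)
        frags[i % n_threads].push_back(s[i]);

    std::vector<std::vector<std::pair<Row, Row>>> partial(n_threads);
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < n_threads; ++i)  // one helper thread per fragment
        workers.emplace_back(join_fragment, std::cref(r),
                             std::cref(frags[i]), std::ref(partial[i]));
    for (auto& w : workers) w.join();

    std::vector<std::pair<Row, Row>> result;  // gather
    for (auto& p : partial)
        result.insert(result.end(), p.begin(), p.end());
    return result;
}
```

Thread-private output buffers keep the helper threads synchronization-free until the final gather, mirroring the role of the scatter and gather operators inserted around the join node in our QEPs.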
Future work will include parallelization of the symmetric hash join algorithm, deeper join trees for a more thorough analysis of the effects of threading on pipelining, a proper evaluation of scale-up, and possibly an evaluation on more capable equipment.

References

[1] Amol Deshpande, Zachary Ives, and Vijayshankar Raman. 2007. Adaptive query processing. Foundations and Trends in Databases 1, 1 (January 2007), 1-140.
[2] Goetz Graefe. 1993. Query evaluation techniques for large databases. ACM Comput. Surv. 25, 2 (June 1993), 73-169.
[3] P. Griffiths Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. 1979. Access path selection in a relational database management system. In Proceedings of the 1979 ACM SIGMOD International Conference on Management of Data (SIGMOD '79). ACM, New York, NY, USA, 23-34.
[4] Donald Kossmann. 2000. The state of the art in distributed query processing. ACM Comput. Surv. 32, 4 (December 2000), 422-469.
[5] M. Tamer Özsu and Patrick Valduriez. 1999. Principles of Distributed Database Systems (2nd Ed.). Prentice-Hall, Inc., Upper Saddle River, NJ, USA.
[6] Kirill Smirnov and George Chernishev. 2011. Networking and multithreading architectural aspects of distributed DBMS (in Russian). Software and Systems 1(93) (March 2011), 164-168.