                              B-link-trees for DB/OS Co-Design

                                                              Jan Mühlig
                                                       TU Dortmund University
                                                             Germany
                                                jan.muehlig@tu-dortmund.de



ABSTRACT
Modern and future hardware is characterized by many compute units, large main memories, and heterogeneity. It offers plenty of computational power, but software has to adapt to these hardware changes. In this paper, we present aspects of the MxKernel, a bare-metal runtime targeted at such hardware and at Database/Operating System Co-Design. As a key feature, the MxKernel does not use threads but a smaller unit of work as its abstraction for parallel control flows. To study the behavior of this model, we implemented a prototypical index data structure, whose results are presented and discussed.

Keywords
Databases on Modern Hardware, Database/Operating System Co-Design, Database Performance

30th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 22.05.2018 - 25.05.2018, Wuppertal, Germany. Copyright is held by the author/owner(s).

1.   INTRODUCTION
   While hardware changes in the direction of many processor cores, large amounts of main memory, and complex memory architectures, software must adapt to these changes as well. The purpose of increasing the number of processor units is a linear speedup, which is hard to reach. One reason is that parts of a software cannot be parallelized, e.g. because concurrent accesses to the same shared resource, like shared memory, must be synchronized [2, 14]. Otherwise, the software could end up in undefined behavior such as system crashes or incorrect results [25]. Consequently, synchronization mechanisms like locking should be avoided both in Operating Systems (OSs) and in applications running on top of them, including Database Management Systems (DBMSs).
   In this work, we introduce aspects of the project "MxKernel: A Bare-Metal Runtime System for Database Operations in Heterogeneous Many-Core Hardware", a joint project of the Embedded System Software Group and the Databases and Information Systems Group at TU Dortmund. While common OSs let the database run as an application on top, the MxKernel is a minimal layer that can be used by both OS and DBMS as an abstraction layer for the underlying hardware. This software architecture enables the reuse of data structures and algorithms, e.g. indexes, which are used by OSs for file system implementations [11, 18] and by databases to manage their data. Furthermore, all applications running on top of the MxKernel get a less abstract interface to the hardware than most operating systems offer so far. In addition to the software architecture, one main aspect of the MxKernel is a control flow abstraction that is not built on the common thread model and promises a more straightforward way to avoid locks.
   This paper is organized as follows. In Section 2 we describe modern hardware in more detail and discuss its problematic aspects. The concept of the MxKernel is introduced in Section 3. As a first data structure using the MxKernel, we built a B-link-tree, which is presented and discussed in Section 4, before we summarize in Section 5. Finally, the next steps of the project with respect to databases are outlined in Section 6.

2.   MODERN HARDWARE
   Current and future hardware has several properties that software needs to take into account. This section discusses some of these characteristics.

2.1   Growing core count
   Because it is no longer feasible to increase the clock frequencies of processing units, more cores are installed on a chip instead, in order to enable more computational power through parallelism [3]. This growing number of processor cores affects servers as well as end-user systems such as desktop computers and mobile devices. In fact, we already have manycore processors today, and core counts will continue to grow in the future [8]. Because not every part of an application can be parallelized, concurrent control flows have to be synchronized at some point, for example when updating a shared memory location or writing to the command line. While techniques like mutexes and spinlocks ensure that only one control flow passes a critical section, smarter algorithms use atomic load, store, and compare operations [7] to implement wait- or lock-free synchronization [21]. The development of such algorithms is difficult and depends on the underlying data structure.
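   As a minimal illustration of the difference, the following C++ sketch contrasts a mutex-guarded update with a lock-free update based on an atomic compare-and-swap loop. It is an illustrative example of the general technique, not MxKernel code.

    #include <atomic>
    #include <mutex>

    // Lock-based variant: only one control flow may pass the
    // critical section at a time.
    std::mutex counter_mutex;
    long locked_counter = 0;

    void increment_locked() {
        std::lock_guard<std::mutex> guard(counter_mutex);
        ++locked_counter;  // critical section
    }

    // Lock-free variant: retry with compare-and-swap until the
    // update succeeds; no control flow ever blocks.
    std::atomic<long> atomic_counter{0};

    void increment_lock_free() {
        long expected = atomic_counter.load();
        // compare_exchange_weak fails if another control flow updated
        // the counter in between; 'expected' is refreshed and we retry.
        while (!atomic_counter.compare_exchange_weak(expected, expected + 1)) {
        }
    }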
2.2   Non-uniform Memory Access and cache coherence
   With the ongoing increase of processor units, symmetric memory architectures stopped scaling on modern manycore systems [14]. Instead, more complex memory architectures like Non-Uniform Memory Access (NUMA) exist in modern hardware. A NUMA system is divided into several nodes, where the CPUs are attached directly to the node's local memory. The memory is still organized as a coherent address space in which every core can access every memory address, but the latency differs between local and remote accesses and is typically higher in the latter case. Thus, memory placement matters and can be a reason for bad performance [6] when data is obtained from a remote NUMA node. To avoid the latency of remote memory accesses, the OS and other applications should prefer local data. Experiments in the context of DBMSs have shown that NUMA awareness can improve performance, but this requires knowledge about the system and interfaces to allocate local memory and to place threads in defined regions [16]. Furthermore, NUMA-aware join algorithms like MPSM [1] and Handshake join [23] have confirmed this.
   In terms of distributed memory, parallel computing, and scalability, caches and cache coherence are also significant. While caches speed up repeated access to data, redundant copies must be synchronized through complex cache hierarchies. Since locks that are touched by many cores are usually implemented on top of shared memory, cache coherence traffic quickly renders such solutions unscalable [22].
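   On Linux, one such interface is the libnuma library [17]. The following minimal sketch, our own illustration rather than code from the systems cited above, pins the calling thread to one NUMA node and allocates memory from that node's local region:

    #include <numa.h>    // link with -lnuma
    #include <cstddef>
    #include <cstdio>

    int main() {
        if (numa_available() < 0) {
            std::fprintf(stderr, "no NUMA support on this system\n");
            return 1;
        }
        const int node = 0;
        // Restrict the calling thread to the CPUs of the chosen node ...
        numa_run_on_node(node);
        // ... and allocate memory from that node's local region.
        const std::size_t size = 1 << 20;
        void* buffer = numa_alloc_onnode(size, node);
        // Accesses to 'buffer' from this thread now stay NUMA-local.
        if (buffer != nullptr) {
            numa_free(buffer, size);
        }
        return 0;
    }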
2.3   Heterogeneity
   In addition to the growing number of computing units, we also see an increase in heterogeneous hardware, such as cores with specialized functions within a single machine [4]. The cores may differ in energy consumption, performance characteristics, and even in their instruction set architecture [15]. Peripheral devices are used more and more often as well, such as special compute units based on Field-Programmable Gate Arrays (FPGAs) and Graphics Processing Units (GPUs), which are exploited, for example, for database query processing [9].

3.   MXKERNEL
   The goal of the MxKernel is to improve the performance of manycore systems with regard to the increasing amount of data. To achieve this, the MxKernel forms the basis for applications like DBMSs and for operating system services alike. In addition, small units of work, which we call MxTasks, are used instead of threads as the control flow abstraction.
   This section gives an overview of the software architecture of the kernel, of MxTasks, and of how to synchronize them.

3.1   Architecture
   In most cases, operating systems are the basis for applications. The OS manages and abstracts the underlying hardware, which makes it simple to deploy applications on a wide range of different hardware. For this purpose, OSs provide interfaces to the hardware and a set of services to the applications running on top. However, this approach also has disadvantages; for example, the OS is the only component that knows much about the running applications.
   DBMSs in particular need only a few of those provided services and implement their own ones instead. This suggests that "a small efficient operating system with only desired services" [26] would be preferred by database people. Furthermore, it is also important to consider the overall status of the system if the database is to perform well in the presence of other applications [12], but in most cases, only the OS has knowledge about that status. While existing OSs in industry are being adapted, like Linux for Oracle's Exadata Database Machine [27], some research operating systems allow specialization for applications [28, 4].

[Figure 1: Software architecture of the MxKernel. DBMS and OS services run side by side on top of the MxKernel, which manages the underlying CPUs, memory, and GPU.]

   In contrast to this, the MxKernel, whose software architecture is shown in Figure 1, provides a platform for both OS and DBMS. This makes sense for various reasons. When the operating system is the base layer for applications, the hardware is abstracted to minimize the effort of supporting different hardware architectures. On the other hand, the abstraction may become pervasive, leaving applications fewer possibilities to control hardware resources precisely. As hardware changed after OSs such as Linux were published, it has become tedious to adopt these changes. For example, the entire Linux kernel was temporarily guarded by a single lock to prevent multiple threads from passing through the kernel in parallel [5]. Another example is libnuma [17], a library that provides an interface for NUMA architectures but does not fit seamlessly into existing interfaces.
   By providing a minimal layer for OSs and for applications like DBMSs, which normally run on top but need less abstracting interfaces to the underlying hardware, we expect advantages for both sides. For example, the DBMS could place data on disk more effectively, improve the scheduling of control flows, and take more care of NUMA awareness.
   Further, there are components that are implemented by both OSs and DBMSs. Indexing techniques, for example, are used in databases to efficiently locate tuples. File systems like that of BeOS [11] implement similar data structures and algorithms to access files quickly. Both use the same approach, but they cannot share concrete implementations because of the layering, where the database is built on top of the OS or vice versa. As a result, these functionalities are held redundantly, whereas the MxKernel's software architecture allows sharing and reduces such redundancies.
3.2   Control flow abstraction
   Threads are a well-known and heavily used method to abstract concurrent control flows. Many OSs, as well as programming languages, implement threads, which nevertheless have some disadvantages. When several threads access a data structure at the same time and at least one of them updates the data, the accesses must be atomic or synchronized, e.g. by mutexes or spinlocks. This can impair the scalability of the system because, with such synchronization, only a single thread can pass the guarded section.
   Another costly aspect is the scheduling of threads and the context switches it entails. Since there are usually more threads than processing units on a system, the operating system has to schedule them periodically. When a thread is preempted by the OS in favor of another, the context of the replaced thread has to be saved and the context of the restored thread has to be recovered. On a Linux-based system, a context switch takes microseconds up to milliseconds [20].
   An alternative to a thread, which represents a long sequence of instructions, is to split the work into smaller units, named tasks. Several libraries and operating systems make use of this concept, e.g. Intel's Threading Building Blocks [19] and the AUTOSAR OS [10], which provides an option for offline-scheduled tasks.
   The MxKernel also uses the concept of tasks, namely MxTasks, to manage compute resources. MxTasks are characterized by run-to-completion semantics, which means that a running task will never be preempted by the kernel. Thus, a task does not need its own stack; instead, all tasks of one core can share the same stack, which minimizes the cost of a context switch between two tasks. Further, the kernel guarantees that a task executes atomically on its core, whereas a common thread can be interrupted at any time. As a consequence, all tasks scheduled to the same core are synchronized by definition and do not need any lock. This makes it easier to synchronize conflicting tasks and simplifies the development of lock-free data structures and algorithms.
   With regard to the modern hardware described in Section 2, we see another benefit of tasks over heavyweight threads. Due to the usually longer lifetime and execution time of threads, it is difficult to predict their memory accesses and to migrate threads to the suitable NUMA region. MxTasks, on the other hand, have a short execution time and therefore touch less memory. This allows finer-grained scheduling with respect to local memory accesses.
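   To give a rough impression of this programming model, the following C++ sketch shows how an MxTask and a per-core worker loop might look. The names and interfaces (MxTask, Worker, ready_queue) are assumptions made for illustration, not the actual MxKernel API:

    #include <deque>
    #include <functional>

    // Hypothetical task type: a short piece of work that, once
    // started, runs to completion and is never preempted.
    using MxTask = std::function<void()>;

    // Hypothetical per-core worker: tasks on the same core execute
    // strictly one after another and are serialized without any lock.
    struct Worker {
        std::deque<MxTask> ready_queue;  // filled by the (omitted) scheduler

        void run() {
            while (!ready_queue.empty()) {
                MxTask task = std::move(ready_queue.front());
                ready_queue.pop_front();
                task();  // run to completion; shared stack, no preemption
            }
        }
    };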
3.3   Synchronization
   Nevertheless, accesses to one resource from different MxTasks have to be synchronized. Otherwise, different tasks could update one resource on different cores at the same time, as shown in Figure 2a, which causes undefined behavior. For this purpose, we present two methods.

[Figure 2: Concept of the Synchronize-Task. (a) Unsynchronized access: MxTasks on several cores access Resources a and b concurrently. (b) Synchronized by core mapping: all tasks accessing a resource are assigned to the core the resource is mapped to. (c) Synchronized by Synchronize-Task: tasks enqueue to a SyncTask that represents the resource.]

CPU core based synchronization.
   Based on the run-to-completion semantics of tasks, we can ensure that all tasks executed on the same CPU core are serialized. By implication, this means that when all accesses to one resource are performed by tasks assigned to the same core, the resource does not need to be protected by locks or mutual exclusion. Following this, we can synchronize multiple tasks accessing the same resource, without any overhead, by scheduling them to the same core. This is shown in Figure 2b, where both Resource a and Resource b are mapped to the first core in the system.
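   A minimal sketch of this method, reusing the hypothetical Worker type from above: a resource is mapped to a core by its memory address (the concrete mapping below is our assumption; Section 4.1 only states that the address is used), and enqueueing a task means appending it to the ready queue of that core:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    std::vector<Worker> workers;  // one hypothetical worker per core

    // Map a resource to a core by its memory address; the divisor is
    // an assumed block granularity, not a value from the paper.
    std::size_t core_of(const void* resource) {
        return (reinterpret_cast<std::uintptr_t>(resource) / 64)
               % workers.size();
    }

    // All tasks touching 'resource' end up on the same core and are
    // serialized by run-to-completion, without any lock. (A real
    // implementation would need a concurrent queue for cross-core
    // enqueues; this sketch omits that.)
    void enqueue_task(const void* resource, MxTask task) {
        workers[core_of(resource)].ready_queue.push_back(std::move(task));
    }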
Synchronize-Task.
   Using core based synchronization with a fixed resource-to-core mapping may result in an unbalanced load, where some cores have a lot of work and others little. To avoid this balancing problem and to get rid of the static task-to-core assignment, we implemented a special task for synchronization, called Synchronize-Task. In a system where concurrent tasks access shared resources as shown in Figure 2a, every Synchronize-Task represents one such shared resource, e.g. a monitor. Every task that wants to use the shared resource, for example to print some text on the monitor, needs to enqueue itself to the waiting list of the Synchronize-Task, as shown in Figure 2c. After that, the Synchronize-Task registers itself as ready to run with the MxKernel. When the MxKernel executes the Synchronize-Task like a normal MxTask, a set of the tasks waiting in its list is executed directly. To avoid overly long execution times, the number of tasks run within one activation of a Synchronize-Task is restricted, and the Synchronize-Task is marked as ready again if not all (sub-)tasks were executed. In this way, Synchronize-Tasks can move around between cores to balance the load and to take care of NUMA-aware execution. Moreover, when multiple tasks accessing the same data are executed consecutively, their behavior is more cache-friendly, because already cached data can be reused across several tasks directly.
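   The following sketch outlines this mechanism with the hypothetical types from the previous sketches; the batch limit and the registration logic are illustrative assumptions, not the MxKernel implementation:

    #include <cstddef>
    #include <deque>

    // Hypothetical Synchronize-Task: represents one shared resource
    // and executes a bounded batch of waiting tasks per activation.
    struct SyncTask {
        std::deque<MxTask> waiting;               // tasks wanting the resource
        static constexpr std::size_t kBatch = 8;  // assumed batch limit

        // A task enqueues itself; the SyncTask then registers as ready.
        void enqueue(MxTask task, Worker& any_worker) {
            waiting.push_back(std::move(task));
            if (waiting.size() == 1) {  // first waiter: become ready
                any_worker.ready_queue.push_back(
                    [this, &any_worker] { run_batch(any_worker); });
            }
        }

        // Executed by the kernel like a normal MxTask.
        void run_batch(Worker& worker) {
            for (std::size_t i = 0; i < kBatch && !waiting.empty(); ++i) {
                MxTask task = std::move(waiting.front());
                waiting.pop_front();
                task();  // consecutive execution keeps the data cache-warm
            }
            if (!waiting.empty()) {  // not done: mark ready again, possibly
                worker.ready_queue.push_back(   // on another core
                    [this, &worker] { run_batch(worker); });
            }
        }
    };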
4.   MICRO-BENCHMARK
   The task model introduced in Section 3 differs from programming with well-known threads and resembles event-based, asynchronous development. To get started, we opted for a B-link-tree-based index structure, with which we have already gained some experience in our group [24].

4.1   B-link-tree
   B-trees and their variations, like B+-trees and B-link-trees [13], are key-value stores and well-established data structures for indexing data in databases and file systems. In contrast to original B-trees, B-link-trees store values in leaf nodes only; inner nodes point the way down to child nodes, using keys as fences. Additionally, and in contrast to B+-trees, every node in a B-link-tree contains a high key, which indicates the highest key that node will hold, and a link to its right sibling. The latter allows sequential processing of the inner and leaf nodes. Moreover, the B-link-tree allows the split operation caused by a node overflow to be executed in two separate steps [13], whereby only the modified node must be protected by a latch. This property allows the B-link-tree to be implemented with the task model by considering each node of the tree as a shared resource, which can be synchronized by the methods presented in Section 3.3.
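   For illustration, a B-link-tree node could be laid out as follows; key and value types, the fan-out, and the field names are assumptions for this sketch, not the layout of our actual implementation:

    #include <array>
    #include <cstddef>
    #include <cstdint>
    #include <mutex>

    // Simplified B-link-tree node with assumed 32-bit keys,
    // 64-bit values, and a fixed fan-out.
    constexpr std::size_t kFanout = 64;

    struct Node {
        bool is_leaf;
        std::uint32_t high_key;  // highest key this node will ever hold
        Node* right_sibling;     // link enabling sequential scans and
                                 // the two-step split protocol [13]
        std::size_t num_entries;
        std::array<std::uint32_t, kFanout> keys;
        union {                  // leaves store values, inner nodes children
            std::array<std::uint64_t, kFanout> values;
            std::array<Node*, kFanout> children;
        };
        std::mutex latch;        // used only by the thread-based variant
    };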
   As an example of developing with the task model in the context of the MxKernel, we take a look at the insert operation, which stores a key-value pair as a record in the B-link-tree. Algorithm 1 shows the pseudo code of a simplified insert operation using threads. First, we find a leaf in the B-link-tree by traversing through the levels of the tree, shown in lines 2 to 7. Once we have found the correct leaf, we insert the record composed of a key and the corresponding value, represented by lines 8 to 11. In case the leaf node is full, it is split into two nodes, each filled with half of the key-value pairs, and the separator to the new node has to be propagated up to the parent node. Because this step is not important for the comparison between the two abstraction models, we neglect it.

    Data: tree, key, value
  1 node ← Root(tree)
  2 while node is not leaf do
  3     TakeLatch(node)
  4     n ← FindChild(node, key)
  5     ReleaseLatch(node)
  6     node ← n
  7 end
  8 TakeLatch(node)
  9 if node is not full then
 10     Insert(node, key, value)
 11     ReleaseLatch(node)
 12 else
        // recursive split and insert
 13 end

   Algorithm 1: Thread-based insert operation on a B-link-tree.
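   In C++, the latched traversal and insertion of Algorithm 1 could look like the following sketch, assuming the Node layout from above and two hypothetical helpers for child lookup and leaf insertion:

    #include <cstdint>

    Node* find_child(Node* inner, std::uint32_t key);      // hypothetical
    void insert_into_leaf(Node* leaf, std::uint32_t key,   // hypothetical
                          std::uint64_t value);

    void insert_threaded(Node* root, std::uint32_t key,
                         std::uint64_t value) {
        Node* node = root;
        while (!node->is_leaf) {         // lines 2 to 7: latched traversal
            node->latch.lock();
            Node* child = find_child(node, key);
            node->latch.unlock();
            node = child;
        }
        node->latch.lock();              // lines 8 to 11: latched insertion
        if (node->num_entries < kFanout) {
            insert_into_leaf(node, key, value);
            node->latch.unlock();
        } else {
            node->latch.unlock();
            // recursive split and insert omitted, as in Algorithm 1
        }
    }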
   While navigating down to the leaf the record will be stored in, we need to take a latch on every inner node to prevent other threads from updating that node. Even on the leaf node, we have to hold a latch during insertion. As a result, when frequently used nodes, such as the root node, are accessed by multiple threads, the latch data structure is touched by all of them, and lock contention becomes a big part of the execution time [28]. Another aspect is memory access in a NUMA system: when multiple threads on different NUMA regions touch the same node, the data is shipped across, which can incur high latency. By using the task model, we want to bring the code to the data instead of moving the data to the code.
   The concept of MxTasks and the way they are synchronized with each other gives us the possibility to avoid this lock contention. In contrast to threads, where one thread traverses the tree to find a leaf, multiple tasks are used to process the tree node by node, as shown in Algorithm 2. Every insert task starts its work at the root node of the tree. Assuming the root node is not a leaf, the task looks up the next child node and creates a follow-up task for that node (lines 1-3). On the located child node, these steps are repeated until we reach a leaf. When the task has located the wanted leaf, the key-value pair is inserted (line 6) and the task is done. If the node has to be split, an insert task for the pointer to the new node is spawned at the parent node. Deviating from the thread model, where the propagation of the new node would be done recursively bottom-up, the MxKernel uses tasks to propagate the key-pointer pair for the new node up to the parent.

    Data: node, key, value
  1 if node is not leaf then
  2     node ← FindChild(node, key)
  3     EnqueueTask(node, InsertTask(key, value))
  4 else
  5     if node is not full then
  6         Insert(node, key, value)
  7     else
            // split, insert and enqueue propagation task to parent
  8     end
  9 end

   Algorithm 2: Insertion using task abstraction.

   With regard to the different synchronization techniques described in Section 3.3, the EnqueueTask method (line 3) can have various implementations. When using core based synchronization, every node is mapped to a core based on its memory address. Enqueueing in this context means marking the task as ready to run on the mapped core, so that only this core will process the tasks accessing the node. Otherwise, when the Synchronize-Task is used, every node is seen as a shared resource and becomes an MxTask which can be processed by the MxKernel. In this way, every node executes the tasks that access it, avoiding concurrent accesses to one node.
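   With the hypothetical enqueue_task from Section 3.3 and the helpers above, the task-based insert of Algorithm 2 might be sketched as follows. Note that no latch is taken anywhere: mutual exclusion follows from the fact that all tasks for one node are serialized by the enqueue mechanism.

    // Task-based insert following Algorithm 2: each activation handles
    // one node, then hands the work on to the next node (sketch).
    void insert_task(Node* node, std::uint32_t key, std::uint64_t value) {
        if (!node->is_leaf) {
            Node* child = find_child(node, key);       // lines 1-2
            enqueue_task(child, [child, key, value] {  // line 3
                insert_task(child, key, value);        // resume at child
            });
        } else if (node->num_entries < kFanout) {
            insert_into_leaf(node, key, value);        // lines 5-6
        } else {
            // split, then enqueue a propagation task at the parent (line 7)
        }
    }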
4.2   Results
   All measurements were carried out on a machine with the hardware shown in Table 1. While the MxKernel runs directly on the hardware, we used an Ubuntu 17.10 with Linux kernel 4.13.0-36 to measure the thread-based variant of our benchmark.

    Processor       2× Intel Xeon CPU E5-2690
    Base clock      2.90 GHz
    CPU cache       L1: 32 KB / L2: 256 KB / L3: 20 MB
    NUMA nodes      2 (CPUs 0-7 / CPUs 8-15)

   Table 1: Hardware setup for experiments.

   As a workload, we insert 5,000,000 randomly generated 32-bit key-value pairs into the B-link-tree, using a global set of predefined values which are "stolen" by the threads and tasks. The results of our experiments can be seen in Figure 3.

[Figure 3: Results of the B-link-tree. Throughput (operations/second) over the number of cores/threads (1-16) for Linux threads, MxTasks (CPU-synchronized), and MxTasks (Synchronize-Task).]

   A first finding is that all measurements show a loss of performance when using more than eight cores.
At this point, the first processor with its eight cores is no longer sufficient and some tasks and threads are scheduled to the second processor, which is also a separate NUMA region. While the thread benchmark (red) is fastest at four cores and slows down when additional cores are added, all variants of the benchmark using MxTasks gain throughput up to the ninth core. In the thread variant, we have no influence on the scheduling of threads and leave it to Linux. Therefore, it is possible that two threads on the same core compete for computing time, which may slow down the benchmark.
   We also see some differences in throughput between the two techniques of task synchronization described in Section 3.3. While synchronization by Synchronize-Task (blue) seems to be slower for the first eleven cores, the core based synchronization (green) loses more speed when using 12-15 cores. In the end, when using all 16 processing units, their throughput is approximately equivalent.
   By profiling the benchmark with both techniques of task synchronization, we have uncovered problems concerning the task queue data structure, which is used by both the Synchronize-Task and the task management of the MxKernel. In order to achieve wait-free behavior of the queue, an atomic variable is shared between consumer and producer whenever just one or no item remains in the queue. In the case of the Synchronize-Task, where each node is represented by such a task, we get a lot of contention on that shared variable, which slows down the whole application. Unbalanced workloads in which some cores have little work end up behaving the same way. The extra effort to keep the caches coherent seems to become more expensive when more than one NUMA region is involved, which results in poor performance on two nodes.
   Nonetheless, MxTasks seem like a promising approach to make better use of modern hardware than usual threads do. Although we have not made any optimizations regarding the NUMA architecture and do not yet schedule tasks in the best possible way, the MxKernel achieved a higher throughput on the B-link-tree.

5.   SUMMARY
   In this paper, we presented the MxKernel, a bare-metal runtime system for Database/Operating System Co-Design. With this project, we focus on modern hardware that is characterized by many cores, complex memory architectures, and heterogeneity. Small units of work are used to abstract control flows rather than threads.
   As a first data structure, we implemented a B-link-tree on top of the MxKernel and MxTasks. Experiments have revealed promising results and have shown that tasks sometimes scale better than threads, even though there is still room for optimization. With the software architecture outlined above, we will obtain a better interface between software and hardware to enable optimizations with regard to modern hardware.

6.   NEXT STEPS
   As the experiments show, we need to find a more efficient data structure for task management in order to reduce the contention on shared variables. Considering the intention to create a full runtime environment for databases and operating systems, we must first solve basic problems such as efficient memory allocation. With MxTasks, we have a data structure that is allocated and destroyed at a high frequency, and our current usage of a global heap may become a bottleneck for scalability. Similarly, there is still no clarity as to how tasks can best be scheduled to the available cores. Therefore, we want to model dependencies among MxTasks and their access patterns as metadata attached directly to the tasks. Offline scheduling may also be an option.
   In regard to databases, more data structures, like hash tables, have to be implemented in order to further explore the behavior and programming of MxTasks. Furthermore, we will add a transactional interface to the B-link-tree.

7.   ACKNOWLEDGMENTS
   This research was supported by the Deutsche Forschungsgemeinschaft (DFG), project number TE 111/2-1.

8.   REFERENCES
 [1] M.-C. Albutiu, A. Kemper, and T. Neumann. Massively parallel sort-merge joins in main memory multi-core database systems. Proceedings of the VLDB Endowment, 5(10):1064–1075, 2012.
 [2] G. M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the April 18-20, 1967, spring joint computer conference, pages 483–485. ACM, 1967.
 [3] A. Barbalace, B. Ravindran, and D. Katz. Popcorn: a replicated-kernel OS based on Linux. In Proceedings of the Linux Symposium, Ottawa, Canada, 2014.
 [4] A. Baumann, P. Barham, P.-E. Dagand, T. Harris, R. Isaacs, S. Peter, T. Roscoe, A. Schüpbach, and A. Singhania. The multikernel: a new OS architecture for scalable multicore systems. In Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles, pages 29–44. ACM, 2009.
 [5] S. P. Bhattacharya and V. Apte. A measurement study of the Linux TCP/IP stack performance and scalability on SMP systems. In Communication System Software and Middleware, 2006. Comsware 2006. First International Conference on, pages 1–10. IEEE, 2006.
 [6] S. Blagodurov, S. Zhuravlev, A. Fedorova, and A. Kamali. A case for NUMA-aware contention management on multicore systems. In Proceedings of the 19th international conference on Parallel architectures and compilation techniques, pages 557–558. ACM, 2010.
 [7] H.-J. Boehm and S. V. Adve. Foundations of the C++ concurrency memory model. In ACM SIGPLAN Notices, volume 43, pages 68–78. ACM, 2008.
 [8] S. Borkar. Thousand core chips: a technology perspective. In Proceedings of the 44th annual Design Automation Conference, pages 746–749. ACM, 2007.
 [9] S. Breß, H. Funke, and J. Teubner. Robust query processing in co-processor-accelerated databases. In Proceedings of the 2016 International Conference on Management of Data, pages 1891–1906. ACM, 2016.
[10] K. Devika and R. Syama. An overview of AUTOSAR multicore operating system implementation. International Journal of Innovative Research in Science, Engineering and Technology, 2:3162–3169, 2013.
[11] D. Giampaolo. Practical File System Design. Morgan Kaufmann Publishers, 1998.
[12] J. Giceva, T.-I. Salomie, A. Schüpbach, G. Alonso, and T. Roscoe. COD: Database/operating system co-design. In CIDR, 2013.
[13] G. Graefe et al. Modern B-tree techniques. Foundations and Trends in Databases, 3(4):203–402, 2011.
[14] J. L. Hennessy and D. A. Patterson. Computer architecture: a quantitative approach. Elsevier, 2011.
[15] E. Ipek, M. Kirman, N. Kirman, and J. F. Martinez. Core fusion: accommodating software diversity in chip multiprocessors. ACM SIGARCH Computer Architecture News, 35(2):186–197, 2007.
[16] T. Kiefer, B. Schlegel, and W. Lehner. Experimental evaluation of NUMA effects on database management systems. In BTW, volume 13, pages 185–204, 2013.
[17] A. Kleen. A NUMA API for Linux. Novell Inc, 2005.
[18] P. Koruga and M. Bača. Analysis of B-tree data structure and its usage in computer forensics. In Central European Conference on Information and Intelligent Systems, 2010.
[19] A. Kukanov and M. J. Voss. The foundations for scalable multi-core software in Intel Threading Building Blocks. Intel Technology Journal, 11(4), 2007.
[20] C. Li, C. Ding, and K. Shen. Quantifying the cost of context switch. In Proceedings of the 2007 workshop on Experimental computer science, page 2. ACM, 2007.
[21] M. M. Michael. Safe memory reclamation for dynamic lock-free objects using atomic reads and writes. In Proceedings of the twenty-first annual symposium on Principles of distributed computing, pages 21–30. ACM, 2002.
[22] S. Ramos and T. Hoefler. Cache line aware optimizations for ccNUMA systems. In Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing, pages 85–88. ACM, 2015.
[23] P. Roy, J. Teubner, and R. Gemulla. Low-latency handshake join. Proceedings of the VLDB Endowment, 7(9):709–720, 2014.
[24] M. Schröder. Using Modern Synchronization Mechanisms in Databases. Master's thesis, TU Dortmund, Dortmund, Germany, 2016.
[25] A. Silberschatz, P. B. Galvin, and G. Gagne. Operating system concepts essentials. John Wiley & Sons, Inc., 2014.
[26] M. Stonebraker. Operating system support for database management. Communications of the ACM, 24(7):412–418, 1981.
[27] R. Weiss. A technical overview of the Oracle Exadata database machine and Exadata storage server. Oracle White Paper. Oracle Corporation, Redwood Shores, 2012.
[28] D. Wentzlaff and A. Agarwal. Factored operating systems (fos): the case for a scalable operating system for multicores. ACM SIGOPS Operating Systems Review, 43(2):76–85, 2009.