=Paper=
{{Paper
|id=Vol-2126/paper9
|storemode=property
|title=B-link-trees for DB/OS Co-Design
|pdfUrl=https://ceur-ws.org/Vol-2126/paper9.pdf
|volume=Vol-2126
|authors=Jan Mühlig
|dblpUrl=https://dblp.org/rec/conf/gvd/Muhlig18
}}
==B-link-trees for DB/OS Co-Design==
Jan Mühlig
TU Dortmund University
Germany
jan.muehlig@tu-dortmund.de
ABSTRACT
With modern and future hardware, which is characterized by many compute units, large main memories, and heterogeneity, we have enough computational power, but software has to adapt to these hardware changes. In this paper, we present some aspects of the MxKernel, a bare-metal runtime focused on such hardware and on Database / Operating System Co-Design. As a key feature, the MxKernel does not use threads but a smaller unit of work as its abstraction model for parallel control flows. To study the behavior of this model, we implemented a prototypical data structure for indexing, whose results are presented and discussed.

Keywords
Databases on Modern Hardware, Database/Operating System Co-Design, Database Performance

1. INTRODUCTION
While hardware changes towards many processor cores, large amounts of main memory, and complex memory architectures, software must also adapt to these changes. The purpose of increasing the number of processor units is a linear speedup, which is hard to reach. One reason for this lies in the parts of a software that cannot be parallelized, e.g. because concurrent accesses to the same shared resource, like shared memory, must be synchronized [2, 14]. Otherwise, the software could end up in undefined behavior such as system crashes or incorrect results [25]. In conclusion, this means that synchronization like locking has to be avoided both for Operating Systems (OSs) and applications running on top, including Database Management Systems (DBMSs).

In this work, we will introduce interesting aspects of the project "MxKernel: A Bare-Metal Runtime System for Database Operations in Heterogeneous Many-Core Hardware", which is a joint project of the Embedded System Software Group and the Databases and Information Systems Group at TU Dortmund. While common OSs let the database run as an application on top, the MxKernel is a minimal layer which can be used by both OS and DBMS as an abstraction layer for the underlying hardware. This software architecture enables the reuse of data structures and algorithms, e.g. indexes, which are used by OSs for file system implementation [11, 18] and by databases to manage the data. Furthermore, all applications running on top of the MxKernel will have a less abstract interface to the hardware than most operating systems offer so far. In addition to the software architecture, one main aspect of the MxKernel is a control flow abstraction that is not built on the common thread model and promises a more straightforward way to avoid locks.

This paper is organized as follows. In Section 2 we describe modern hardware in more detail and discuss problematic aspects. More precise information about the concept of the MxKernel is given in Section 3. As a first data structure using the MxKernel, we built a B-link-tree, which is presented and discussed in Section 4, before we summarize in Section 5. Finally, the next steps of the project with respect to databases are outlined in Section 6.

2. MODERN HARDWARE
There are several properties of current and future hardware that software needs to take into account. This section discusses some of these characteristics.

2.1 Growing core count
Because it is no longer feasible to increase the clock frequencies of processing units, more cores are installed on a chip instead in order to enable more power through parallelism [3]. This growing number of processor cores affects servers as well as end-user systems, such as desktop computers and mobile devices. In fact, today we already have manycore processors, and in the future their core counts will continue to increase [8]. Because not every part of an application can be parallelized, concurrent control flows have to be synchronized at some point, for example when updating a shared memory location or writing to the command line. While techniques like mutexes and spinlocks ensure that only one control flow passes a critical section, smarter algorithms use atomic load, store, and compare operations [7] to implement wait- or lock-free synchronization [21]. The development of these algorithms is difficult and depends on the underlying data structure.
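As a small illustration (ours, not code from the MxKernel), the following C++ sketch contrasts a mutex-guarded counter with a lock-free variant built on an atomic compare-and-swap loop:

#include <atomic>
#include <mutex>

std::mutex counter_mutex;
long counter_locked = 0;

// Mutex-based: only one control flow may pass the critical
// section; all others block on the mutex.
void increment_locked() {
    std::lock_guard<std::mutex> guard(counter_mutex);
    ++counter_locked;
}

std::atomic<long> counter{0};

// Lock-free: retry with an atomic compare-and-swap instead of blocking.
void increment_lock_free() {
    long expected = counter.load();
    // Succeeds only if no other thread changed the value in between;
    // on failure, 'expected' is refreshed and we simply retry.
    while (!counter.compare_exchange_weak(expected, expected + 1)) {
    }
}

For a plain counter, an atomic fetch_add would already suffice; the retry loop stands in for the more complex lock-free updates referenced above, whose design depends on the data structure at hand.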
2.2 Non-uniform Memory Access and cache coherence
With the ongoing increase of processor units, symmetric memory architectures stopped scaling with modern manycore systems [14]. Instead, more complex memory architectures like Non-uniform Memory Access (NUMA) exist in modern hardware systems. A NUMA system is divided into several nodes, on which the CPUs are attached directly to the node's local memory. The memory is still organized in a coherent memory space where every core can access every memory address, but the latency differs between local and remote accesses and may be higher in the latter case. Thus, memory access patterns matter and can be a reason for bad performance [6] when obtaining data from a remote NUMA node. To avoid the latency that occurs on remote memory access, the OS and other applications should prefer local data. Experiments in the context of DBMSs have shown that NUMA awareness can improve performance, but this requires knowledge about the system and interfaces to allocate local memory and place threads in defined regions [16]. Furthermore, NUMA-aware join algorithms like MPSM [1] and Handshake join [23] have confirmed this.
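On Linux, such interfaces are offered by libnuma [17], which is discussed again in Section 3.1. A minimal sketch of node-local allocation and thread placement (our example, using the library's standard calls) could look as follows:

#include <numa.h>    // libnuma [17]; link with -lnuma
#include <cstddef>
#include <cstdio>

int main() {
    if (numa_available() < 0) {
        std::fprintf(stderr, "NUMA is not supported on this system\n");
        return 1;
    }
    const int node = 0;
    // Restrict the calling thread to the CPUs of NUMA node 0 ...
    numa_run_on_node(node);
    // ... and allocate from the same node's local memory, so that
    // subsequent accesses avoid the remote-access latency.
    const std::size_t size = 1 << 20;
    void* buffer = numa_alloc_onnode(size, node);
    // ... operate on 'buffer' ...
    numa_free(buffer, size);
    return 0;
}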
In terms of distributed memory, parallel computing, and scalability, caches and cache coherence are also significant. While caches speed up repeated access to data, redundant data must be synchronized through complex cache hierarchies. Since locks that are touched by many cores are often implemented using shared memory, cache coherence quickly results in solutions that are not scalable [22].
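The effect can be illustrated with false sharing (our example, not taken from the paper's experiments): two counters that merely share a cache line force the coherence protocol to bounce that line between cores, even though no data is logically shared; padding each counter to its own line avoids this.

#include <atomic>
#include <cstdint>

// Both counters fit into one 64-byte cache line: every update on one
// core invalidates the line in all other cores' caches.
struct SharedLine {
    std::atomic<std::uint64_t> a{0};
    std::atomic<std::uint64_t> b{0};
};

// Padding each counter to its own cache line keeps coherence traffic
// local; alignas(64) assumes a 64-byte line, common on x86.
struct PaddedLines {
    alignas(64) std::atomic<std::uint64_t> a{0};
    alignas(64) std::atomic<std::uint64_t> b{0};
};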
2.3 Heterogeneity
In addition to the growing number of computing units, we also see an increase in heterogeneous hardware, like different cores with specialized functions in a single machine [4]. The cores may differ in regard to energy consumption, performance characteristics, and also the instruction set architecture [15]. Even peripheral devices are used more often, such as special cores based on Field-Programmable Gate Arrays (FPGAs) and Graphics Processing Units (GPUs), which are for example exploited for database query processing [9].

3. MXKERNEL
The goal of the MxKernel is to improve the performance of manycore systems in regard to the increasing amount of data. In order to achieve this, the MxKernel forms the basis for applications like DBMSs as well as for operating system services. In addition, small units of work, which we call MxTasks, are used rather than threads as the abstraction for control flows. This section gives an overview of the software architecture of the kernel, the MxTasks, and how to synchronize them.
3.1 Architecture
In most cases, operating systems are the basis for applications. The OS manages and abstracts the underlying hardware, which makes it simple to deploy applications on a wide set of different hardware. For this purpose, OSs provide interfaces to the hardware and a set of services to applications running on top of it. However, this endeavor also has disadvantages; for example, the OS is the only one who knows much about the running applications.

In particular, DBMSs need only a few of those provided services and implement their own ones. This leads to the conclusion that "a small efficient operating system with only desired services" [26] would be preferred by database people. Furthermore, it is also important to consider the overall status of the system if the database is to perform well in the presence of other applications [12], but in most cases, only the OS has knowledge about that status. While existing OSs in industry are being adapted, like Linux for Oracle's Exadata Database Machine [27], some research operating systems allow specialization for applications [28, 4].

[Figure 1: Software architecture of the MxKernel. The MxKernel forms a minimal layer between the hardware (CPUs, GPU, memory) and both the DBMS and the OS services running on top.]

In contrast to this, the MxKernel, whose software architecture is shown in Figure 1, provides a platform for both OS and DBMS. This makes sense for various reasons. When the operating system is the base layer for applications, the hardware is abstracted to minimize any effort in regard to different hardware architectures. On the other hand, the abstraction may become prevalent, and applications have fewer possibilities to control hardware resources precisely. Where hardware changed after OSs such as Linux were published, it has become tedious to adopt these changes. For example, the entire Linux kernel was temporarily locked by a single lock to prevent multiple threads from passing the kernel in parallel [5]. Another example is libnuma [17], a library that provides an interface for NUMA architectures but does not fit seamlessly into existing interfaces.

By providing a minimal layer for OSs and other applications like DBMSs, which normally run on top and need less abstracting interfaces to the underlying hardware, we promise advantages for both sides. For example, the DBMS could make a better placement of data on the hard disk, improve the scheduling of control flows, and take more care of NUMA awareness.

Further, there are components that are introduced by both OSs and DBMSs. Indexing techniques, for example, are used in databases to efficiently locate tuples. Even file systems like the one of BeOS [11] implemented similar data structures and algorithms to access files in a quick way. Both use the same approach, but they cannot share concrete implementations because of the structure, where the database is built upon the OS or vice versa. As a result, those functionalities are held redundantly, whereas the MxKernel's software architecture allows sharing and reduces those redundancies.
[Figure 2: Concept of the Synchronize-Task. (a) Unsynchronized access: MxTasks on different cores access Resources a and b directly. (b) Synchronized by core mapping: all tasks accessing a resource run on that resource's core. (c) Synchronized by Synchronize-Task: one SyncTask per resource executes the waiting tasks.]
3.2 Control flow abstraction
Threads are a well-known and heavily used method to abstract concurrent control flows. Many OSs, as well as programming languages, implement threads, which however have some disadvantages. When several threads access a data structure at the same time and at least one of them updates the data, the accesses must be atomic or synchronized, e.g. by mutexes or spinlocks. This can impair the scalability of the system because, in case of synchronization, only a single thread can pass the guarded section.

Another costly aspect is the scheduling of threads and the context switches it entails. Since there are mostly more threads than processing units on a system, the operating system has to schedule them periodically. When a thread is suppressed by the OS for the benefit of another, the context of the replaced thread has to be saved and the context of the restored thread has to be recovered. On a Linux based system, a context switch takes microseconds up to milliseconds [20].

An alternative approach to threads, which represent long sequences of instructions, is to split the work into smaller units, named tasks. Several libraries and operating systems make use of this concept, e.g. Intel's Threading Building Blocks [19] and the AUTOSAR OS [10], which provides an option for offline scheduled tasks.

The MxKernel also uses the concept of tasks, namely MxTasks, to manage compute resources. MxTasks are characterized by run-to-completion semantics, which means that running tasks will never be suppressed by the kernel. Thus, a task does not need its own stack; instead, all tasks of one core can share the same stack, which minimizes the cost of a context switch between two tasks. Further, the kernel guarantees that the execution of a task on a core is atomic, whereas a common thread can be interrupted at any time. As a consequence, all tasks scheduled to the same core are synchronized by definition and do not need any lock. This makes it easier to synchronize conflicting tasks and simplifies the development of lock-free data structures and algorithms.

In regard to the modern hardware described in Section 2, we see another benefit of tasks in contrast to heavyweight threads. Due to the usually longer lifetime and execution time of threads, it is difficult to predict their memory accesses and migrate threads to the suitable NUMA region. MxTasks, on the other hand, have a short duration of execution and thereby fewer memory accesses. This allows finer scheduling with respect to local memory requests.
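A minimal sketch of this run-to-completion model (our simplification, not the MxKernel's actual interface) could look like this:

#include <functional>
#include <queue>
#include <utility>

// A task is a short piece of work that, once started, runs to
// completion: it is never preempted by the kernel.
struct Task {
    std::function<void()> body;
};

// Worker loop of a single core: tasks run strictly one after another,
// so they can share one stack and need no saved context.
struct Core {
    std::queue<Task> ready;

    void run() {
        while (!ready.empty()) {
            Task task = std::move(ready.front());
            ready.pop();
            task.body();   // executes atomically with respect to this core
        }
    }
};

Because two tasks on the same core can never interleave, any state touched only by tasks of one core is implicitly synchronized; the next section builds on exactly this property.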
3.3 Synchronization
Nevertheless, access to one resource from different MxTasks has to be synchronized. Otherwise, different tasks could update one resource on different cores at the same time, as shown in Figure 2a, which causes undefined behavior. For this purpose, we present two methods.

CPU core based synchronization.
Based on the run-to-completion semantics of tasks, we can ensure that all tasks executed on the same CPU core are serialized. By implication, this means that when all accesses to one resource are done by tasks assigned to the same core, the resource does not need to be protected by locks or mutual exclusion. Following this, we can synchronize multiple tasks accessing the same resource by scheduling them to the same core without any overhead. This is shown in Figure 2b, where both Resource a and b are mapped to the first core in the system.
Synchronize-Task.
Using the core based synchronization with a fixed resource-to-core mapping may result in an unbalanced load on the system, where some cores have a lot of work and others do not. In order to avoid that balancing problem and to get rid of the static task-to-core assignment, we implemented a special task for synchronization, called Synchronize-Task. Within a system where concurrent tasks access a shared resource as shown in Figure 2a, every Synchronize-Task represents such a shared resource, e.g. a monitor. Every task that wants to use the shared resource, for example to print some text on the monitor, needs to enqueue into the waiting list of the Synchronize-Task, as shown in Figure 2c. After that, the Synchronize-Task registers itself as ready to run with the MxKernel. When the MxKernel executes the Synchronize-Task as a normal MxTask, a set of tasks waiting in the list of the Synchronize-Task is executed directly. To avoid overly long execution times, the set of tasks run within one activation of a Synchronize-Task is restricted, and the Synchronize-Task is marked as ready again if not all (sub-)tasks were executed. In this way, Synchronize-Tasks can move around between cores to balance the load and take care of NUMA-aware execution. Moreover, when multiple tasks accessing the same data are executed consecutively, their behavior is more cache-friendly, because already cached data can be used by several tasks directly.
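Reusing the Task and Core sketch from above, the scheme could be approximated as follows (our reading of the description; the batch bound kBatchSize is an assumed parameter):

#include <cstddef>
#include <queue>
#include <utility>

// One Synchronize-Task guards one shared resource. Tasks that want
// the resource enqueue themselves into 'waiting'; the Synchronize-Task
// itself is executed like any normal MxTask on whichever core is free.
struct SynchronizeTask {
    std::queue<Task> waiting;                      // Task as sketched above
    static constexpr std::size_t kBatchSize = 32;  // assumed bound

    void run(Core& core) {
        // Run only a bounded batch so one resource cannot occupy a
        // core for too long.
        for (std::size_t i = 0; i < kBatchSize && !waiting.empty(); ++i) {
            Task task = std::move(waiting.front());
            waiting.pop();
            task.body();   // exclusive access to the resource
        }
        // If work is left over, mark this Synchronize-Task as ready
        // again; it may be picked up by a different core next time,
        // which is what allows load balancing between cores.
        if (!waiting.empty()) {
            core.ready.push(Task{[this, &core] { run(core); }});
        }
    }
};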
4. MICRO-BENCHMARK
The task model introduced in Section 3 differs from programming with well-known threads and feels more like event-based and asynchronous development. To get started, we opted for a B-link-tree-based index structure, with which we have already gained some experience in our group [24].

4.1 B-link-tree
Data: tree, key, value
1  node ← Root(tree)
2  while node is not leaf do
3      TakeLatch(node)
4      n ← FindChild(node, key)
5      ReleaseLatch(node)
6      node ← n
7  end
8  TakeLatch(node)
9  if node is not full then
10     Insert(node, key, value)
11     ReleaseLatch(node)
12 else
       // recursive split and insert
13 end
Algorithm 1: Thread based insert operation on a B-link-tree.

Data: node, key, value
1  if node is not leaf then
2      node ← FindChild(node, key)
3      EnqueueTask(node, InsertTask(key, value))
4  else
5      if node is not full then
6          Insert(node, key, value)
7      else
           // split, insert and enqueue propagation task to parent
8      end
9  end
Algorithm 2: Insertion using task abstraction.

Processor:  2× Intel Xeon CPU E5-2690
Base clock: 2.90 GHz
CPU Cache:  L1: 32 KB / L2: 256 KB / L3: 20 MB
NUMA Nodes: 2 (CPUs 0-7 / CPUs 8-15)
Table 1: Hardware setup for experiments.
B-trees and their variations like B+-trees and B-link-trees [13] are key-value stores and proven data structures for indexing data in databases and file systems. In contrast to original B-trees, B-link-trees store values in leaf nodes only; inner nodes point the way down to child nodes using keys as fences. Additionally, and contrary to B+-trees, every node in a B-link-tree contains a high key, which indicates the highest key that node will hold, and a link to its right sibling. The latter allows sequential processing of the inner and leaf nodes. Moreover, the B-link-tree allows the split operation caused by a node overflow to be executed in two separate steps [13], whereby only the modified node must be protected by a latch. This property allows the B-link-tree to be implemented with the task model by considering each node of the tree as a shared resource, which can be synchronized by the methods presented above in Section 3.3.
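A simplified node layout along these lines (our sketch; fan-out and key width are arbitrary choices) might look as follows:

#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kFanout = 64;

// Every node carries a high key and a pointer to its right sibling [13],
// so a traversal that races with a split can simply follow the link.
struct Node {
    bool is_leaf;
    std::uint32_t high_key;                   // highest key this node may hold
    Node* right_sibling;                      // link to the right neighbor
    std::size_t count;                        // number of used slots
    std::array<std::uint32_t, kFanout> keys;  // fences (inner) or keys (leaf)
    union {
        std::array<Node*, kFanout + 1> children;     // inner node
        std::array<std::uint64_t, kFanout> values;   // leaf node
    };
};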
As an example for developing with the task model in the context of the MxKernel, we will take a look at the insert operation, which stores a key-value pair as a record in the B-link-tree. Algorithm 1 shows the pseudo code of a simplified insert operation using threads. First, we find a leaf in the B-link-tree by traversing the different levels of the tree, shown in lines 2 to 7. Once we have found the correct leaf, we insert the record composed of a key and the corresponding value, represented by lines 8 to 11. In case the leaf node is full, it is split into two nodes, both filled with half of the key-value pairs, and the separator to the new node has to be propagated up to the parent node. Because this is not important for the comparison between the two abstraction models, we neglect this step.
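For reference, a direct C++ transcription of Algorithm 1 could look as follows (a sketch: find_child, is_full, and insert_into are assumed helpers on the Node sketch above, which is here extended with a per-node latch, and the split branch is omitted as in the pseudo code):

#include <cstdint>
#include <mutex>

// Assumed helpers on top of the Node sketch; 'latch' is a per-node
// mutex member added for the thread-based variant.
Node* find_child(Node* node, std::uint32_t key);
bool  is_full(const Node* node);
void  insert_into(Node* node, std::uint32_t key, std::uint64_t value);

void insert_threaded(Node* root, std::uint32_t key, std::uint64_t value) {
    Node* node = root;
    while (!node->is_leaf) {        // lines 2-7: latched descent
        node->latch.lock();
        Node* child = find_child(node, key);
        node->latch.unlock();
        node = child;
    }
    node->latch.lock();             // lines 8-11: latched insertion
    if (!is_full(node)) {
        insert_into(node, key, value);
        node->latch.unlock();
    } else {
        // recursive split and insert (omitted, as in Algorithm 1)
        node->latch.unlock();
    }
}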
While navigating down to the leaf the record will be stored in, we need to take a latch on every inner node to prevent other threads from updating that node. Even for the leaf node, we have to take a latch during insertion. As a result, when frequently used nodes, such as the root node, are accessed by multiple threads, the latch data structure is touched by all of them, and the lock contention can make up a big part of the execution time [28]. Another aspect is the memory access in a NUMA system. When multiple threads on different NUMA regions touch the same node, the data will be shipped across, which can be associated with high latency. By using the task model, we want to bring the code to the data instead of moving the data to the code.

The concept of MxTasks and the way they are synchronized with each other gives us the possibility to avoid the lock contention. In contrast to threads, where one thread traverses the tree to find a leaf, multiple tasks are used to process the tree node by node, as shown in Algorithm 2. Every insert task starts its work at the root node of the tree. Assuming that the root node is not a leaf, the task looks up the next child node and creates a follow-up task for that node (lines 1-3). On the located child node, these steps are repeated until we reach a leaf. When the task has located the wanted leaf, the key-value pair is inserted (line 6) and the task is done. In case the node has to be split, an insert task for the pointer to the new node is spawned at the parent node. Deviating from the thread model, where the propagation of the new node would be done bottom-up recursively, the MxKernel uses tasks to propagate the key-pointer pair for the new node up to the parent.
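Expressed with the task abstraction, the same insert could be written roughly as follows (our sketch, reusing the helpers declared above; enqueue_task stands in for the EnqueueTask method of Algorithm 2, whose possible implementations are discussed next):

// Hypothetical hook standing in for EnqueueTask of Algorithm 2.
void enqueue_task(Node* node, Task task);

// Task-based insert: instead of one thread traversing the tree, each
// step is a small task that enqueues a follow-up task at the next
// node. No latches are taken at all.
void insert_task(Node* node, std::uint32_t key, std::uint64_t value) {
    if (!node->is_leaf) {                 // lines 1-3: hop to the child
        Node* child = find_child(node, key);
        enqueue_task(child, Task{[=] { insert_task(child, key, value); }});
    } else if (!is_full(node)) {          // line 6: plain insertion
        insert_into(node, key, value);
    } else {
        // split, then enqueue a propagation task carrying the new
        // node's key-pointer pair to the parent (omitted here)
    }
}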
In regard to the different synchronization techniques described in Section 3.3, the EnqueueTask method (line 3) can have various implementations. When using the core based synchronization, every node is mapped to a core based on its memory address. Enqueue in this context means to mark the task as ready to run on the mapped core, so that only this core will process all tasks accessing the node. Otherwise, when the Synchronize-Task is used, every node is seen as a shared resource and is represented by a MxTask which can be processed by the MxKernel. In this way, every node executes the tasks that are accessing it, avoiding concurrent accesses to one node.

4.2 Results
All measurements are carried out on a machine using the hardware presented in Table 1. While the MxKernel is running directly on the hardware, we used an Ubuntu 17.10 with Linux kernel 4.13.0-36 to measure the thread based variant of our benchmark.

As a workload, we insert 5,000,000 randomly generated 32-bit key-value pairs into the B-link-tree using a global set of predefined values, which are "stolen" by threads and tasks. The results of our experiments can be seen in Figure 3. A first finding is that all measurements show a loss of performance when using more than eight cores.
[Figure 3: Results of the B-link-tree. Throughput (operations/second) over the number of cores/threads for Linux Threads, MxTasks (Synchronize-Task), and MxTasks (CPU-synchronized).]

At this point, the first processor with its eight cores is no longer sufficient, and some tasks and threads are scheduled to the second processor, which is also a separate NUMA region. While the thread benchmark (red) is fastest at four cores and slows down when additional cores are added, all variants of the benchmark using MxTasks gain more throughput up to the ninth core. For the thread variant, we have no influence on the scheduling of threads and left it to Linux. Therefore, it is possible that two threads on the same core compete for computing time, which may slow down the benchmark.

We also see some differences in throughput between the two techniques of task synchronization described in Section 3.3. While synchronization by Synchronize-Task (blue) seems to be slower when using up to eleven cores, the core based synchronization variant (green) loses more speed when using 12-15 cores. In the end, when using all 16 processing units, their throughput is approximately equivalent.

By profiling the given benchmark and both techniques for task synchronization, we have uncovered problems concerning the task queue data structure, which is used by both the Synchronize-Task and the task management of the MxKernel. In order to achieve wait-free behavior of the queue, an atomic variable is shared between consumer and producer when just one item or none is remaining in the queue. In the case of the Synchronize-Task, where each node is represented by such a task, we get a lot of contention on that shared variable, which slows down the whole application. Unbalanced workloads, in which some cores have little work, end up behaving the same way. The extra effort to keep the caches coherent seems to get more expensive when more than one NUMA region is involved, which results in poor performance on two nodes.

Nonetheless, MxTasks seem like a promising approach to make better use of modern hardware than usual threads do. Although we have not made any optimizations regarding the NUMA architecture and do not yet schedule tasks in the best possible way, the MxKernel achieved a higher throughput on the B-link-tree.

5. SUMMARY
In this paper, we presented the MxKernel, a bare-metal runtime system for Database/Operating System Co-Design. With the project, we focus on modern hardware that is characterized by many cores, complex memory architectures, and heterogeneity. Small units of work are used to abstract control flows rather than threads.

As a first data structure, we implemented a B-link-tree on top of the MxKernel and MxTasks. Experiments have revealed promising results and have shown that tasks sometimes scale better than threads, even though there is still room for optimizations. With the software architecture outlined above, we will obtain a better interface between software and hardware to enable optimization in regard to modern hardware.

6. NEXT STEPS
As the experiments show, we need to find a more efficient data structure for task management in order to reduce the contention on shared variables. Considering the intention to create a full runtime environment for databases and operating systems, we must first solve basic problems such as efficient memory allocation. Since MxTasks are data structures that are allocated and destroyed at high frequencies, our current usage of a global heap may be a bottleneck for scalability. Similarly, there is still no clarity as to how tasks can be ideally scheduled to the available cores. Therefore, we want to model dependencies among MxTasks and their access patterns as metadata, connected directly to the tasks. Also, offline scheduling may be an option.

In regard to databases, more data structures like hash tables have to be implemented in order to further explore the behavior and programming of MxTasks. Furthermore, we will add a transactional interface to the B-link-tree.

7. ACKNOWLEDGMENTS
This research was supported by the Deutsche Forschungsgemeinschaft (DFG), project number TE 111/2-1.

8. REFERENCES
[1] M.-C. Albutiu, A. Kemper, and T. Neumann. Massively parallel sort-merge joins in main memory multi-core database systems. Proceedings of the VLDB Endowment, 5(10):1064-1075, 2012.
[2] G. M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the April 18-20, 1967, Spring Joint Computer Conference, pages 483-485. ACM, 1967.
[3] A. Barbalace, B. Ravindran, and D. Katz. Popcorn: a replicated-kernel OS based on Linux. In Proceedings of the Linux Symposium, Ottawa, Canada, 2014.
[4] A. Baumann, P. Barham, P.-E. Dagand, T. Harris, R. Isaacs, S. Peter, T. Roscoe, A. Schüpbach, and A. Singhania. The multikernel: a new OS architecture for scalable multicore systems. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, pages 29-44. ACM, 2009.
[5] S. P. Bhattacharya and V. Apte. A measurement study of the Linux TCP/IP stack performance and scalability on SMP systems. In Communication System Software and Middleware, 2006. Comsware 2006. First International Conference on, pages 1-10. IEEE, 2006.
[6] S. Blagodurov, S. Zhuravlev, A. Fedorova, and A. Kamali. A case for NUMA-aware contention management on multicore systems. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, pages 557-558. ACM, 2010.
[7] H.-J. Boehm and S. V. Adve. Foundations of the C++ concurrency memory model. In ACM SIGPLAN Notices, volume 43, pages 68-78. ACM, 2008.
[8] S. Borkar. Thousand core chips: a technology perspective. In Proceedings of the 44th Annual Design Automation Conference, pages 746-749. ACM, 2007.
[9] S. Breß, H. Funke, and J. Teubner. Robust query processing in co-processor-accelerated databases. In Proceedings of the 2016 International Conference on Management of Data, pages 1891-1906. ACM, 2016.
[10] K. Devika and R. Syama. An overview of AUTOSAR multicore operating system implementation. International Journal of Innovative Research in Science, Engineering and Technology, 2:3162-3169, 2013.
[11] D. Giampaolo. Practical File System Design. Morgan Kaufmann Publishers, 1998.
[12] J. Giceva, T.-I. Salomie, A. Schüpbach, G. Alonso, and T. Roscoe. COD: Database/operating system co-design. In CIDR, 2013.
[13] G. Graefe et al. Modern B-tree techniques. Foundations and Trends in Databases, 3(4):203-402, 2011.
[14] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Elsevier, 2011.
[15] E. Ipek, M. Kirman, N. Kirman, and J. F. Martinez. Core fusion: accommodating software diversity in chip multiprocessors. ACM SIGARCH Computer Architecture News, 35(2):186-197, 2007.
[16] T. Kiefer, B. Schlegel, and W. Lehner. Experimental evaluation of NUMA effects on database management systems. In BTW, volume 13, pages 185-204, 2013.
[17] A. Kleen. A NUMA API for Linux. Novell Inc, 2005.
[18] P. Koruga and M. Bača. Analysis of B-tree data structure and its usage in computer forensics. In Central European Conference on Information and Intelligent Systems, 2010.
[19] A. Kukanov and M. J. Voss. The foundations for scalable multi-core software in Intel Threading Building Blocks. Intel Technology Journal, 11(4), 2007.
[20] C. Li, C. Ding, and K. Shen. Quantifying the cost of context switch. In Proceedings of the 2007 Workshop on Experimental Computer Science, page 2. ACM, 2007.
[21] M. M. Michael. Safe memory reclamation for dynamic lock-free objects using atomic reads and writes. In Proceedings of the Twenty-First Annual Symposium on Principles of Distributed Computing, pages 21-30. ACM, 2002.
[22] S. Ramos and T. Hoefler. Cache line aware optimizations for ccNUMA systems. In Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing, pages 85-88. ACM, 2015.
[23] P. Roy, J. Teubner, and R. Gemulla. Low-latency handshake join. Proceedings of the VLDB Endowment, 7(9):709-720, 2014.
[24] M. Schröder. Using Modern Synchronization Mechanisms in Databases. Master's thesis, TU Dortmund, Dortmund, Germany, 2016.
[25] A. Silberschatz, P. B. Galvin, and G. Gagne. Operating System Concepts Essentials. John Wiley & Sons, Inc., 2014.
[26] M. Stonebraker. Operating system support for database management. Communications of the ACM, 24(7):412-418, 1981.
[27] R. Weiss. A technical overview of the Oracle Exadata Database Machine and Exadata Storage Server. Oracle White Paper. Oracle Corporation, Redwood Shores, 2012.
[28] D. Wentzlaff and A. Agarwal. Factored operating systems (fos): the case for a scalable operating system for multicores. ACM SIGOPS Operating Systems Review, 43(2):76-85, 2009.