Multi-Core Allocation Model for Database Systems

Simone Dominico
Supervised by Eduardo Cunha de Almeida
Federal University of Paraná, Brazil
sdominico@inf.ufpr.br

Proceedings of the VLDB 2018 Ph.D. Workshop, August 27, 2018, Rio de Janeiro, Brazil. Copyright (C) 2018 for this paper by its authors. Copying permitted for private and academic purposes.

ABSTRACT
The Non-Uniform Memory Access (NUMA) architecture allows multiple tasks to run in parallel, with different memory access latencies between the nodes. The emergence of multi-core hardware offers high processing power for multi-threaded Database Management Systems (DBMS). However, database threads run across multiple NUMA cores without exploiting the hardware to its full potential. The impact of data movement between NUMA nodes is a major challenge for multi-threaded DBMS. The goal of our thesis is to find the efficient distribution of CPU-cores among the NUMA nodes in order to mitigate the data movement. We propose an abstract core allocation mechanism for query processing based on performance metrics. In our preliminary results, our mechanism improves the data traffic ratio between nodes by up to 3.87x, with memory throughput increased by up to 27%.

1. INTRODUCTION
The burgeoning ingestion of data requires new hardware to allow real-time data analysis. In our thesis we focus on Non-Uniform Memory Access (NUMA) hardware to boost the throughput of multi-threaded Database Management Systems (DBMS) in Online Analytical Processing (OLAP). The NUMA architecture is formed by multi-core nodes with processors (or CPUs) attached to a memory bank. The memory access latency varies according to the distance between the node and the memory being accessed.

The varying memory access latency of NUMA impacts the performance of multi-threaded query processing. We observed the impact of NUMA on multi-threaded DBMS executing the Volcano query parallelism model [4], like Microsoft SQLServer, and the materialization model, like MonetDB. For instance, in the Volcano model, the parallelism is encapsulated in "exchange" operators in the query execution: multiple threads execute the query plan and the Operating System (OS) scheduler is in charge of managing data and thread locality. In both models, however, the conventional approach is to let the OS map threads to as many nodes and cores as possible to keep load balance, without considering the different access latencies of the NUMA architecture. The result is the constant migration of threads all over the nodes. Figure 1 shows the execution of the TPC-H query 6 (Q6) in MonetDB, a multi-threaded DBMS. Our NUMA architecture consists of 4 nodes with a Quad-Core AMD Opteron 8387 each. The OS tries to keep load balance between the cores, causing many thread migrations and more data movement between the NUMA nodes. When running OLAP workloads, the current load balancing approach can cause many problems: resource contention, cache conflicts, cache invalidation and inefficient data movement.

Figure 1: Evaluating the migration between cores of the threads generated by the TPC-H Q6 in a single-client execution in a 4x Quad-Core NUMA machine (OS/MonetDB; threads T5-T20 plotted over core number and timestamp in ms).
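The paper does not describe the tooling behind Figure 1, but such a migration trace can be collected from standard Linux interfaces. The following C sketch is only an illustration (not the authors' tool): it periodically samples the CPU on which each thread of a target process last ran by reading the "processor" field (field 39) of /proc/<pid>/task/<tid>/stat; the sampling period and the number of samples are arbitrary choices.

    /* migration_trace.c -- illustrative sketch, not the authors' tool:
     * sample the CPU each thread of a target process last ran on by reading
     * the "processor" field (field 39) of /proc/<pid>/task/<tid>/stat. */
    #include <dirent.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    /* Return the CPU the thread last executed on, or -1 on error. */
    static int last_cpu(const char *pid, const char *tid)
    {
        char path[256], buf[1024];
        snprintf(path, sizeof path, "/proc/%s/task/%s/stat", pid, tid);
        FILE *f = fopen(path, "r");
        if (!f)
            return -1;
        size_t n = fread(buf, 1, sizeof buf - 1, f);
        fclose(f);
        buf[n] = '\0';
        char *p = strrchr(buf, ')');             /* skip "pid (comm)" safely */
        if (!p)
            return -1;
        char *tok = strtok(p + 2, " ");          /* field 3: state           */
        for (int field = 3; field < 39 && tok; field++)
            tok = strtok(NULL, " ");             /* advance to field 39      */
        return tok ? atoi(tok) : -1;
    }

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <pid>\n", argv[0]);
            return 1;
        }
        char dir[64];
        snprintf(dir, sizeof dir, "/proc/%s/task", argv[1]);
        for (int sample = 0; sample < 100; sample++) { /* 100 samples, 10 ms apart */
            DIR *d = opendir(dir);
            if (!d)
                return 1;
            struct dirent *e;
            while ((e = readdir(d)) != NULL)
                if (e->d_name[0] != '.')
                    printf("%d\tT%s\tcore %d\n", sample, e->d_name,
                           last_cpu(argv[1], e->d_name));
            closedir(d);
            usleep(10000);
        }
        return 0;
    }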
Many authors have investigated these problems by trying to control the scheduling of threads and data movement across NUMA nodes [8, 3, 11, 5, 12, 7, 13] from within the DBMS kernel. Our thesis follows a different direction, in which we manage the allocation of NUMA cores from an abstract model-based mechanism, aiming to accommodate the current workload and mitigate the data movement. Abstract models allow supporting several underlying systems without modifications to their source code. Besides, an abstract model allows different resources to be monitored, for example memory throughput and CPU load. Our mechanism in a nutshell: a dynamic mechanism computes the local optimum number of cores with a rule-condition-action pipeline integrated with performance monitoring. This pipeline defines the performance-invariant through performance thresholds. If the performance-invariant holds, the local optimum number of cores is set. The main challenges addressed by our mechanism are as follows:

1. Provide an elastic multi-core allocation mechanism for the NUMA architecture.

2. Define an optimum number of cores based on the use of computational resources to meet a workload.

3. Manage the allocation of cores for the execution of mixed workloads.

In this paper, we discuss our approach to tackle these three challenges, the contribution already published, and the current work in progress.

2. RELATED WORK
Over the last years, many different approaches were presented to improve the performance of DBMS multi-core parallelism on the NUMA architecture. [10] presents a study of the NUMA effect when assigning threads to NUMA nodes in OLTP. The authors present an approach of "hardware islands" to execute different OLTP deployments. They implement a prototype within the Shore-MT storage manager that achieves robust OLTP throughput. However, the "hardware islands" do not scale out and the optimum size of the island is yet undetermined.

[8] presents a NUMA-aware ring-based algorithm to coordinate the movement of threads to catch up with data. The algorithm uses two rings: an inner ring to represent the data partitions and an outer ring to represent the threads. The outer ring rotates clockwise for all threads to access specific data partitions. The goal is to improve data placement and thread placement. In [3], NUMA cores are allocated one by one to mitigate access to remote memory when the OS tries to keep data locality of MonetDB. Other storage and worker placement approaches are presented to mitigate the NUMA effect in SAP HANA [11] or to build the ERIS storage engine from scratch [5].

All of these contributions show important efforts into making DBMSs explore multi-core CPUs to their full potential, but they differ from our goal, as they allocate NUMA cores statically to perform data and worker placement strategies. Recently, [12] presented an adaptive NUMA-aware data placement mechanism. The mechanism decides between data placement and thread stealing when load imbalance is detected. The number of working threads changes based on core utilization, which differs from MonetDB [3] and SQLServer [6], which statically define the number of threads based on the number of available cores.

In [7], a database scheduler is presented to control the dispatching of query fragments, called "morsels". The "morsels" are statically pinned to specific cores to take advantage of the data location and avoid data movement between the nodes. However, this scheduler does not take into account the optimum number of cores to tackle the current workload.

In contrast to the related work, we propose a dynamic multi-core allocation mechanism to mitigate the data movement by providing the OS with the efficient subset of NUMA cores to perform database thread mapping.

3. PROPOSED APPROACH
Our proposal involves an elastic multi-core allocation mechanism to define the optimum number of cores to treat the current workload. We define that the allocation of CPU-cores should be performed dynamically to satisfy the demands of the workload and facilitate OS thread scheduling with the least possible negative impact on performance. Our key idea is to consider data placement statistics from the OS side to define on which node the cores will be allocated.

3.1 Elastic Multi-core Allocation Mechanism
The full description of our multi-core allocation mechanism is presented in [2]. Our mechanism is based on Predicate/Transition Petri Nets (PrT). Petri Nets are powerful abstract models to design concurrent and distributed systems. Our mechanism leverages Petri Nets to model general properties of DBMSs under concurrent OLAP workloads. Using the abstract model, we have the flexibility to check any resource usage by the database threads. The mechanism monitors the resource usage of the worker threads on top of OS kernel facilities and tools (e.g., cgroups, mpstat, numactl, likwid) to decide on the allocation of CPU cores.
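The integration of these facilities is detailed in [2]. Purely as an illustration, the minimal C sketch below shows how a chosen subset of cores and nodes could be imposed on a running DBMS through the cgroup v1 cpuset controller; the cgroup path /sys/fs/cgroup/cpuset/dbms and the function names are assumptions, not part of the published mechanism.

    /* apply_cores.c -- minimal sketch (assumed paths, cgroup v1 cpuset):
     * expose only the selected cores and NUMA nodes to the DBMS process. */
    #include <stdio.h>
    #include <stdlib.h>

    #define CG "/sys/fs/cgroup/cpuset/dbms"     /* assumed, pre-created cgroup */

    static int write_str(const char *file, const char *val)
    {
        char path[256];
        snprintf(path, sizeof path, CG "/%s", file);
        FILE *f = fopen(path, "w");
        if (!f)
            return -1;
        int ok = fputs(val, f) >= 0;
        return (fclose(f) == 0 && ok) ? 0 : -1;
    }

    /* cpus: e.g. "0-3,8-11"; mems: e.g. "0,2"; pid: the DBMS server process. */
    static int apply_cores(const char *cpus, const char *mems, long pid)
    {
        char buf[32];
        if (write_str("cpuset.cpus", cpus))     /* cores visible to the DBMS      */
            return -1;
        if (write_str("cpuset.mems", mems))     /* NUMA nodes visible to the DBMS */
            return -1;
        snprintf(buf, sizeof buf, "%ld", pid);
        return write_str("cgroup.procs", buf);  /* move the whole DBMS process in */
    }

    int main(int argc, char **argv)
    {
        if (argc != 4) {
            fprintf(stderr, "usage: %s <cpus> <mems> <dbms-pid>\n", argv[0]);
            return 1;
        }
        return apply_cores(argv[1], argv[2], atol(argv[3])) ? 1 : 0;
    }

Because rewriting cpuset.cpus takes effect immediately on the processes already attached to the cgroup, the same writes are enough to grow or shrink the visible core set as the mechanism's decisions change.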
3.2 The Allocation of Cores
The challenge of the mechanism is to prevent both underutilization and overutilization of the system by finding the local optimum number of cores (LONC) to accommodate a given workload. In this section, we define the LONC and also the allocation modes explored by the mechanism.

3.2.1 The multi-core allocation modes
In the mechanism, we define resource usage thresholds (e.g., CPU load or data traffic in the memory controller) that are used to decide when it is necessary to allocate or release cores. An adaptive algorithm decides on which node to allocate or release cores, taking into account the accessed memory addresses kept in a priority queue data structure.

Figure 2 shows the allocation of close cores (Dense) and far-apart cores (Sparse). The adaptive allocation is a combination of both with respect to the efficient subset of cores. To find the affinity between threads and data, our approach checks misses in the Translation Lookaside Buffer (TLB). When a TLB miss occurs, the OS maps the thread accessing the missed address to a node, keeping this information in a data structure. Over time, new threads requesting the same address range are mapped to the same node where the data is allocated. Therefore, we refer to the processing cores in the nodes holding the requested address ranges as "the efficient subset of cores".

Figure 2: The Sparse and Dense modes over time on four nodes S0-S3 with cores 0-15 ((a) Sparse Mode, (b) Dense Mode). Only the black boxes (i.e., cores) can be accessed by the OS. The red boxes are the next cores to allocate.

In our adaptive mode, each entry of the priority queue keeps the PIDs of the active threads with their address spaces and the number of pages per NUMA node. Cores are allocated on the nodes with more accessed pages, and cores on the nodes with the fewest accessed pages are released. In the implementation of the affinity between threads and data, the data structure stores the node IDs used to pin threads and allocate their address space.
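The implementation of the priority queue is not given in this paper. Assuming simple per-node counters as a stand-in for its entries, the allocate/release decision of the adaptive mode could be sketched in C as follows; NODES, pages[] and cores_alloc[] are illustrative names, not identifiers from the prototype.

    /* node_choice.c -- illustrative sketch of the adaptive placement decision:
     * allocate the next core on the node with the most accessed pages and
     * release a core from the node with the fewest accessed pages. */
    #include <stdio.h>

    #define NODES 4
    #define CORES_PER_NODE 4

    static long pages[NODES];        /* pages touched per node (from TLB-miss mapping) */
    static int  cores_alloc[NODES];  /* cores currently exposed to the OS, per node    */

    /* Node on which to allocate one more core; -1 if every node is full. */
    static int node_to_allocate(void)
    {
        int best = -1;
        for (int n = 0; n < NODES; n++)
            if (cores_alloc[n] < CORES_PER_NODE &&
                (best < 0 || pages[n] > pages[best]))
                best = n;
        return best;
    }

    /* Node from which to release one core; -1 if no core is allocated. */
    static int node_to_release(void)
    {
        int worst = -1;
        for (int n = 0; n < NODES; n++)
            if (cores_alloc[n] > 0 &&
                (worst < 0 || pages[n] < pages[worst]))
                worst = n;
        return worst;
    }

    int main(void)
    {
        pages[0] = 900; pages[2] = 400;          /* e.g., hot data on S0 and S2  */
        cores_alloc[0] = 4; cores_alloc[1] = 1;  /* S0 full, one idle core on S1 */
        printf("allocate on node %d, release from node %d\n",
               node_to_allocate(), node_to_release());
        return 0;
    }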
3.2.2 The local optimum number of cores
The number of allocated cores takes into account the arithmetic CPU-load average of the active database threads. Formally, we define how to compute the number of cores as follows:

    ∀ w ∃ n_alloc | (th_min < u < th_max) ∧ p(n_alloc) ≥ p(n_total)    (1)

For any OLAP workload w, there is a number of CPU cores n_alloc such that the load u of each core is between the minimum and maximum thresholds, and the database performance p(n_alloc) is equal to or better than the performance p(n_total) with all the CPU cores available in the hardware. The performance function p(x) relies on system counters provided by the OS and the database.
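As a minimal sketch of the rule-condition-action step behind Equation (1), the fragment below grows the allocation when the average load u exceeds th_max, shrinks it when u drops below th_min, and otherwise keeps the current LONC. The helper avg_thread_load() is a placeholder for the real OS counters (e.g., mpstat or likwid readings of the DBMS worker threads), and the one-core step size is an illustrative choice.

    /* lonc_step.c -- sketch of one rule-condition-action step behind Eq. (1):
     * keep the average CPU load of the active database threads between
     * th_min and th_max by growing or shrinking the allocated core count. */
    #include <stdio.h>

    #define TH_MIN 10.0   /* below this average load, cores are underused  */
    #define TH_MAX 70.0   /* above this average load, cores are overloaded */

    static double avg_thread_load(void)
    {
        return 83.0;      /* placeholder sample; the mechanism reads OS counters */
    }

    /* One monitoring step: returns the new local optimum number of cores. */
    static int lonc_step(int n_alloc, int n_total)
    {
        double u = avg_thread_load();
        if (u >= TH_MAX && n_alloc < n_total)
            return n_alloc + 1;     /* overutilized: offer one more core    */
        if (u <= TH_MIN && n_alloc > 1)
            return n_alloc - 1;     /* underutilized: release one core      */
        return n_alloc;             /* invariant th_min < u < th_max holds  */
    }

    int main(void)
    {
        int n_alloc = 4, n_total = 16;
        n_alloc = lonc_step(n_alloc, n_total);
        printf("local optimum number of cores: %d\n", n_alloc);
        return 0;
    }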
EXPERIMENTS scheduling choices resulting in less remote access, less inter- In a preliminary study, we ran the experiments with OLAP connection traffic and improved memory throughput. workload on a NUMA machine formed by 4-node with a Figure 5 shows the result of the 22 queries of the TPC- QuadCore AMD Opteron 8387 each. Nodes are intercon- H with 256 concurrent clients executing the queries in the nected by Hyper-Transport (HT) link 3.x achieving 41.6 sequence of the TPC-H in 1GB database. We present the GB/s maximum aggregate bandwidth. We implemented our impact of our mechanism in the core allocation when the prototype in C language and we compare our mechanism to data access pattern changes. In the memory throughput the Linux Debian 8 “Jessie” OS scheduling of threads spawn results, our adaptive mechanism was 41% faster than the by the MonetDB (v11.25.5) DBMS. We let all the 16 cores OS/MonetDB. We observe that the system does not use all available to the DBMS when running without the support of the nodes all the time. For instance, initially the system 3 Execution Speedup (Adaptive Mode) 6. ACKNOWLEDGMENTS This work was partly funded by CAPES and CNPQ. 1.16 1.39 1.15 1.41 1.11 1.40 1.48 1.17 1.43 1.18 1.22 1.33 1.53 1.33 1.17 1.35 1.10 1.35 1.29 1.34 1.37 1.32 1 0.8 7. REFERENCES HT/IMC ratio 0.6 [1] S. Dominico, E. C. de Almeida, and J. A. Meira. A petrinet mechanism for OLAP in NUMA. In DaMoN, 0.4 2017. 0.2 [2] S. Dominico, E. C. de Almeida, J. A. Meira, and M. A. Z. Alves. An elastic multi-core allocation 0 mechanism for database systems. In ICDE, 2018. Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16 Q17 Q18 Q19 Q20 Q21 Q22 Queries [3] M. Gawade and M. L. Kersten. NUMA obliviousness through memory mapping. In DaMoN, 2015. OS/MonetDB Dense Sparse Adaptive [4] G. Graefe. Encapsulation of parallelism in the volcano query processing system. In SIGMOD., pages 102–111, Figure 6: Performance results for 1 GB TPC-H 1990. queries with 256 concurrent clients. On the left, [5] T. Kissinger, T. Kiefer, B. Schlegel, D. Habich, the ratio HT/IMC shows how NUMA-friendly is the D. Molka, and W. Lehner. ERIS: A NUMA-aware system (the smaller, the better). On the top is the in-memory storage engine for analytical workload. In performance speedup for adaptive mode. ADMS, 2014. [6] P. Larson, C. Clinciu, C. Fraser, E. N. Hanson, M. Mokhtar, M. Nowakiewicz, V. Papadimos, S. L. used only the nodes S0 and S2. Price, S. Rangarajan, R. Rusanu, and M. Saubhasik. Figure 6 shows the results of interconnection traffic in re- Enhancements to SQL server column stores. In lation with traffic HT/IMC ratio. In this experiment the SIGMOD, pages 1159–1168, 2013. 256 concurrent clients execute the 22 queries in random or- [7] V. Leis, P. A. Boncz, A. Kemper, and T. Neumann. der. The results show the reduction in the local/remote Morsel-driven parallelism: a NUMA-aware query per-query data traffic ratio of up 3.87x (2.47x on average). evaluation framework for the many-core age. In Overall, the adaptive mode presented the best results. For SIGMOD, pages 743–754, 2014. instance, we observed 2x smaller HT/IMC ratio than the [8] Y. Li, I. Pandis, R. Mueller, V. Raman, and G. M. OS/MonetDB for queries such as Q9 that have the largest Lohman. NUMA-aware algorithms: the case of data number of joins operations. With less cores available, threads shuffling. In CIDR, 2013. locate data in the local node more often than the current ap- [9] U. F. Minhas, R. Liu, A. Aboulnaga, K. Salem, J. 
Ng, proach of letting all the cores available to the OS scheduler and S. Robertson. Elastic scale-out for partition-based (i.e., OS/MonetDB). database systems. ICDEW12. [10] D. Porobic, I. Pandis, M. Branco, P. Tözün, and A. Ailamaki. OLTP on hardware islands. PVLDB, 5. CURRENT STATUS & FUTURE WORK (11), 2012. This paper describes the motivation and challenges to our [11] I. Psaroudakis, T. Scheuer, N. May, A. Sellami, and approach in the allocation of NUMA CPU-cores for paral- A. Ailamaki. Scaling up concurrent main-memory lel execution of OLAP. In our first evaluations, we discuss column-store scans: Towards adaptive NUMA-aware the impact of data movement between NUMA nodes for data and task placement. PVLDB, (12), 2015. multi-threaded DBMS that hand over the mapping of query [12] I. Psaroudakis, T. Scheuer, N. May, A. Sellami, and threads to the OS. We show the difficulty faced by the OS A. Ailamaki. Adaptive NUMA-aware data placement scheduler to keep load balance, generating a vast amount of and task scheduling for analytical workloads in data movement, interconnection traffic and cache invalida- main-memory column-stores. PVLDB, (2), 2016. tion. In the initial contribution of this thesis [2], we present [13] V. Raman, G. K. Attaluri, R. Barber, N. Chainani, an abstract model-based mechanism to support the thread D. Kalmuk, V. KulandaiSamy, J. Leenstra, scheduling and data allocation across NUMA sockets. The S. Lightstone, S. Liu, G. M. Lohman, T. Malkemus, mechanism is the first part of our approach to mitigate the R. Müller, I. Pandis, B. Schiefer, D. Sharpe, R. Sidle, data movement in NUMA nodes. The preliminary results A. J. Storm, and L. Zhang. DB2 with BLU showed performance improvements when our mechanism of- acceleration: So much more than just a column store. fered to the OS only the local optimum CPU-cores, instead PVLDB, 6(11):1080–1091, 2013. of the traditional approach of making all the cores visible to the OS all the time, like in current multi-threaded DBMSs: MonetDB, SQL Server, SAP Hana and Hyper. As future directions, we plan to explore mixed workloads as we observed that OLAP not always need the entire setup of CPU-cores. Therefore, our agenda includes redesigning our abstract model to accommodate concurrent OLTP and OLAP. In particular, we plan to study the multi-core alloca- tion in cloud computing environments that can particularly benefit from our model. 4