Efficient Scale-Out Using Query-Driven Workload Distribution and Fragment Allocation

Stefan Halfpap
Supervised by Prof. Hasso Plattner
Hasso Plattner Institute, University of Potsdam, Germany
stefan.halfpap@hpi.de

Proceedings of the VLDB 2019 PhD Workshop, August 26th, 2019, Los Angeles, California. Copyright (C) 2019 for this paper by its authors. Copying permitted for private and academic purposes.

ABSTRACT
Database replication is an approach for scaling throughput and ensuring high availability. Using workload knowledge, we are able to load-balance queries to replica nodes according to the data being accessed. However, balancing the load evenly while maximizing data reuse is a challenging allocation problem. To address large problem sizes, we developed a novel decomposition-based heuristic using linear programming. We compare our approach with a rule-based state-of-the-art allocation algorithm using a real-world workload comprising thousands of queries. Further, we outline how we plan to extend our approach to versatile allocation problems, e.g., considering changing workloads and robustness against node failures.

1. INTRODUCTION
Partitioning and replication are means to allow databases to process increasing workloads. Analyses of real workloads show that read-only queries account for the largest workload share [1, 11]. Scaling read-only queries is relatively simple, as we can execute them on read-only replicas without violating transactional consistency [6, 16].

Using a naive load-balancing approach, one can distribute queries independently of the accessed data. As a result, one has to store all data and apply all data modifications on all nodes. Further, queries are unlikely to profit from caching effects, because similar queries are not guaranteed to be assigned to the same replica.

In our research, we investigate query-driven workload distributions, i.e., load-balancing queries based on their accessed data. In particular, we reduce the amount of required memory on all nodes while balancing the load evenly. Query-driven workload distributions are also beneficial for maximizing caching effects. Further, when adding new nodes to a database cluster, we are able to decide which data to load first to quickly process a large share of the workload.

In practice, there are varying constraints and goals for distributing the workload and/or the required data on nodes. By using linear programming (LP), we are able to address versatile allocation problems, which include robustness against potential node failures or efficient reallocations to react to changing workloads.

The remainder of this paper is structured as follows: We introduce the basic allocation problem in Section 2. In Section 3, we discuss related work. In Section 4, we summarize our scalable decomposition-based approach [9] to calculate workload distributions for large problem sizes. Further, we demonstrate it using a real workload. Our current research and plans for future work are described in Section 5. Section 6 concludes the paper.

2. PROBLEM DESCRIPTION
The allocation problem being examined is a coupled workload distribution and data placement problem. We assume a horizontally and/or vertically partitioned database with N disjoint data fragments. Each fragment i has a size a_i, i = 1, ..., N. Note that we assume a separation of the data partitioning and the allocation process, which is common [13]. In this way, existing workload-aware data partitioning algorithms can be used to generate the input fragments for our allocation approach.

Further, we assume a workload consisting of Q queries (query classes). Each query class j is defined by the subset of fragments q_j ⊆ {1, ..., N}, j = 1, ..., Q, it accesses. Queries account for workload shares, defined by query frequencies f_j and query costs c_j, j = 1, ..., Q.

Last, we assume a number of nodes K across which to load-balance the workload evenly. In practice, the number of nodes K can be chosen manually by a database administrator or automatically by a replication framework with regard to the desired query throughput.

We want to decide which query should be executed to which extent on which node in order to minimize the overall memory consumption of all nodes. Processing query j on node k requires storing all fragments q_j on the node. In [9], we derive an LP model to calculate optimal solutions.
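The exact model is given in [9]. As a rough illustration, a mixed-integer program of the following shape is consistent with the definitions above; the variable names are ours and the formulation in [9] may differ in details. Let x_{jk} denote the share of query j's load processed on node k, and let the binary variable y_{ik} indicate whether fragment i is stored on node k:

```latex
% Illustrative model shape (our notation; the formulation in [9] may differ).
\begin{align*}
\min \;& \sum_{k=1}^{K} \sum_{i=1}^{N} a_i \, y_{ik}
  && \text{(total memory consumption } W \text{)}\\
\text{s.t.} \;& \sum_{k=1}^{K} x_{jk} = 1,
  && j = 1, \dots, Q \quad \text{(each query fully assigned)}\\
& \sum_{j=1}^{Q} f_j c_j \, x_{jk} = \frac{1}{K} \sum_{j=1}^{Q} f_j c_j,
  && k = 1, \dots, K \quad \text{(even load)}\\
& x_{jk} \le y_{ik},
  && i \in q_j, \;\; \forall j, k \quad \text{(required fragments are stored)}\\
& x_{jk} \ge 0, \quad y_{ik} \in \{0, 1\}.
\end{align*}
```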
Figure 1 shows an example workload and an optimal solution, i.e., an even workload distribution for Q = 5 queries, accessing N = 10 fragments, that minimizes the overall memory consumption for a database cluster with K = 4 nodes. (We assume an equal size for each fragment, i.e., a_i = 1, i = 1, ..., N.)

By W/V, we denote the replication factor of an allocation, where the total amount of data used W is normalized by the minimal amount of used data V. Because each of the ten fragments is accessed by at least one query, the minimal amount of used data is equal to the number of fragments, V = N = 10. The replication factor of our exemplary allocation for four nodes is W/V = 1.4.

[Figure 1 (diagram omitted): the five queries with their workload shares, the database with V = 10 fragments, and the resulting four-node allocation.]

Figure 1: Workload-driven fragment allocation (based on [8]). The left-hand side of the figure visualizes the model input. The database consists of N = 10 fragments. Q = 5 queries correspond to different workload shares. Processing a query requires storing its accessed fragments. The objective is to minimize the overall memory consumption of the replication cluster while evenly balancing the load among K = 4 nodes. The right-hand side of the figure illustrates an optimal allocation with a total replication factor W/V = 1.4 and an even workload distribution with 1/4 of the workload share assigned to each node.
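To make the model concrete, the following sketch solves exactly this Figure 1 instance. It is our own illustration of the basic model using the open-source PuLP modeler and its bundled CBC solver; the authors use the Gurobi Optimizer [7], and the formulation in [9] may differ in details:

```python
import pulp

K = 4                                    # number of nodes
queries = {                              # j: (accessed fragments q_j, workload share)
    1: ({1, 2, 3, 4}, 0.10),
    2: ({3, 4, 5, 6}, 0.15),
    3: ({7, 8, 9}, 0.25),
    4: ({8, 9, 10}, 0.20),
    5: ({1, 7, 8, 9, 10}, 0.30),
}
fragments = sorted(set().union(*(f for f, _ in queries.values())))
a = {i: 1 for i in fragments}            # equal fragment sizes, a_i = 1

prob = pulp.LpProblem("fragment_allocation", pulp.LpMinimize)
# x[j][k]: share of query j's load processed on node k (continuous)
x = pulp.LpVariable.dicts("x", (list(queries), range(K)), lowBound=0, upBound=1)
# y[i][k]: fragment i is stored on node k (binary)
y = pulp.LpVariable.dicts("y", (fragments, range(K)), cat="Binary")

# Objective: total memory consumption W over all nodes.
prob += pulp.lpSum(a[i] * y[i][k] for i in fragments for k in range(K))
for j in queries:                        # each query's load is fully assigned
    prob += pulp.lpSum(x[j][k] for k in range(K)) == 1
for k in range(K):                       # even load: every node processes 1/K of the workload
    prob += pulp.lpSum(s * x[j][k] for j, (_, s) in queries.items()) == 1.0 / K
for j, (frags, _) in queries.items():    # executing q_j on node k requires its fragments there
    for i in frags:
        for k in range(K):
            prob += x[j][k] <= y[i][k]

prob.solve(pulp.PULP_CBC_CMD(msg=False))
W = sum(round(pulp.value(y[i][k])) for i in fragments for k in range(K))
print(f"W = {W}, W/V = {W / len(fragments)}")  # expected: 14 and 1.4, matching Figure 1
```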
3. RELATED WORK
Our workload distribution problem (see Section 2) is an allocation problem in the field of distributed database systems. Özsu and Valduriez [13] give an overview of related allocation problems. We summarize their overview as follows: (i) Constraints and optimization goals, e.g., performance, costs, and dependability, differ across allocation problems (see also [4]). (ii) Many formulations of allocation problems are proven to be NP-hard [5]. As a result, heuristics have to be used for large problem instances. (iii) As constraints and optimization goals differ, heuristics are often tied to specific formulations of allocation problems.

Our problem formulation is similar to the one presented by Rabl and Jacobsen [15]. We want to balance the load evenly across nodes to enable linear throughput scaling. Optimal solutions via LP do not scale: e.g., for the 22 TPC-H queries and using vertical partitioning with each of the 61 columns as an individual fragment, we were able to calculate optimal allocations for up to 8 nodes (termination after 8 h using a current laptop) [9].

To address large problem sizes, with thousands of queries, fragments, and dozens of nodes, Rabl and Jacobsen propose a greedy heuristic that assigns queries to nodes one after another (sketched in code at the end of this section). Specifically, they order queries by the product of their workload share and the overall size of the accessed fragments. Queries are then assigned to the node with the largest fragment overlap with already assigned queries. Nodes with no assigned queries are treated as if they had a complete overlap. If the load of the assigned query exceeds the capacity of a node, the query is inserted back into the assignment list with its remaining load.

We implemented Rabl and Jacobsen's algorithm and investigated the steps chosen during allocations for TPC-H and TPC-DS [8]. We made the following observation: Load capacities of nodes are often filled one after another, because nodes with many assigned queries are more likely to have a high fragment overlap: Assigning a query to a node may increase the number of allocated fragments and thus increases the probability that the next query is assigned to the same node, unless the node's load capacity is exhausted.

Our research addresses two shortcomings of Rabl and Jacobsen's approach: (1) When ordering queries and determining overlap, the algorithm does not consider the specific accessed fragments, but only their sizes. (2) When assigning queries to nodes, the remaining queries are not analyzed. In contrast, our decomposition approach (see Section 4) divides the problem into subproblems in a way that preserves the structure of the problem. Specifically, we regard all queries (whose load we try to divide into workload chunks) together with all individual fragments the queries access.

Solutions/allocations of our algorithm can be used for replicated databases, e.g., SAP HANA [12] and Postgres-R [10], and in replication middleware [2, 3, 14]. Further, the calculated workload distributions support caching effects for systems like Amazon Aurora [17], which separate compute from storage.

Using LP to solve allocation problems is flexible compared to rule-based heuristics. We can add or change constraints and modify the objective function to address varied allocation problems, e.g., requiring to store only a certain subset of fragments (instead of all) or demanding a similar memory consumption per node. Including such constraints in a rule-based heuristic is more challenging, because it is more difficult to decide how and in which part to adapt the algorithm without losing sight of the optimization goal.
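For concreteness, the following is a simplified sketch of the greedy procedure as we describe it above. It reflects our reading of [15]; the data layout and function name are ours, and details of the original algorithm (e.g., tie-breaking and capacity handling) may differ:

```python
import heapq

def greedy_allocate(queries, sizes, K):
    """queries: {j: (set_of_fragments, load_share)}; sizes: {i: a_i}; K: node count."""
    capacity = [1.0 / K] * K                # remaining load capacity per node
    placed = [set() for _ in range(K)]      # fragments already allocated per node

    def priority(frags, share):             # order by share * total size (max-heap via negation)
        return -share * sum(sizes[i] for i in frags)

    heap = [(priority(f, s), j, f, s) for j, (f, s) in queries.items()]
    heapq.heapify(heap)
    while heap:
        _, j, frags, share = heapq.heappop(heap)

        def overlap(k):                     # empty nodes count as complete overlap
            return len(frags) if not placed[k] else len(frags & placed[k])

        # Assign to the node with the largest overlap among nodes with capacity left.
        k = max((k for k in range(K) if capacity[k] > 1e-9), key=overlap)
        assigned = min(share, capacity[k])
        placed[k] |= frags
        capacity[k] -= assigned
        if share - assigned > 1e-9:         # re-insert the query with its remaining load
            heapq.heappush(heap, (priority(frags, share - assigned), j, frags, share - assigned))
    return placed                           # fragment allocation per node
```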
4. DECOMPOSITION APPROACH
The complexity of our query-driven workload distribution problem increases with an increasing number of nodes, queries, and fragments. It is challenging to find good heuristics, as minimizing data redundancy and balancing the load on additional replicas are in conflict with each other.

The core idea of our decomposition heuristic is to split the workload iteratively using easier-to-solve but similar subproblems, forming a tree. Specifically, we split the workload into chunks of queries that access similar fragments. Thereby, we reduce the data redundancy in each splitting step. In the following, we describe the approach using an example. In [9], we present the corresponding LP model.

[Figure 2 (diagram omitted): the full workload at the top node (V = 10) is split into two chunks with a 50% workload share each, which are then split into the four leaves of Figure 1 with 25% each.]

Figure 2: Decomposition approach (based on [9]). Iteratively splitting a workload for K = 2 + 2 nodes, leading to an even workload share of 1/4 at all leaves. The total replication factor is W/V = 1.4.

Figure 2 shows a decomposition of a workload, represented by the top node, into K = 2 + 2 = 4 final nodes, represented as leaves. Each parent node can have an arbitrary number of child nodes. The workload share of each child must correspond to the workload share of the leaves in its subtree. In Figure 2, the top node has two children. Both children have two leaves in their subtree. Therefore, their workload share is 2 · 1/4 = 1/2.

Using the decomposition approach, we can split the workload into an arbitrary number of final nodes with equal workload shares. Note that the existence of a solution with equal workload shares is guaranteed, because our model (like Rabl and Jacobsen's) allows splitting query loads arbitrarily, without regard to query frequencies and costs, which are both discrete (see [9, 15]).

With each decomposition, the problem gets easier, as fewer relevant queries and accessed fragments have to be considered. For chunk 1 in Figure 2, the distribution problem shrinks to Q = 3 queries and N = 5 fragments. By choosing the number of child nodes, we can control the problem complexity. For large problem instances, it is advisable to split the top node into a low number of chunks for the lowest computation times. If the problem is still too large, other heuristic approaches, e.g., [15], can be used to split the workload close to the top node. Towards the leaves, our approach can be used.
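Schematically, the decomposition can be driven by a small recursion like the following. This is our own illustration; the subproblem solved at each split is the LP model presented in [9], which we abstract here as a split_workload callback. The parameter leaves_per_child encodes how many leaves each chunk is split into, e.g., [2, 2] for K = 2 + 2 as in Figure 2:

```python
# Our schematic of the decomposition tree; split_workload(queries, targets) stands
# in for solving one subproblem (e.g., via the LP of [9]): it must partition the
# queries' loads into len(targets) chunks with the given relative workload shares.
def decompose(queries, leaves_per_child, split_workload):
    if leaves_per_child == [1]:
        return [queries]                              # reached a final node (leaf)
    total = sum(leaves_per_child)
    targets = [n / total for n in leaves_per_child]   # child share = leaves in subtree / K
    chunks = split_workload(queries, targets)
    leaves = []
    for chunk, n in zip(chunks, leaves_per_child):
        # Recurse: here, every chunk is simply divided evenly into its n leaves.
        leaves.extend(decompose(chunk, [1] * n, split_workload))
    return leaves

# Usage for Figure 2: decompose(workload, [2, 2], lp_split) yields four leaves,
# each carrying 1/4 of the workload share.
```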
4.1 Application to a Real-World Workload
In this section, we demonstrate our heuristic using a real-world workload. We analyzed queries against an accounting table, comprising N = 344 columns/fragments. We extracted Q = 4461 query classes, whose query costs are distributed exponentially. In particular, 38 queries account for more than 95% of the workload share.

We compare our solution to the greedy heuristic by Rabl and Jacobsen [15]. Table 1 summarizes the results for 3 ≤ K ≤ 6. Our solution reduces the replication factor in comparison to the greedy heuristic by 32-42%.

Table 1: Performance comparison: data replication factors of the rule-based heuristic by [15] (W^S) vs. our heuristic (W), see [9], with one decomposition level of B chunks and associated n_b nodes, b = 1, ..., B, summing up to K.

  K | chunks B | nodes n_b | W/V  | solve time | W/W^S
  3 |    2     |   2+1     | 1.81 |   384 s    |  -32%
  4 |    2     |   2+2     | 2.13 |   225 s    |  -42%
  5 |    2     |   3+2     | 2.60 |   18 min   |  -33%
  5 |    3     |   2+2+1   | 2.50 |   99 min   |  -36%
  6 |    2     |   3+3     | 2.86 |   13 min   |  -37%

We used the Gurobi Optimizer [7] with a single thread. The computation times were between 2 min and 99 min, while focusing on reducing the replication factor. Different chunkings, e.g., 3 + 2 and 2 + 2 + 1, trade computation time for memory consumption. To reduce the computation time, we can also cluster fragments or queries, e.g., grouping the 4423 queries with the lowest costs as a single query class with an aggregated workload share lower than 5%.

5. CURRENT AND FUTURE WORK
Numerical experiments show that our heuristic calculates allocations with lower memory consumption than the heuristic by Rabl and Jacobsen [15] for TPC-H [9] and a real-world workload (cf. Section 4.1). Currently, we conduct end-to-end evaluations by deploying the calculated fragment allocations in a PostgreSQL cluster and measuring the query throughput. In Section 5.1, we describe our preliminary insights.

In addition, we plan to investigate the impact of different database fragmentations and query clusterings on the allocation, especially in the case of skewed model input (cf. Section 4.1). When running workloads in practice, fragment allocations and workload distributions must consider further factors, e.g., data modification costs, robustness against failures, or changing model inputs, that are not included in the basic allocation problem (cf. Section 2). We describe our plans to address these factors in the following sections.

5.1 End-to-End Evaluations
In practice, workloads consist of a number of active database connections sending a stream of queries. We model a workload as a set of queries in a potentially long period of time. The performance of an allocation in practice depends on the query timing, especially on whether all (or at least many) nodes are used for processing while there are pending queries. This can be supported by query scheduling: in case an incoming query is executable on multiple nodes, we send it to the node with the lowest load (see the sketch below). Further, we can calculate allocations that trade memory consumption (allocating additional fragments to nodes) for flexibility (an increased share of query loads that can be processed by multiple replicas). Flexibility is also beneficial for handling imprecise query costs, e.g., caused by concurrency effects at runtime. Currently, we use average query execution times as the query cost metric, because they are easy to obtain and widely applicable.
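As a minimal illustration of this scheduling rule (our sketch, not the implementation of our PostgreSQL-based evaluation setup):

```python
def route_query(accessed_fragments, node_fragments, node_load):
    """Route a query to the least-loaded node that stores all fragments it accesses.

    node_fragments: fragment set per node; node_load: current load per node.
    """
    candidates = [k for k, stored in enumerate(node_fragments)
                  if accessed_fragments <= stored]  # nodes able to execute the query
    if not candidates:
        raise ValueError("no node stores all fragments accessed by the query")
    return min(candidates, key=lambda k: node_load[k])
```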
5.2 Data Modifications
As a result of inserts, updates, or deletes, data may change over time. We can adapt our basic model to include data modification costs. We are able to model update costs either as execution costs similar to the costs of read queries (see also [15]), or as fragment modification costs depending on the stored fragments per node, both in a linear way. Integrating modification costs into the decomposition heuristic is also possible. When splitting the workload into leaf nodes, we can include update costs precisely for node allocations. In upper levels, we may have to approximate update costs, because the exact number of leaf nodes for which update costs occur may be unknown.

5.3 Node Failures
Besides scalability, data replication enables high availability. A database cluster can thereby support different levels of robustness in case of node failures. Basic robust allocations ensure that each fragment is stored on multiple nodes, or that queries can be processed on multiple nodes [15]; one possible encoding of both notions is sketched below. However, the workload distribution after node failures can be highly skewed. In future work, we will investigate allocations that allow an evenly balanced workload, even in the case of potential node failures.
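To illustrate how such robustness requirements fit our LP-based approach, both basic notions from [15] translate into linear constraints over the variables of Section 2. This is our sketch, not necessarily the constraints we will finally use:

```latex
% Our illustration, reusing y_{ik} (fragment i stored on node k) from Section 2.
% (a) Each fragment is stored on at least r nodes (r = 2 tolerates one failure):
\sum_{k=1}^{K} y_{ik} \ge r, \qquad i = 1, \dots, N.
% (b) Each query is executable on at least r nodes, with binary indicators z_{jk}
%     forced to 0 unless node k stores all fragments of query j:
z_{jk} \le y_{ik} \;\; (\forall i \in q_j), \qquad
\sum_{k=1}^{K} z_{jk} \ge r, \qquad j = 1, \dots, Q.
```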
5.4 Reallocation Costs
Workloads (including query costs) or fragment sizes may change. As a result, a current data allocation may not allow an even workload distribution anymore, or a different allocation may reduce the memory consumption.

If workload changes are known in advance, Rabl and Jacobsen propose to calculate allocations for each scenario and merge all of the allocations into a combined allocation that is robust with regard to the workload changes [15]. An alternative approach may reallocate fragments to match the new workload. To avoid costly reallocations, we currently investigate how to take the current allocation into account in the algorithm. To avoid frequent reallocations, allocations should be robust with regard to minor workload changes (see Section 5.1).

5.5 Online Approach
As workloads and underlying data, and thus model inputs, change, we want to adapt allocations over time, ideally without downtime or performance degradation. The previous sections described building blocks to implement a fault-tolerant and adaptive replication cluster. We can monitor the workload to detect when we have to adapt a current allocation. At that time, we can calculate a new allocation, which is better suited to the current workload and which can be created with reasonable reallocation costs.

6. CONCLUSION
We presented the current status of and plans to extend our work in the field of query-driven workload distribution and fragment allocation. While balancing the query load evenly, our decomposition approach calculates cluster configurations with a lower memory footprint than state-of-the-art heuristics. We believe that our approach allows integrating further factors, such as data modification costs, robustness against node failures, and economical reallocations. Using LP, we can express these factors as constraints without changing the structure of the algorithm. In future work, we plan to not only extend our LP approach, but also demonstrate the extensions in end-to-end evaluations.

7. REFERENCES
[1] M. Boissier. Optimizing main memory utilization of columnar in-memory databases using data eviction. In PhD@VLDB, pages 1-6, 2014.
[2] E. Cecchet, G. Candea, and A. Ailamaki. Middleware-based database replication: The gaps between theory and practice. In SIGMOD, pages 739-752, 2008.
[3] E. Cecchet, J. Marguerite, and W. Zwaenepoel. C-JDBC: Flexible database clustering middleware. In FREENIX@USENIX, pages 9-18, 2004.
[4] L. W. Dowdy and D. V. Foster. Comparative models of the file assignment problem. ACM Comput. Surv., 14(2):287-313, 1982.
[5] K. P. Eswaran. Placement of records in a file and file allocation in a computer. In IFIP, pages 304-307, 1974.
[6] J. Gray, P. Helland, P. E. O'Neil, and D. Shasha. The dangers of replication and a solution. In SIGMOD, pages 173-182, 1996.
[7] Gurobi Optimization. https://www.gurobi.com.
[8] S. Halfpap and R. Schlosser. A comparison of allocation algorithms for partially replicated databases. In ICDE, pages 2008-2011, 2019.
[9] S. Halfpap and R. Schlosser. Workload-driven fragment allocation for partially replicated databases using linear programming. In ICDE, pages 1746-1749, 2019.
[10] B. Kemme and G. Alonso. Don't be lazy, be consistent: Postgres-R, a new way to implement database replication. In VLDB, pages 134-143, 2000.
[11] J. Krüger, C. Kim, M. Grund, N. Satish, D. Schwalb, J. Chhugani, H. Plattner, P. Dubey, and A. Zeier. Fast updates on read-optimized databases using multi-core CPUs. PVLDB, 5(1):61-72, 2011.
[12] J. Lee, S. Moon, K. H. Kim, D. H. Kim, S. K. Cha, W. Han, C. G. Park, H. J. Na, and J. Lee. Parallel replication across formats in SAP HANA for scaling out mixed OLTP/OLAP workloads. PVLDB, 10(12):1598-1609, 2017.
[13] M. T. Özsu and P. Valduriez. Principles of Distributed Database Systems, Third Edition. Springer, 2011.
[14] M. Patiño-Martínez, R. Jiménez-Peris, B. Kemme, and G. Alonso. MIDDLE-R: Consistent database replication at the middleware level. ACM Trans. Comput. Syst., 23(4):375-423, 2005.
[15] T. Rabl and H. Jacobsen. Query centric partitioning and allocation for partially replicated database systems. In SIGMOD, pages 315-330, 2017.
[16] D. Schwalb, J. Kossmann, M. Faust, S. Klauck, M. Uflacker, and H. Plattner. Hyrise-R: Scale-out and hot-standby through lazy master replication for enterprise applications. In IMDM@VLDB, pages 7:1-7:7, 2015.
[17] A. Verbitski, A. Gupta, D. Saha, M. Brahmadesam, K. Gupta, R. Mittal, S. Krishnamurthy, S. Maurice, T. Kharatishvili, and X. Bao. Amazon Aurora: Design considerations for high throughput cloud-native relational databases. In SIGMOD, pages 1041-1052, 2017.