Efficient Scale-Out Using Query-Driven Workload Distribution and Fragment Allocation

Stefan Halfpap
Supervised by Prof. Hasso Plattner
Hasso Plattner Institute, University of Potsdam, Germany
stefan.halfpap@hpi.de

Proceedings of the VLDB 2019 PhD Workshop, August 26th, 2019, Los Angeles, California. Copyright (C) 2019 for this paper by its authors. Copying permitted for private and academic purposes.

ABSTRACT
Database replication is an approach for scaling throughput and ensuring high availability. Using workload knowledge, we are able to load-balance queries to replica nodes according to the data being accessed. However, balancing the load evenly while maximizing data reuse is a challenging allocation problem. To address large problem sizes, we developed a novel decomposition-based heuristic using linear programming. We compare our approach with a rule-based state-of-the-art allocation algorithm using a real-world workload comprising thousands of queries. Further, we outline how we plan to extend our approach to versatile allocation problems, e.g., considering changing workloads and robustness against node failures.

1. INTRODUCTION
Partitioning and replication are means to allow databases to process increasing workloads. Analyses of real workloads show that read-only queries account for the largest workload share [1, 11]. Scaling read-only queries is relatively simple, as we can execute them on read-only replicas without violating transactional consistency [6, 16].

Using a naive load-balancing approach, one can distribute queries independently of the accessed data. As a result, one has to store all data and apply all data modifications on all nodes. Further, queries are unlikely to profit from caching effects, because similar queries are not guaranteed to be assigned to the same replica.

In our research, we investigate query-driven workload distributions, i.e., load-balancing queries based on their accessed data. In particular, we reduce the amount of required memory on all nodes while balancing the load evenly. Query-driven workload distributions are also beneficial for maximizing caching effects. Further, when adding new nodes to a database cluster, we are able to decide which data to load first to quickly process a large share of the workload.

In practice, there are varying constraints and goals for distributing the workload and/or the required data on nodes. By using linear programming (LP), we are able to address versatile allocation problems, which include robustness against potential node failures or efficient reallocations to react to changing workloads.

The remainder of this paper is structured as follows: We introduce the basic allocation problem in Section 2. In Section 3, we discuss related work. In Section 4, we summarize our scalable decomposition-based approach [9] to calculate workload distributions for large problem sizes. Further, we demonstrate it using a real workload. Our current research and plans for future work are described in Section 5. Section 6 concludes the paper.

2. PROBLEM DESCRIPTION
The allocation problem being examined is a coupled workload distribution and data placement problem. We assume a horizontally and/or vertically partitioned database with N disjoint data fragments. Each fragment i has a size a_i, i = 1, ..., N. Note that we assume a separation of the data partitioning and the allocation process, which is common [13]. In this way, existing workload-aware data partitioning algorithms can be used to generate the input fragments for our allocation approach.

Further, we assume a workload consisting of Q queries (query classes). Each query class j is defined by the subset of fragments q_j ⊆ {1, ..., N}, j = 1, ..., Q, it accesses. Queries account for workload shares, defined by query frequencies f_j and query costs c_j, j = 1, ..., Q.

Last, we assume a number of nodes K across which to load-balance the workload evenly. In practice, the number of nodes K can be chosen manually by a database administrator or automatically by a replication framework with regard to the desired query throughput.

We want to decide which query should be executed to which extent on which node in order to minimize the overall memory consumption of all nodes. Processing query j on node k requires storing all fragments q_j on the node. In [9], we derive an LP model to calculate optimal solutions.
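The exact model is given in [9]. As a rough illustration, a mixed-integer program of the following shape is consistent with the definitions above; the variable names are ours and the formulation in [9] may differ in details. Let x_{jk} denote the share of query j's load processed on node k, and let the binary variable y_{ik} indicate whether fragment i is stored on node k:

```latex
% Illustrative model shape (our notation; the formulation in [9] may differ).
\begin{align*}
\min \;& \sum_{k=1}^{K} \sum_{i=1}^{N} a_i \, y_{ik}
  && \text{(total memory consumption } W \text{)}\\
\text{s.t.} \;& \sum_{k=1}^{K} x_{jk} = 1,
  && j = 1, \dots, Q \quad \text{(each query fully assigned)}\\
& \sum_{j=1}^{Q} f_j c_j \, x_{jk} = \frac{1}{K} \sum_{j=1}^{Q} f_j c_j,
  && k = 1, \dots, K \quad \text{(even load)}\\
& x_{jk} \le y_{ik},
  && i \in q_j, \;\; \forall j, k \quad \text{(required fragments are stored)}\\
& x_{jk} \ge 0, \quad y_{ik} \in \{0, 1\}.
\end{align*}
```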
Figure 1 shows an example workload and an optimal solution, i.e., an even workload distribution for Q = 5 queries, accessing N = 10 fragments, that minimizes the overall memory consumption for a database cluster with K = 4 nodes. (We assume an equal size for each fragment, i.e., a_i = 1, i = 1, ..., N.)

By W/V, we denote the replication factor of an allocation, where the total amount of data used W is normalized by the minimal amount of used data V. Because each of the ten fragments is accessed by at least one query, the minimal amount of used data is equal to the number of fragments, V = N = 10. The replication factor of our exemplary allocation for four nodes is W/V = 1.4.

[Figure 1 (diagram omitted): the five queries with their workload shares, the database with V = 10 fragments, and the resulting four-node allocation.]

Figure 1: Workload-driven fragment allocation (based on [8]). The left-hand side of the figure visualizes the model input. The database consists of N = 10 fragments. Q = 5 queries correspond to different workload shares. Processing a query requires storing its accessed fragments. The objective is to minimize the overall memory consumption of the replication cluster while evenly balancing the load among K = 4 nodes. The right-hand side of the figure illustrates an optimal allocation with a total replication factor W/V = 1.4 and an even workload distribution with 1/4 of the workload share assigned to each node.
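To make the model concrete, the following sketch solves exactly this Figure 1 instance. It is our own illustration of the basic model using the open-source PuLP modeler and its bundled CBC solver; the authors use the Gurobi Optimizer [7], and the formulation in [9] may differ in details:

```python
import pulp

K = 4                                    # number of nodes
queries = {                              # j: (accessed fragments q_j, workload share)
    1: ({1, 2, 3, 4}, 0.10),
    2: ({3, 4, 5, 6}, 0.15),
    3: ({7, 8, 9}, 0.25),
    4: ({8, 9, 10}, 0.20),
    5: ({1, 7, 8, 9, 10}, 0.30),
}
fragments = sorted(set().union(*(f for f, _ in queries.values())))
a = {i: 1 for i in fragments}            # equal fragment sizes, a_i = 1

prob = pulp.LpProblem("fragment_allocation", pulp.LpMinimize)
# x[j][k]: share of query j's load processed on node k (continuous)
x = pulp.LpVariable.dicts("x", (list(queries), range(K)), lowBound=0, upBound=1)
# y[i][k]: fragment i is stored on node k (binary)
y = pulp.LpVariable.dicts("y", (fragments, range(K)), cat="Binary")

# Objective: total memory consumption W over all nodes.
prob += pulp.lpSum(a[i] * y[i][k] for i in fragments for k in range(K))
for j in queries:                        # each query's load is fully assigned
    prob += pulp.lpSum(x[j][k] for k in range(K)) == 1
for k in range(K):                       # even load: every node processes 1/K of the workload
    prob += pulp.lpSum(s * x[j][k] for j, (_, s) in queries.items()) == 1.0 / K
for j, (frags, _) in queries.items():    # executing q_j on node k requires its fragments there
    for i in frags:
        for k in range(K):
            prob += x[j][k] <= y[i][k]

prob.solve(pulp.PULP_CBC_CMD(msg=False))
W = sum(round(pulp.value(y[i][k])) for i in fragments for k in range(K))
print(f"W = {W}, W/V = {W / len(fragments)}")  # expected: 14 and 1.4, matching Figure 1
```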
3. RELATED WORK
Our workload distribution problem (see Section 2) is an allocation problem in the field of distributed database systems. Özsu and Valduriez [13] give an overview of related allocation problems. We summarize their overview as follows: (i) Constraints and optimization goals, e.g., performance, costs, and dependability, differ across allocation problems (see also [4]). (ii) Many formulations of allocation problems are proven to be NP-hard [5]. As a result, heuristics have to be used for large problem instances. (iii) As constraints and optimization goals differ, heuristics are often tied to specific formulations of allocation problems.

Our problem formulation is similar to the one presented by Rabl and Jacobsen [15]. We want to balance the load evenly across nodes to enable linear throughput scaling. Optimal solutions via LP do not scale: e.g., for the 22 TPC-H queries and using vertical partitioning with each of the 61 columns as an individual fragment, we were able to calculate optimal allocations for up to 8 nodes (termination after 8 h using a current laptop) [9].

To address large problem sizes, with thousands of queries, fragments, and dozens of nodes, Rabl and Jacobsen propose a greedy heuristic that assigns queries to nodes one after another (sketched in code at the end of this section). Specifically, they order queries by the product of their workload share and the overall size of the accessed fragments. Queries are then assigned to the node with the largest fragment overlap with already assigned queries. Nodes with no assigned queries are treated as if they had a complete overlap. If the load of the assigned query exceeds the capacity of a node, the query is inserted back into the assignment list with its remaining load.

We implemented Rabl and Jacobsen's algorithm and investigated the steps chosen during allocations for TPC-H and TPC-DS [8]. We made the following observation: Load capacities of nodes are often filled one after another, because nodes with many assigned queries are more likely to have a high fragment overlap: Assigning a query to a node may increase the number of allocated fragments and thus increases the probability that the next query is assigned to the same node, unless the node's load capacity is exhausted.

Our research addresses two shortcomings of Rabl and Jacobsen's approach: (1) When ordering queries and determining overlap, the algorithm does not consider the specific accessed fragments, but only their sizes. (2) When assigning queries to nodes, the remaining queries are not analyzed. In contrast, our decomposition approach (see Section 4) divides the problem into subproblems in a way that preserves the structure of the problem. Specifically, we regard all queries (whose load we try to divide into workload chunks) together with all individual fragments the queries access.

Solutions/allocations of our algorithm can be used for replicated databases, e.g., SAP HANA [12] and Postgres-R [10], and in replication middleware [2, 3, 14]. Further, the calculated workload distributions support caching effects for systems like Amazon Aurora [17], which separate compute from storage.

Using LP to solve allocation problems is flexible compared to rule-based heuristics. We can add or change constraints and modify the objective function to address varied allocation problems, e.g., requiring to store only a certain subset of fragments (instead of all) or demanding a similar memory consumption per node. Including such constraints in a rule-based heuristic is more challenging, because it is more difficult to decide how and in which part to adapt the algorithm without losing sight of the optimization goal.
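For concreteness, the following is a simplified sketch of the greedy procedure as we describe it above. It reflects our reading of [15]; the data layout and function name are ours, and details of the original algorithm (e.g., tie-breaking and capacity handling) may differ:

```python
import heapq

def greedy_allocate(queries, sizes, K):
    """queries: {j: (set_of_fragments, load_share)}; sizes: {i: a_i}; K: node count."""
    capacity = [1.0 / K] * K                # remaining load capacity per node
    placed = [set() for _ in range(K)]      # fragments already allocated per node

    def priority(frags, share):             # order by share * total size (max-heap via negation)
        return -share * sum(sizes[i] for i in frags)

    heap = [(priority(f, s), j, f, s) for j, (f, s) in queries.items()]
    heapq.heapify(heap)
    while heap:
        _, j, frags, share = heapq.heappop(heap)

        def overlap(k):                     # empty nodes count as complete overlap
            return len(frags) if not placed[k] else len(frags & placed[k])

        # Assign to the node with the largest overlap among nodes with capacity left.
        k = max((k for k in range(K) if capacity[k] > 1e-9), key=overlap)
        assigned = min(share, capacity[k])
        placed[k] |= frags
        capacity[k] -= assigned
        if share - assigned > 1e-9:         # re-insert the query with its remaining load
            heapq.heappush(heap, (priority(frags, share - assigned), j, frags, share - assigned))
    return placed                           # fragment allocation per node
```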
4. DECOMPOSITION APPROACH
The complexity of our query-driven workload distribution problem increases with an increasing number of nodes, queries, and fragments. It is challenging to find good heuristics, as minimizing data redundancy and balancing the load on additional replicas are in conflict with each other.

The core idea of our decomposition heuristic is to split the workload iteratively using easier-to-solve but similar subproblems, forming a tree. Specifically, we split the workload into chunks of queries that access similar fragments. Thereby, we reduce the data redundancy in each splitting step. In the following, we describe the approach using an example. In [9], we present the corresponding LP model.

[Figure 2 (diagram omitted): the full workload at the top node (V = 10) is split into two chunks with a 50% workload share each, which are then split into the four leaves of Figure 1 with 25% each.]

Figure 2: Decomposition approach (based on [9]). Iteratively splitting a workload for K = 2 + 2 nodes, leading to an even workload share of 1/4 at all leaves. The total replication factor is W/V = 1.4.

Figure 2 shows a decomposition of a workload, represented by the top node, into K = 2 + 2 = 4 final nodes, represented as leaves. Each parent node can have an arbitrary number of child nodes. The workload share of each child must correspond to the workload share of the leaves in its subtree. In Figure 2, the top node has two children. Both children have two leaves in their subtree. Therefore, their workload share is 2 · 1/4 = 1/2.

Using the decomposition approach, we can split the workload into an arbitrary number of final nodes with equal workload shares. Note that the existence of a solution with equal workload shares is guaranteed, because our model (like Rabl and Jacobsen's) allows splitting query loads arbitrarily, without regard to query frequencies and costs, which are both discrete (see [9, 15]).

With each decomposition, the problem gets easier, as fewer relevant queries and accessed fragments have to be considered. For chunk 1 in Figure 2, the distribution problem shrinks to Q = 3 queries and N = 5 fragments. By choosing the number of child nodes, we can control the problem complexity. For large problem instances, it is advisable to split the top node into a low number of chunks for the lowest computation times. If the problem is still too large, other heuristic approaches, e.g., [15], can be used to split the workload close to the top node. Towards the leaves, our approach can be used.
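Schematically, the decomposition can be driven by a small recursion like the following. This is our own illustration; the subproblem solved at each split is the LP model presented in [9], which we abstract here as a split_workload callback. The parameter leaves_per_child encodes how many leaves each chunk is split into, e.g., [2, 2] for K = 2 + 2 as in Figure 2:

```python
# Our schematic of the decomposition tree; split_workload(queries, targets) stands
# in for solving one subproblem (e.g., via the LP of [9]): it must partition the
# queries' loads into len(targets) chunks with the given relative workload shares.
def decompose(queries, leaves_per_child, split_workload):
    if leaves_per_child == [1]:
        return [queries]                              # reached a final node (leaf)
    total = sum(leaves_per_child)
    targets = [n / total for n in leaves_per_child]   # child share = leaves in subtree / K
    chunks = split_workload(queries, targets)
    leaves = []
    for chunk, n in zip(chunks, leaves_per_child):
        # Recurse: here, every chunk is simply divided evenly into its n leaves.
        leaves.extend(decompose(chunk, [1] * n, split_workload))
    return leaves

# Usage for Figure 2: decompose(workload, [2, 2], lp_split) yields four leaves,
# each carrying 1/4 of the workload share.
```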
4.1 Application to a Real-World Workload
In this section, we demonstrate our heuristic using a real-world workload. We analyzed queries against an accounting table, comprising N = 344 columns/fragments. We extracted Q = 4461 query classes, whose query costs are distributed exponentially. In particular, 38 queries account for more than 95% of the workload share.

We compare our solution to the greedy heuristic by Rabl and Jacobsen [15]. Table 1 summarizes the results for 3 ≤ K ≤ 6. Our solution reduces the replication factor in comparison to the greedy heuristic by 32-42%.

Table 1: Performance comparison: data replication factors of the rule-based heuristic by [15] (W^S) vs. our heuristic (W), see [9], with one decomposition level of B chunks and associated n_b nodes, b = 1, ..., B, summing up to K.

  K | chunks B | nodes n_b | W/V  | solve time | W/W^S
  3 |    2     |   2+1     | 1.81 |   384 s    |  -32%
  4 |    2     |   2+2     | 2.13 |   225 s    |  -42%
  5 |    2     |   3+2     | 2.60 |   18 min   |  -33%
  5 |    3     |   2+2+1   | 2.50 |   99 min   |  -36%
  6 |    2     |   3+3     | 2.86 |   13 min   |  -37%

We used the Gurobi Optimizer [7] with a single thread. The computation times were between 2 min and 99 min, while focusing on reducing the replication factor. Different chunkings, e.g., 3 + 2 and 2 + 2 + 1, trade computation time for memory consumption. To reduce the computation time, we can also cluster fragments or queries, e.g., grouping the 4423 queries with the lowest costs as a single query class with an aggregated workload share lower than 5%.

5. CURRENT AND FUTURE WORK
Numerical experiments show that our heuristic calculates allocations with lower memory consumption than the heuristic by Rabl and Jacobsen [15] for TPC-H [9] and a real-world workload (cf. Section 4.1). Currently, we conduct end-to-end evaluations by deploying the calculated fragment allocations in a PostgreSQL cluster and measuring the query throughput. In Section 5.1, we describe our preliminary insights.

In addition, we plan to investigate the impact of different database fragmentations and query clusterings on the allocation, especially in the case of skewed model input (cf. Section 4.1). When running workloads in practice, fragment allocations and workload distributions must consider further factors, e.g., data modification costs, robustness against failures, or changing model inputs, that are not included in the basic allocation problem (cf. Section 2). We describe our plans to address these factors in the following sections.

5.1 End-to-End Evaluations
In practice, workloads consist of a number of active database connections sending a stream of queries. We model a workload as a set of queries in a potentially long period of time. The performance of an allocation in practice depends on the query timing, especially on whether all (or at least many) nodes are used for processing while there are pending queries. This can be supported by query scheduling: in case an incoming query is executable on multiple nodes, we send it to the node with the lowest load (see the sketch below). Further, we can calculate allocations that trade memory consumption (allocating additional fragments to nodes) for flexibility (an increased share of query loads that can be processed by multiple replicas). Flexibility is also beneficial for handling imprecise query costs, e.g., caused by concurrency effects at runtime. Currently, we use average query execution times as the query cost metric, because they are easy to obtain and widely applicable.
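As a minimal illustration of this scheduling rule (our sketch, not the implementation of our PostgreSQL-based evaluation setup):

```python
def route_query(accessed_fragments, node_fragments, node_load):
    """Route a query to the least-loaded node that stores all fragments it accesses.

    node_fragments: fragment set per node; node_load: current load per node.
    """
    candidates = [k for k, stored in enumerate(node_fragments)
                  if accessed_fragments <= stored]  # nodes able to execute the query
    if not candidates:
        raise ValueError("no node stores all fragments accessed by the query")
    return min(candidates, key=lambda k: node_load[k])
```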
5.2 Data Modifications
As a result of inserts, updates, or deletes, data may change over time. We can adapt our basic model to include data modification costs. We are able to model update costs either as execution costs similar to the costs of read queries (see also [15]), or as fragment modification costs depending on the stored fragments per node, both in a linear way. Integrating modification costs into the decomposition heuristic is also possible. When splitting the workload into leaf nodes, we can include update costs precisely for node allocations. In upper levels, we may have to approximate update costs, because the exact number of leaf nodes for which update costs occur may be unknown.

5.3 Node Failures
Besides scalability, data replication enables high availability. A database cluster can thereby support different levels of robustness in case of node failures. Basic robust allocations ensure that each fragment is stored on multiple nodes, or that queries can be processed on multiple nodes [15]; one possible encoding of both notions is sketched below. However, the workload distribution after node failures can be highly skewed. In future work, we will investigate allocations that allow an evenly balanced workload, even in the case of potential node failures.
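To illustrate how such robustness requirements fit our LP-based approach, both basic notions from [15] translate into linear constraints over the variables of Section 2. This is our sketch, not necessarily the constraints we will finally use:

```latex
% Our illustration, reusing y_{ik} (fragment i stored on node k) from Section 2.
% (a) Each fragment is stored on at least r nodes (r = 2 tolerates one failure):
\sum_{k=1}^{K} y_{ik} \ge r, \qquad i = 1, \dots, N.
% (b) Each query is executable on at least r nodes, with binary indicators z_{jk}
%     forced to 0 unless node k stores all fragments of query j:
z_{jk} \le y_{ik} \;\; (\forall i \in q_j), \qquad
\sum_{k=1}^{K} z_{jk} \ge r, \qquad j = 1, \dots, Q.
```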
5.4 Reallocation Costs
Workloads (including query costs) or fragment sizes may change. As a result, a current data allocation may not allow an even workload distribution anymore, or a different allocation may reduce the memory consumption.

If workload changes are known in advance, Rabl and Jacobsen propose to calculate allocations for each scenario and merge all of the allocations into a combined allocation that is robust with regard to the workload changes [15]. An alternative approach may reallocate fragments to match the new workload. To avoid costly reallocations, we currently investigate how to take the current allocation into account in the algorithm. To avoid frequent reallocations, allocations should be robust with regard to minor workload changes (see Section 5.1).

5.5 Online Approach
As workloads and underlying data, and thus model inputs, change, we want to adapt allocations over time, ideally without downtime or performance degradation. The previous sections described building blocks to implement a fault-tolerant and adaptive replication cluster. We can monitor the workload to detect when we have to adapt a current allocation. At that time, we can calculate a new allocation, which is better suited to the current workload and which can be created with reasonable reallocation costs.

6. CONCLUSION
We presented the current status of and plans to extend our work in the field of query-driven workload distribution and fragment allocation. While balancing the query load evenly, our decomposition approach calculates cluster configurations with a lower memory footprint than state-of-the-art heuristics. We believe that our approach allows integrating further factors, such as data modification costs, robustness against node failures, and economical reallocations. Using LP, we can express these factors as constraints without changing the structure of the algorithm. In future work, we plan to not only extend our LP approach, but also demonstrate the extensions in end-to-end evaluations.

7. REFERENCES
[1] M. Boissier. Optimizing main memory utilization of columnar in-memory databases using data eviction. In PhD@VLDB, pages 1-6, 2014.
[2] E. Cecchet, G. Candea, and A. Ailamaki. Middleware-based database replication: The gaps between theory and practice. In SIGMOD, pages 739-752, 2008.
[3] E. Cecchet, J. Marguerite, and W. Zwaenepoel. C-JDBC: Flexible database clustering middleware. In FREENIX@USENIX, pages 9-18, 2004.
[4] L. W. Dowdy and D. V. Foster. Comparative models of the file assignment problem. ACM Comput. Surv., 14(2):287-313, 1982.
[5] K. P. Eswaran. Placement of records in a file and file allocation in a computer. In IFIP, pages 304-307, 1974.
[6] J. Gray, P. Helland, P. E. O'Neil, and D. Shasha. The dangers of replication and a solution. In SIGMOD, pages 173-182, 1996.
[7] Gurobi Optimization. https://www.gurobi.com.
[8] S. Halfpap and R. Schlosser. A comparison of allocation algorithms for partially replicated databases. In ICDE, pages 2008-2011, 2019.
[9] S. Halfpap and R. Schlosser. Workload-driven fragment allocation for partially replicated databases using linear programming. In ICDE, pages 1746-1749, 2019.
[10] B. Kemme and G. Alonso. Don't be lazy, be consistent: Postgres-R, a new way to implement database replication. In VLDB, pages 134-143, 2000.
[11] J. Krüger, C. Kim, M. Grund, N. Satish, D. Schwalb, J. Chhugani, H. Plattner, P. Dubey, and A. Zeier. Fast updates on read-optimized databases using multi-core CPUs. PVLDB, 5(1):61-72, 2011.
[12] J. Lee, S. Moon, K. H. Kim, D. H. Kim, S. K. Cha, W. Han, C. G. Park, H. J. Na, and J. Lee. Parallel replication across formats in SAP HANA for scaling out mixed OLTP/OLAP workloads. PVLDB, 10(12):1598-1609, 2017.
[13] M. T. Özsu and P. Valduriez. Principles of Distributed Database Systems, Third Edition. Springer, 2011.
[14] M. Patiño-Martínez, R. Jiménez-Peris, B. Kemme, and G. Alonso. MIDDLE-R: Consistent database replication at the middleware level. ACM Trans. Comput. Syst., 23(4):375-423, 2005.
[15] T. Rabl and H. Jacobsen. Query centric partitioning and allocation for partially replicated database systems. In SIGMOD, pages 315-330, 2017.
[16] D. Schwalb, J. Kossmann, M. Faust, S. Klauck, M. Uflacker, and H. Plattner. Hyrise-R: Scale-out and hot-standby through lazy master replication for enterprise applications. In IMDM@VLDB, pages 7:1-7:7, 2015.
[17] A. Verbitski, A. Gupta, D. Saha, M. Brahmadesam, K. Gupta, R. Mittal, S. Krishnamurthy, S. Maurice, T. Kharatishvili, and X. Bao. Amazon Aurora: Design considerations for high throughput cloud-native relational databases. In SIGMOD, pages 1041-1052, 2017.