Optimization of Incremental Queries in the Cloud

József Makai, Gábor Szárnyas, Ákos Horváth, István Ráth, Dániel Varró
Fault Tolerant Systems Research Group
Department of Measurement and Information Systems
Budapest University of Technology and Economics
H-1117, Magyar Tudósok krt. 2., Budapest, Hungary
jozsef.makai@inf.mit.bme.hu, {szarnyas,ahorvath,rath,varro}@mit.bme.hu

Abstract—Database and model queries are the foundations of data-driven applications. Their performance is of primary importance in model-driven software engineering (MDE), especially with the ever-growing complexity of software modeling projects. To address the scalability issues of traditional MDE tools, distributed frameworks are being developed – however, query optimization in a distributed context brings a whole new set of challenges, including capacity limits of individual nodes and network communication. In this paper, we aim to address the allocation optimization challenge in the context of distributed incremental query evaluation. Our methods are based on a combination of heuristics-based resource consumption estimations and constraint satisfaction programming. We evaluate the impact of the optimization techniques by conducting benchmark measurements.

I. INTRODUCTION

Model-driven software engineering (MDE) plays an important role in the development processes of critical embedded systems. With the dramatic increase in complexity that is also affecting critical embedded systems in recent years, modeling toolchains are facing scalability challenges as the size of design models constantly increases and automated tool features become more sophisticated.

Many scalability issues can be addressed by improving query performance. Incremental model queries aim to reduce query response time by limiting the impact of model modifications on query result calculation. This technique has been proven to improve performance dramatically in several cases (e.g. on-the-fly well-formedness validation or model synchronization), at the cost of increased memory consumption. IncQuery-D [1] aims to address the memory consumption problem by offloading memory-intensive components to a distributed architecture. IncQuery-D is a system based on a distributed Rete network [2] that can scale up to handle very large models (typically stored in distributed NoSQL or graph databases) and complex queries efficiently.

However, the introduction of a cloud and grid-like architecture brings a whole new set of query optimization challenges. In this paper, we focus on the (static) allocation optimization problem, which in our context means the assignment of memory-intensive distributed components to computation resources according to a strategy that is optimal in terms of overall cost, or favourable in terms of query evaluation performance. We propose to use combinatorial optimization methods based on constraint satisfaction programming, aided by heuristics to estimate i) the resource usage of individual system components and ii) the network traffic between distributed computation nodes. Our aim is to provide a fully automated allocation optimization mechanism for IncQuery-D.

This work was partially supported by the MONDO (EU ICT-611125) project and the Lendület programme (MTA-BME Lendület Cyber-Physical Systems Research Group).

II. BACKGROUND

A. The Allocation Problem

In the context of distributed systems, allocation means the assignment of computation resources to computation tasks. The input of the allocation consists of the computation tasks and the edges between them (which represent data-flow between two tasks), the available machines with their capacities, and possibly the quality of the network links between machines. Computation tasks have to be assigned to processes first, as we run processes (Java Virtual Machines, specifically for IncQuery-D) on machines, and the resource consumption of a process consists of the collective resource consumption of the tasks assigned to this particular process. The output of the allocation is a possible mapping of processes to machines.

1) Allocation constraints: Allocation constraints are derived from the observation that certain subsets of computation tasks use so much resources together that allocating them to a set of machines without sufficient resources would cause a bottleneck in the query evaluation or even cause the failure of the computation. We call an allocation valid if the constraints are satisfied. We assume that the tasks of query evaluation are memory-intensive computations, therefore the allocation constraints consist of memory constraints in our case. The goal of allocation optimization is to provide a valid allocation optimized for certain optimization targets.

B. Optimization Targets

In this paper, we consider two possible optimization targets.

1) Communication Minimization: This objective is supposed to increase the performance of query evaluation by reducing the overhead of the network communication originating from data transmission in the distributed system. This is mainly useful when the time spent with network communication is significant compared to the time of the local computations performed by tasks, which is true for distributed query evaluation.
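The validity notion of Section II-A reduces to a simple capacity check once tasks have been grouped into processes. The following Python sketch illustrates it; all identifiers and MB figures are made up for illustration and are not part of IncQuery-D:

```python
def is_valid(mem_need, capacity, assign):
    """Validity check: no machine's memory capacity may be exceeded.

    mem_need[j] -- predicted memory need of process j (MB)
    capacity[i] -- memory capacity of machine i (MB)
    assign[j]   -- machine hosting process j (each process gets exactly one)
    """
    used = [0] * len(capacity)
    for proc, machine in enumerate(assign):
        used[machine] += mem_need[proc]
    return all(u <= c for u, c in zip(used, capacity))

# Illustrative data: three processes, two machines.
print(is_valid([600, 2400, 5000], [6000, 3200], [0, 1, 0]))  # True
print(is_valid([600, 2400, 5000], [6000, 3200], [0, 0, 0]))  # False: 8000 MB > 6000 MB
```

An optimizer searches among exactly these valid mappings; the optimization targets below determine which valid mapping it should prefer.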
2) Cost Minimization: Cost Minimization aims to reduce the monetary cost of the computation infrastructure the system is operated on. This technique is useful for most distributed systems, as large systems usually require a lot of resources and are thus expensive to operate, especially when the system runs in a public cloud infrastructure.

C. IncQuery-D and the Rete Algorithm

IncQuery-D [1] is a distributed, incremental model query engine that aims to address the scalability issues of incremental queries over large models. The high-level architecture of the system is shown in Figure 1. IncQuery-D stores the models in distributed databases, as the models can be arbitrarily large. A typical IncQuery-D installation uses a distributed graph database such as 4store (http://4store.org/).

Fig. 1: Architecture of IncQuery-D.

1) The Rete Algorithm: The theoretical foundations of IncQuery-D are provided by the Rete algorithm [2]. For each query, an asynchronous dataflow network of communicating nodes is constructed, consisting of three main types of Rete nodes: i) the indexer layer of IncQuery-D consists of input nodes that are responsible for caching model element identifiers by type and for propagating change notifications; ii) worker nodes implement the relational data operations (such as joins and projections) that make up a query, and they also store the partial results to be propagated to their children nodes; iii) production nodes are responsible for storing and maintaining the final results of queries, as they reside at the endpoints of the network. Furthermore, they provide an interface for accessing query results and result deltas.

III. ALLOCATION OPTIMIZATION IN INCQUERY-D

A. Formalization of the Allocation Problems

1) Communication Minimization: For this problem, we need to represent the communication intensity between processes and the quality of the network connections between machines in a mathematical model that approximates data transmission time. For communication intensity, we defined our own logical data unit, the normalized tuple, which is based on the simplification that all data inside the Rete network is represented by tuples made up of identical scalar elements (strings).

Definition 1 (Normalized tuple): Given a set of model elements organized in tuples (with the same arity), the normalized tuple is defined as the product of the tuple arity (number of fields) and the number of tuples.

Definition 2 (Overhead multiplier): To describe the quality of network connections, we use a simplification called the overhead multiplier. This value represents all important parameters of the network connection as a scalar that allows us to compare the transmission times associated with identical (logical) volumes of data across various connections.

The use of the normalized tuple and the overhead multiplier has been validated by empirical measurements, as shown in Figure 2. We can also observe that the transmission time within the same host is characteristically smaller than between different hosts.

Fig. 2: Measurements of large volume data traffic (linear regressions: same host y = 0.0017x, R² = 0.9835; remote host y = 0.0089x, R² = 0.9993).
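These two notions already support a back-of-the-envelope transmission-time estimate. The sketch below uses the regression slopes read off Figure 2 (roughly 0.0017 ms and 0.0089 ms per normalized tuple for same-host and remote links); wrapping them in a helper function is our illustration, not part of IncQuery-D:

```python
def normalized_tuples(tuple_count, arity):
    """Definition 1: tuple arity (number of fields) times the number of tuples."""
    return tuple_count * arity

# Regression slopes from Figure 2, in milliseconds per normalized tuple.
MS_PER_NTUPLE_SAME_HOST = 0.0017
MS_PER_NTUPLE_REMOTE = 0.0089

def transmission_ms(ntuples, same_host):
    """Estimated time to push a given data volume over one link."""
    slope = MS_PER_NTUPLE_SAME_HOST if same_host else MS_PER_NTUPLE_REMOTE
    return ntuples * slope

# Sending 50,000 three-field tuples: 150,000 normalized tuples.
volume = normalized_tuples(50_000, 3)
print(round(transmission_ms(volume, same_host=False)))  # ~1335 ms remotely
print(round(transmission_ms(volume, same_host=True)))   # ~255 ms locally
```

Note that the remote/same-host slope ratio (about 5.2 here) plays exactly the role of an overhead multiplier between two different hosts.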
With these notions, the Communication Minimization problem can be formalized as follows. The input:

• Memory requirements of processes: a vector with the predicted memory consumption of the processes; the ith element belongs to the ith process.

    S = [s1, s2, · · · , sn]    (1)

• Memory capacity of machines: the memory capacity of the machines that can be used by the processes; the ith element belongs to the ith machine.

    C = [c1, c2, · · · , cm]    (2)

• Communication edges between processes: the matrix E contains the communication intensity between each pair of processes, measured in normalized tuples; the ith row describes the values of the ith process.

    En,n = (ek,l), 1 ≤ k, l ≤ n    (3)

• Communication overheads between machines: the matrix O contains the overhead multipliers between machines; the ith row describes the values of the ith machine.

    Om,m = (oi,j), 1 ≤ i, j ≤ m    (4)

The matrix W contains the calculated communication weights between processes for valid allocations:

    Wn,n = (wk,l), 1 ≤ k, l ≤ n    (5)

The allocation itself is described by the binary auxiliary matrix X: xi,j = 1 means that the jth process is allocated to the ith machine, and 0 means the opposite. Valid allocations have to satisfy: i) for each machine, its memory capacity cannot be exceeded by the processes placed on it,

    s1 · x1,1 + s2 · x1,2 + · · · + sn · x1,n ≤ c1
    s1 · x2,1 + s2 · x2,2 + · · · + sn · x2,n ≤ c2
    ...                                              (6)
    s1 · xm,1 + s2 · xm,2 + · · · + sn · xm,n ≤ cm

and ii) each process must be allocated to exactly one machine:

    ∀j: Σ_{i=1..m} xi,j = 1,  xi,j ∈ {0, 1}    (7)

If two processes are placed on particular machines, their communication intensity has to be multiplied by the overhead value between those machines:

    ∀i, j, k, l, k ≠ l: xi,k + xj,l ≥ 2 → wk,l = ek,l · oi,j    (8)

As the objective, we minimize the sum of the elements of the W matrix:

    weight = min { Σ_{k,l} wk,l }    (9)

2) Cost Minimization: For this problem, each machine has an associated usage cost; the ith element of the cost vector belongs to the ith machine:

    B = [b1, b2, · · · , bm]    (10)

The vector w holds the cost incurred on each machine:

    w = [w1, w2, · · · , wm]    (11)

Valid allocations satisfy the same constraints as for Communication Minimization: i) (6) and ii) (7). We include the cost of a machine if and only if at least one process is placed on that machine:

    ∀i ∈ {1, · · · , m}: Σ_{j=1..n} xi,j ≥ 1 ↔ wi = bi    (12)

As the objective, we minimize the sum of the elements of the w vector:

    cost = min { Σ_{i=1..m} wi }    (13)

B. Heuristics for Approximating Resource Consumption

The lack of exact a-priori knowledge of the optimization parameters is a well-known challenge in query optimization, which can be addressed by heuristics-based estimations that yield reasonable approximations in an efficient way. In our context, the memory usage of processes is a primary target for such estimations, for both problems. Furthermore, the communication intensity (i.e. network traffic) between processes is necessary for the Communication Minimization problem.

1) Importance of Precise Estimation: If we overestimate the memory requirement of a process and reserve much more memory than necessary, then we cannot allocate as many processes to a machine as would otherwise be possible, and thus we cannot reach possibly better allocations. On the other hand, if we underestimate the memory consumption of processes, then we risk either that the processes fail because of an insufficient amount of allocated memory, or that the allocation constraints are violated and resources are overused. As a consequence, accurate memory estimation is required for good resource utilization and for the stable operation of the system. For Communication Minimization, the precise estimation of the communication intensity is of key importance, as an imprecise estimate could misdirect the allocation.

For the Rete algorithm, which operates with collections storing static structures (tuples), the memory usage and the communication intensity of a component can be estimated using simple linear regression methods. The independent variable of these linear functions is the estimated number of model elements, measured in normalized tuples, stored by the Rete nodes.
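Such an estimator can be prototyped in a few lines with ordinary least squares. The calibration points below (stored volume in normalized tuples versus observed heap use) are invented for illustration, not measured IncQuery-D values:

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y ~ a*x + b for the linear estimator."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b

# Hypothetical calibration runs for one Rete node type:
# stored volume (normalized tuples) -> measured memory use (MB)
volumes = [100_000, 200_000, 400_000, 800_000]
memory = [90, 160, 300, 580]

a, b = fit_line(volumes, memory)
predict = lambda v: a * v + b
print(round(predict(600_000)))  # estimated MB for an unseen volume: 440
```

The same one-variable fit applies to communication intensity: the predicted number of normalized tuples stored by a node also bounds what it can emit towards its children.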
The formalized problems are solved with the CPLEX Optimizer (http://www-01.ibm.com/software/commerce/optimization/cplex-optimizer), with the constraints and the objective expressed in its modeling language (http://www-01.ibm.com/software/commerce/optimization/modeling). Constraint (8) and objective (9) take the following form:

    forall(i, j, k, l in 1..n : k != l)
      x[i][k] + x[j][l] >= 2 => w[k][l] == edges[k][l] * overheads[i][j];

    minimize sum(i, j in 1..n) w[i][j];

We also illustrate the constraints with an example.

1) Input representation: Figure 4 shows the available machines with their memory capacities and the overhead multipliers between each pair of machines. Figure 5 shows the sample Rete network with the estimations given, already assigned to processes. We use the auxiliary matrix X to describe the allocation constraints (6) and (7): if xi,j = 1, then the jth process will be allocated to the ith machine, and 0 means the opposite.

The constraints for the communication cost between processes 1 and 3, which may be placed on any pair of machines, are:

    x1,1 + x1,3 ≥ 2 → w1,3 = 200,000 · 1
    x1,1 + x2,3 ≥ 2 → w1,3 = 200,000 · 3
    x2,1 + x1,3 ≥ 2 → w1,3 = 200,000 · 3    (16)
    x2,1 + x2,3 ≥ 2 → w1,3 = 200,000 · 1

5) Solution: Given this input, the CPLEX Optimizer determines the overall communication intensity of the optimal solution to be 2,200,000, as shown in Figure 6.
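On instances of this size, the CSP solution can be cross-checked by exhaustive search. The sketch below enumerates all process-to-machine mappings, filters them through constraints (6) and (7), applies rule (8) to weight each communication edge, and minimizes objective (9); the memory, edge and overhead values are illustrative stand-ins, not the exact data of Figures 4 and 5:

```python
from itertools import product

def best_allocation(mem_need, capacity, edges, overheads):
    """Exhaustively minimize the summed communication weight (objective (9))."""
    n, m = len(mem_need), len(capacity)
    best_weight, best_assign = float("inf"), None
    # constraint (7): every process goes to exactly one machine
    for assign in product(range(m), repeat=n):
        used = [0] * m
        for proc, mach in enumerate(assign):
            used[mach] += mem_need[proc]
        # constraint (6): no machine over its memory capacity
        if any(used[i] > capacity[i] for i in range(m)):
            continue
        # rule (8): edge weight = intensity * overhead of the hosting machine pair
        weight = sum(edges[k][l] * overheads[assign[k]][assign[l]]
                     for k in range(n) for l in range(n) if k != l)
        if weight < best_weight:
            best_weight, best_assign = weight, assign
    return best_weight, best_assign

# Illustrative instance: four processes, two machines, remote overhead 3.
mem_need  = [600, 2400, 3000, 2000]   # MB
capacity  = [6000, 3200]              # MB
edges     = [[0, 200_000, 0, 0],      # normalized tuples per edge
             [0, 0, 1_000_000, 0],
             [0, 0, 0, 200_000],
             [0, 0, 0, 0]]
overheads = [[1, 3],
             [3, 1]]
weight, assign = best_allocation(mem_need, capacity, edges, overheads)
print(weight, assign)  # 1800000 (0, 0, 0, 1)
```

The search colocates the heavy 1,000,000-tuple edge and pays the overhead of 3 only on the lightest edge, mirroring what the CSP solver does at scale.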
Fig. 6: Optimal allocation for the example input (overall communication = 2,200,000).

IV. PERFORMANCE IMPACT OF OPTIMIZATION

To justify the beneficial effects of Communication Minimization on query performance and stability, we conducted measurements using a model validation benchmark, the Train Benchmark [5], a suite designed for evaluating the performance of graph pattern matching tools.

A. Experiments

We designed two experiments. In the first experiment, the impact of respecting the allocation constraints is assessed by comparing an optimized configuration to a naïve setup that a) uses the default heap size for the Java Virtual Machines, and b) assigns close to the maximum available RAM to the JVMs. In the second experiment, we assess the impact of network traffic optimization by comparing the optimized configuration to a non-optimized setup. To emphasize the effect of network communication on the overall performance, we configured the cloud environment to use connections with a lower bandwidth (10 Mbit/s) that simulate a system under load. In both experiments, we ran a complex query of the Train Benchmark on instance models of increasing sizes, up to 2 million nodes and 11 million edges.

1) Hardware and software setup: The benchmark software configuration consisted of an extended Train Benchmark setup using the 4store (version 1.1.5) database in a clustered environment. We ran three virtual machines (VMs) on a private cloud powered by Apache VCL; each VM was running 64-bit Ubuntu Linux 14.04 on Intel Xeon 2.5 GHz CPUs and 8 GB of RAM. We used Oracle 64-bit Java Virtual Machines (version 1.7.0_72). We integrated our monitoring system into the benchmark environment in order to record telemetry data at each execution stage of the benchmark. The benchmark results were automatically processed by the R scripts provided by the Train Benchmark framework.

B. Results and Analysis

1) Experiment 1: Memory optimization: As can be seen in Figure 7, both the "default heap size" and the "maximum heap size" variants failed to execute successfully for the largest instance model sizes, both reporting out-of-heap-space errors in the JVMs running one of the Rete nodes. As the "default heap size" variant uses 1 GB heap limits, the JVM may get into a thrashing state under such loads and cause a timeout during the measurement, which the Train Benchmark framework registers as a failed run. Similarly, in the case of the "maximum heap size" variant, due to the lack of a reasonable upper limit, the JVMs may interfere with each other's memory allocations, resulting in a runtime error.

Fig. 7: Runtime of the first evaluation of a query on a slow (10 Mbit/s) network.

2) Experiment 2: Network optimization: Figure 7 also compares the optimized variant to a non-optimized one (shown as "unoptimized"). We may observe that while the overall characteristics are similar, the optimized variant shows a constant-multiplier advantage, running approximately 15–20% faster than the unoptimized variant. Figure 8 shows the reevaluation time of the query after changes were applied to the model. In the reevaluation phase, the amount of data transmitted is significantly less than in the first evaluation phase. As the numbers are comparatively small (which is consistent with the Train Benchmark specification), the advantage of the optimized variant is within the measurement error range.

Fig. 8: Runtime of the transformation phase on a slow (10 Mbit/s) network.

These observations are explained by the telemetry data recorded by the monitoring system during the measurement.
The network traffic measurements are summarized in Figure 9, which compares the network traffic volume recorded for the "optimized" and "unoptimized" variants. It can be seen that while the overall volume is practically equivalent, its distribution between local and remote links is characteristically different (i.e. in the "optimized" case, the overall remote volume is approximately 15% lower than in the "unoptimized" case).

    Traffic [MBytes]    Unoptimized (vm0 / vm1 / vm2)    Optimized (vm0 / vm1 / vm2)
    Remote RX+TX        300 / 349 / 371                  248 / 280 / 347
    Local RX+TX          14 /   2 /  74                   24 /  20 / 190
    Sum remote          1020                              875
    Sum local             90                              234
    Total traffic       1110                             1109

Fig. 9: Network traffic statistics.

C. Threats to Validity

For these experiments, we considered the following internal and external threats to validity. As transient and unknown background load in the cloud environment can introduce noise, we performed several execution runs and considered their minimum value for the result plots as a countermeasure. We strived to avoid systematic errors in the experiments (e.g. incorrect queries or workloads) by cross-checking all results with the specification of the Train Benchmark. The validity of the analysis, and especially the generalizability of the results to real-world workloads, has been thoroughly investigated in several academic papers on the Train Benchmark (e.g. [1], [6]). We believe that our measurements are faithful extensions of the Train Benchmark and thus the results of these previous works apply to our contributions as well.

V. RELATED WORK

Based on an idea originating in mathematical economics, [7] proposes a characteristically different performance model compared to our domain. In these cases, the cost of communication comes from the costs of reaching data on other nodes, which is not applicable to data-driven distributed systems such as IncQuery-D. Furthermore, global allocation resource constraints are not considered. There are other models [8] where the cost can include the execution costs of tasks on particular processors besides the communication cost, which is a planned future direction of our work.

Apers et al. developed a scheduling-based optimization approach [9] for distributed queries over relational databases. In their model, relational operations are scheduled on computers. The model uses the estimated time of the different operations and of data transmission, as this is required for scheduling. However, as relational database queries are not incremental, memory constraints for the computers are not considered.

Many different mathematical models exist for the problem, according to the varying requirements of systems, but to the best of our knowledge the currently proposed model of allocation has not been formalized and investigated for distributed systems before, especially in the context of incremental model queries.

VI. CONCLUSION

We presented an approach for allocation optimization in the context of distributed systems, focusing on two different optimization targets: reducing resource usage through minimizing remote network traffic, and cost reduction through minimizing the amount of computation resources required to evaluate an incremental query. Our approach is based on a mapping of the allocation optimization problems to the combinatorial optimization domain, solved with the state-of-the-art CPLEX Optimizer. We applied and evaluated this concept for the optimization of distributed incremental model queries in the context of IncQuery-D.

Our primary aim for the future is a technological adaptation of our allocation techniques to the YARN resource management framework [10]. This is directly motivated by the fact that IncQuery-D itself is moving towards a Hadoop-based implementation. In this context, experience has shown that while YARN has several built-in schedulers for resource allocation, those are not well suited to long-running, memory-intensive jobs that are inter-related by complex structural constraints (such as the Rete network). Our long-term goal is to generalize the CPLEX-based optimization approach to a wider range of Hadoop/YARN applications.

REFERENCES

[1] Szárnyas, G. et al.: IncQuery-D: A Distributed Incremental Model Query Framework in the Cloud. In: ACM/IEEE 17th International Conference on Model Driven Engineering Languages and Systems, Valencia, Spain, Springer (2014)
[2] Forgy, C.L.: Rete: A fast algorithm for the many pattern/many object pattern match problem. Artificial Intelligence 19(1) (1982) 17–37
[3] Makai, J.: Optimization of Incremental Queries in the Cloud. Report. https://tdk.bme.hu/VIK/DownloadPaper/Inkrementalis-lekerdezesek-optimalizacioja-a (2014)
[4] Van Hentenryck, P., Simonis, H., Dincbas, M.: Constraint satisfaction using constraint logic programming. Artificial Intelligence 58(1) (1992) 113–159
[5] Szárnyas, G., Semeráth, O., Ráth, I., Varró, D.: The TTC 2015 Train Benchmark Case for Incremental Model Validation. Transformation Tool Contest, 2015
[6] Izsó, B., Szatmári, Z., Bergmann, G., Horváth, Á., Ráth, I.: Towards Precise Metrics for Predicting Graph Query Performance. In: 2013 IEEE/ACM 28th International Conference on Automated Software Engineering (ASE), pages 412–431, Silicon Valley, CA, USA, 2013. IEEE.
[7] Ferguson, D.F. et al.: Economic models for allocating resources in computer systems. (1996) 156–183
[8] Fernández-Baca, D.: Allocating modules to processors in a distributed system. IEEE Transactions on Software Engineering 15(11) (1989) 1427–1436
[9] Apers, P.M.G., Hevner, A.R., Yao, S.B.: Optimization algorithms for distributed queries. IEEE Transactions on Software Engineering (1) (1983) 57–68
[10] Vavilapalli, V.K. et al.: Apache Hadoop YARN: Yet Another Resource Negotiator. In: Proceedings of the 4th Annual Symposium on Cloud Computing. ACM, 2013