<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Optimization of Incremental Queries in the Cloud</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>József Makai</institution>
          ,
          <addr-line>Gábor Szárnyas, A</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Database and model queries are the foundations of data-driven applications. Their performance is of primary importance in model-driven software engineering (MDE), especially with the ever-growing complexity of software modeling projects. To address the scalability issues of traditional MDE tools, distributed frameworks are being developed; however, query optimization in a distributed context brings a whole new set of challenges, including the capacity limits of individual nodes and network communication. In this paper, we aim to address the allocation optimization challenge in the context of distributed incremental query evaluation. Our methods are based on a combination of heuristics-based resource consumption estimations and constraint satisfaction programming. We evaluate the impact of the optimization techniques by conducting benchmark measurements.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>I. INTRODUCTION</title>
      <p>Model-driven software engineering (MDE) plays an
important role in the development processes of critical embedded
systems. With the dramatic increase in complexity that is
also affecting critical embedded systems in recent years,
modeling toolchains are facing scalability challenges as the
size of design models constantly increases, and automated tool
features become more sophisticated.</p>
      <p>Many scalability issues can be addressed by improving
query performance. Incremental model queries aim to reduce
query response time by limiting the impact of model
modifications on query result calculation. This technique has been
proven to improve performance dramatically in several cases
(e.g. on-the-fly well-formedness validation or model
synchronization), at the cost of increased memory consumption.</p>
      <p>INCQUERY-D [1] aims to address the memory consumption
problem by offloading memory-intensive components to a
distributed architecture. INCQUERY-D is a system based on
a distributed Rete network [2] that can scale up to handle
very large models (typically stored in distributed NoSQL or
graph databases) and complex queries efficiently.</p>
      <p>However, the introduction of a cloud or grid-like
architecture brings a whole new set of query optimization
challenges. In this paper, we focus on the (static) allocation
optimization problem, which in our context means the assignment
of memory-intensive distributed components to computation
resources according to a strategy that is optimal in terms
of overall cost, or favourable in terms of query evaluation
performance. We propose to use combinatorial optimization
methods based on constraint satisfaction programming, aided
by heuristics to estimate i) the resource usage of individual
system components and ii) the network traffic between distributed
computation nodes. Our aim is to provide a fully automated
allocation optimization mechanism for INCQUERY-D.</p>
      <p>This work was partially supported by the MONDO (EU ICT-611125)
project and the Lendület programme (MTA-BME Lendület Cyber-Physical
Systems Research Group).</p>
    </sec>
    <sec id="sec-2">
      <title>II. BACKGROUND</title>
      <sec id="sec-2-1">
        <title>A. The Allocation Problem</title>
        <p>In the context of distributed systems, allocation means the
assignment of computation resources to computation tasks.
The input of the allocation consists of the computation tasks
and the edges between them (which represent data-flow
between two tasks), the available machines with their capacities,
and possibly the quality of the network links between machines.
Computation tasks have to be assigned to processes first,
as we run processes (Java Virtual Machines specifically for
INCQUERY-D) on machines, and the resource consumption of
a process consists of the collective resource consumption of
tasks assigned to this particular process. The output of the
allocation is a possible mapping of processes to machines.</p>
        <p>1) Allocation constraints: Allocation constraints are
derived from the observation that certain subsets of computation
tasks together use so many resources that allocating them to
a set of machines without sufficient resources would cause a
bottleneck in the query evaluation, or even cause the failure of
the computation. We call an allocation valid if the constraints
are satisfied. We assume that the tasks of query evaluation
are memory-intensive computations, therefore the allocation
constraints consist of memory constraints in our case. The
goal of allocation optimization is to provide a valid allocation
optimized for certain optimization targets.</p>
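        <p>To make the validity notion concrete, the following is a minimal sketch in Python (with an assumed data layout, not INCQUERY-D code): an allocation maps each process to one machine, and it is valid exactly when no machine's memory capacity is exceeded by the processes placed on it.</p>
```python
# Hypothetical data layout: assignment[j] is the machine index chosen for
# process j, process_mem[j] its estimated memory need, and machine_cap[i]
# the memory capacity of machine i.
def is_valid_allocation(assignment, process_mem, machine_cap):
    used = [0.0] * len(machine_cap)
    for proc, machine in enumerate(assignment):
        # each process contributes its memory need to exactly one machine
        used[machine] += process_mem[proc]
    # valid if and only if every machine stays within its capacity
    return all(cap >= u for u, cap in zip(used, machine_cap))
```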
        <p>In this paper, we consider two possible optimization targets.
1) Communication Minimization: This objective is
intended to increase the performance of query evaluation by
reducing the overhead of network communication originating
from data transmission in the distributed system. This is
mainly useful when the time spent on network communication
is significant compared to the time of the local computations
performed by the tasks, which is the case for distributed query evaluation.</p>
        <p>2) Cost Minimization: Cost Minimization aims to reduce
the monetary cost of computation infrastructure the system
is operated on. This technique is useful for most distributed
systems, as large systems usually require a lot of resources and
thus, are expensive to operate, especially when the system runs
in a public cloud infrastructure.</p>
      </sec>
      <sec id="sec-2-2">
        <title>B. INCQUERY-D and the Rete Algorithm</title>
        <p>INCQUERY-D [1] is a distributed, incremental model query
engine that aims to address scalability issues of incremental
queries over large models. The high-level architecture of the
system is shown in Figure 1. INCQUERY-D stores the models
in distributed databases, as the models can be arbitrarily large.
A typical INCQUERY-D installation uses a distributed graph
database such as 4store (http://4store.org/).</p>
        <p>Fig. 1: The high-level architecture of INCQUERY-D: a database cluster of database shards (1 to n) hosted on DB servers, used to load and modify the data.</p>
        <p>1) The Rete Algorithm: The theoretical foundations of
INCQUERY-D are provided by the Rete algorithm [2]. For
each query, an asynchronous dataflow network of
communicating nodes is constructed, consisting of three main types of
Rete nodes: i) the indexer layer of INCQUERY-D consists of
input nodes that are responsible for caching model element
identifiers by type and for propagating change notifications;
ii) worker nodes implement relational data operations (such as
joins and projections) that make up a query, and they also store
the partial results to be propagated to their child nodes; iii)
production nodes are responsible for storing and maintaining
the final results of queries, as they reside at the endpoints of the
network. Furthermore, they provide an interface for accessing
query results and result deltas.</p>
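        <p>To illustrate how a worker node maintains partial results incrementally, the following is a simplified sketch of a Rete-style natural-join node (illustrative Python, not INCQUERY-D's actual API): each insertion from a parent is joined against the cached tuples of the other side, and only the resulting delta is propagated to child nodes.</p>
```python
# Illustrative sketch of an incremental natural-join worker node.
class JoinNode:
    def __init__(self):
        self.left, self.right, self.results = [], [], []

    def insert_left(self, tup):
        # tup = (key, value); join the new tuple against the cached right side
        self.left.append(tup)
        delta = [(tup[0], tup[1], v) for k, v in self.right if k == tup[0]]
        self.results.extend(delta)
        return delta  # only the delta is propagated to child nodes

    def insert_right(self, tup):
        self.right.append(tup)
        delta = [(tup[0], v, tup[1]) for k, v in self.left if k == tup[0]]
        self.results.extend(delta)
        return delta
```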
      </sec>
    </sec>
    <sec id="sec-3">
      <title>III. ALLOCATION OPTIMIZATION IN INCQUERY-D</title>
      <sec id="sec-3-1">
        <title>A. Formalization of the Allocation Problems</title>
      </sec>
      <sec id="sec-3-2">
        <title>1) Communication Minimization</title>
        <p>For this problem, we
need to represent the communication intensity between processes
and the quality of the network connections between machines in a
mathematical model to approximate data transmission time.
For communication intensity, we defined our own logical data
unit, the normalized tuple, which is based on the simplification
that all data inside the Rete network is represented by tuples
made up of identical scalar elements (strings).</p>
      </sec>
      <sec id="sec-3-3">
        <title>Definition 1 (Normalized tuple)</title>
        <p>Given a set of model elements organized in tuples (with the same arity), the normalized
tuple count is defined as the product of the tuple arity (number of
fields) and the number of tuples.</p>
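        <p>A minimal sketch of Definition 1 (illustrative Python): for a set of tuples of identical arity, the normalized-tuple measure is simply arity times tuple count.</p>
```python
def normalized_tuples(tuples):
    # normalized tuples = tuple arity (number of fields) * number of tuples
    if not tuples:
        return 0
    arity = len(tuples[0])
    return arity * len(tuples)
```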
      </sec>
      <sec id="sec-3-4">
        <title>Definition 2 (Overhead multiplier)</title>
        <p>To describe the quality of network connections, we use a simplification called the
overhead multiplier. This value represents all important parameters
of the network connection as a scalar, which allows us to compare
the transmission times associated with identical (logical) volumes
of data across various connections.</p>
      </sec>
      <sec id="sec-3-5">
        <title>Empirical Validation</title>
        <p>The use of the normalized tuple and the overhead multiplier has
been validated by empirical measurements, as shown in Figure
2. We can also observe that transmission time within the same
host is characteristically smaller than between different hosts.</p>
        <p>Fig. 2: Time to send large data.</p>
        <p>The input of the problem is the following.</p>
        <p>Memory requirements of processes: the vector S = [s_1, s_2, …, s_n] contains the estimated memory requirement of each process (1). Memory capacities of machines: the vector C = [c_1, c_2, …, c_m] contains the memory capacity of each machine (2).</p>
        <p>Communication edges between processes: the matrix E (n × n) contains the communication intensity between every two processes, measured in normalized tuples; for the i-th process, the i-th row describes these values (3).</p>
        <p>Communication overheads between machines: the matrix O (m × m) contains the overhead multipliers between machines; for the i-th machine, the i-th row describes these values (4).</p>
        <p>The matrix W contains the calculated communication weights between processes for valid allocations (5).</p>
        <p>The auxiliary matrix X describes the allocation: x_i,j ∈ {0, 1}, where x_i,j = 1 means that the j-th process is allocated to the i-th machine (6).</p>
        <p>Valid allocations have to satisfy two kinds of constraints: i) each process must be allocated to exactly one machine,</p>
        <p>∀j : Σ_{i=1..m} x_i,j = 1 (7)</p>
        <p>and ii) for each machine, the memory capacity must not be exceeded by the processes allocated to it:</p>
        <p>∀i ∈ {1, …, m} : s_1 · x_i,1 + s_2 · x_i,2 + … + s_n · x_i,n ≤ c_i (10)</p>
        <p>If two processes are placed on particular machines, their communication intensity has to be multiplied by the overhead value between those machines:</p>
        <p>∀i, j, k, l, k ≠ l : x_i,k + x_j,l ≥ 2 → w_k,l = e_k,l · o_i,j (8)</p>
        <p>As the objective, we minimize the sum of the elements of the W matrix:</p>
        <p>weight = min { Σ_{1 ≤ i, k ≤ n} w_i,k } (9)</p>
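        <p>For intuition, the exhaustive search below restates the Communication Minimization model in executable form (a Python sketch with an assumed data layout; the paper solves this NP-hard problem with CPLEX rather than brute force): it enumerates all assignments, discards those violating the memory constraints, and scores the rest by the overhead-weighted communication sum.</p>
```python
import itertools

def min_weight_allocation(mem, cap, E, O):
    # mem[k]: memory need of process k; cap[i]: capacity of machine i
    # E[k][l]: communication intensity between processes k and l
    # O[i][j]: overhead multiplier between machines i and j
    n, m = len(mem), len(cap)
    best, best_alloc = None, None
    for alloc in itertools.product(range(m), repeat=n):
        used = [0.0] * m
        for k in range(n):
            used[alloc[k]] += mem[k]
        if any(used[i] > cap[i] for i in range(m)):
            continue  # memory constraint violated, allocation invalid
        # w_{k,l} = e_{k,l} * o_{i,j} for the machines hosting k and l
        weight = sum(E[k][l] * O[alloc[k]][alloc[l]]
                     for k in range(n) for l in range(n) if k != l)
        if best is None or best > weight:
            best, best_alloc = weight, alloc
    return best, best_alloc
```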
      </sec>
      <sec id="sec-3-6">
        <title>2) Cost Minimization</title>
        <p>The input: memory requirements (1) and memory capacities (2) are the
same as for Communication Minimization.</p>
        <p>Cost of machines: the vector B = [b_1, b_2, …, b_m] contains the monetary cost of each machine (11).</p>
        <p>The vector w contains the cost of each machine in the
system: the cost of the machine if it is used, otherwise 0. A machine is used if at least one process is allocated to it:</p>
        <p>∀i ∈ {1, …, m} : Σ_{j=1..n} x_i,j ≥ 1 ↔ w_i = b_i (12)</p>
        <p>As the objective, we minimize the sum of the elements of the
w vector:</p>
        <p>cost = min { Σ_{i=1..m} w_i } (13)</p>
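        <p>The cost objective can be restated in executable form as a small sketch (illustrative Python, assumed data layout): a machine contributes its monetary cost exactly when at least one process is allocated to it.</p>
```python
def total_cost(assignment, machine_cost):
    # assignment[j]: machine index of process j; machine_cost[i]: cost b_i
    used = set(assignment)  # machines hosting at least one process
    # w_i = b_i if machine i is used, otherwise 0; the cost is their sum
    return sum(cost for i, cost in enumerate(machine_cost) if i in used)
```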
        <p>The lack of exact a-priori knowledge of optimization
parameters is a well-known challenge in query optimization,
which can be addressed by heuristics-based estimations to
yield reasonable approximations in an efficient way. In our
context, the memory usage of processes is a primary target for
such estimations, for both problems. Furthermore, the
communication intensity (i.e. network traffic) between processes
is necessary for the Communication Minimization problem.</p>
        <p>1) Importance of Precise Estimation: If we overestimate
the memory requirement of a process and reserve much more
memory than necessary, then we cannot allocate as many
processes to a machine as would otherwise be possible, and
thus we cannot reach potentially better allocations. On the
other hand, if we underestimate the memory consumption of
processes, then we risk either that the processes fail because
of an insufficient amount of allocated memory, or we risk the
violation of allocation constraints, and thus the overuse
of resources. As a consequence, accurate memory estimation
is required for good resource utilization and for the stable
operation of the system.</p>
        <p>For Communication Minimization, the precise estimation
of communication intensity is of key importance, as an imprecise
estimate could misdirect the allocation.</p>
        <p>For the Rete algorithm that operates with collections storing
static structures (tuples), the memory usage and the
communication intensity of a component can be estimated using simple
linear regression methods. The independent variable of these
linear functions is the estimated number of model elements,
measured in normalized tuples, stored by the Rete nodes.</p>
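        <p>As a sketch of such a heuristic (illustrative Python; the sample numbers in the usage are invented, not measurements from the paper), a closed-form least-squares fit yields the slope and intercept of the linear function mapping normalized-tuple counts to resource usage.</p>
```python
def fit_line(xs, ys):
    # ordinary least squares for y = slope * x + intercept
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x
```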
        <p>2) Calculation and Propagation of Estimations: The core
idea is to estimate the amount of outbound network traffic
from a Rete node by the amount of data it contains, as in
the initialization phase of the network, the entirety of this
data will be transmitted. On the other hand, data-storing Rete
nodes must store all the data that is propagated by their parent
nodes.</p>
        <p>This principle can be applied and propagated through the
complete Rete structure using breadth-first search (BFS).
Figure 3 shows an example where we marked the iteration in which the
different estimations can be calculated. A white circle
denotes a memory estimation and a black circle denotes a
communication intensity estimation. After the level of the input
nodes (for which the amount of stored, and thus propagated,
data is known exactly, as they propagate all their input
data), the estimations for their direct child nodes can be
calculated in the second iteration, and so on. It is important to
note that we cannot calculate the estimations for a node until the
estimations have been calculated for all of its parent nodes.</p>
        <p>Fig. 5: Rete nodes and processes with estimations.</p>
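        <p>The propagation scheme above can be sketched as follows (illustrative Python with a hypothetical graph encoding): input nodes start with exactly known sizes, and each remaining node is estimated as soon as all of its parents have been, in BFS order.</p>
```python
from collections import deque

def propagate(parents, input_sizes, estimate):
    # parents[v]: list of parent nodes of v (empty for input nodes)
    # input_sizes[v]: exactly known size of input node v
    # estimate(parent_sizes): estimation rule for a non-input node
    size = dict(input_sizes)
    remaining = {v: len(ps) for v, ps in parents.items() if ps}
    children = {}
    for v, ps in parents.items():
        for p in ps:
            children.setdefault(p, []).append(v)
    queue = deque(v for v, ps in parents.items() if not ps)
    while queue:
        v = queue.popleft()
        for c in children.get(v, []):
            remaining[c] -= 1
            if remaining[c] == 0:  # all parents estimated: c can be estimated
                size[c] = estimate([size[p] for p in parents[c]])
                queue.append(c)
    return size
```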
        <p>The estimations for different types of Rete nodes are
calculated following different rules that were refined in empirical
experiments carried out with typical model-driven engineering
workloads and models (sparse graphs). We present a few
examples of these rules; for more examples, see [3].</p>
        <p>When i) joining nodes of the model through edges, we
expect to have 1% of the input set size (the Cartesian product
of the two input sets) as the result. For ii) selection nodes (which
filter elements according to criteria), we expect to have 10%
of the input elements as the result, as this is a good upper-bound
estimation for usual queries.</p>
        <p>2) Memory constraints: The constraint is the following for machine 1:</p>
        <p>600 · x_1,1 + 2400 · x_1,2 + 3200 · x_1,3 + 500 · x_1,4 ≤ c_1</p>
        <p>3) 1-to-1 Allocation: Constraint (7) in OPL:</p>
        <p>forall (j in 1..n)
  sum (i in 1..m)
    x[i][j] == 1;</p>
        <p>The constraint for process 1 is: Σ_{i=1..m} x_i,1 = 1.</p>
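        <p>The two example estimation rules above can be sketched directly (illustrative Python; the 1% and 10% factors are the empirical heuristics stated in the text):</p>
```python
def estimate_join(left_size, right_size):
    # join result estimated as 1% of the Cartesian product of the inputs
    return 0.01 * left_size * right_size

def estimate_selection(input_size):
    # selection result estimated as 10% of the input (an upper bound)
    return 0.10 * input_size
```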
      </sec>
      <sec id="sec-3-7">
        <title>C. Solution by Constraint Satisfaction</title>
        <p>We proved both the Communication and Cost Minimization
problems to be NP-hard [3], therefore we propose to solve
the problems by constraint satisfaction programming (CSP) [4],
using the IBM CPLEX Optimizer (http://www-01.ibm.com/software/commerce/optimization/cplex-optimizer)
as the platform of implementation. We show how the Communication Minimization
problem can be formalized with CPLEX and the OPL
language (http://www-01.ibm.com/software/commerce/optimization/modeling). We also illustrate the constraints with an example.</p>
        <p>1) Input representation: Figure 4 shows the available
machines with their memory capacities and with the overhead
multipliers between each pair of machines.
Figure 5 shows the sample Rete network with the estimations
given and already assigned to processes.</p>
        <p>We have an auxiliary matrix X to describe the allocation
constraints (6) and (7). If x_i,j = 1, then the j-th process will
be allocated to the i-th machine, and 0 means the opposite.</p>
        <p>4) Objective Function: The OPL code for constraints (8) and (9),
where the edges matrix (n × n) contains the
communication intensity values between every two processes
and the overheads matrix (m × m) contains the overhead
multipliers between machines:</p>
        <p>forall (i,j in 1..m)
  forall (k,l in 1..n)
    (x[i][k] + x[j][l] >= 2) => w[k][l] == edges[k][l] * overheads[i][j];
minimize sum (i,j in 1..n) w[i][j];</p>
        <p>The constraint for the communication cost between processes
1 and 3, which may be placed on any pair of machines; for example,
when both are on machine 1:</p>
        <p>x_1,1 + x_1,3 ≥ 2 → w_1,3 = 200000 · 1</p>
        <p>5) Solution: Given this input, the CPLEX Optimizer
determines the overall communication intensity of the optimal
solution to be 2,200,000, as shown in Figure 6.</p>
        <p>IV. EVALUATION</p>
        <p>To justify the beneficial effects of Communication
Minimization on query performance and stability, we conducted
measurements using a model validation benchmark, the Train
Benchmark [5], a suite designed for evaluating the
performance of graph pattern matching tools.</p>
      </sec>
      <sec id="sec-3-8">
        <title>A. Experiments</title>
        <p>We designed two experiments. In the first experiment, the
impact of respecting allocation constraints is assessed by
comparing an optimized configuration to a naïve setup that a)
uses the default heap size for the Java Virtual Machines, and b)
assigns close to the maximum available RAM to the JVMs.
In the second experiment, we assess the impact of network
traffic optimization by comparing the optimized configuration
to a non-optimized setup. To emphasize the effect of network
communication on overall performance, we configured the
cloud environment to use connections with lower bandwidth (10
Mbit/s) that simulate a system under load. In both experiments,
we ran a complex query of the Train Benchmark on instance
models of increasing sizes, up to 2 million nodes and 11
million edges.</p>
        <p>1) Hardware and software setup: The benchmark software
configuration consisted of an extended Train Benchmark setup
using the 4store (version 1.1.5) database in a clustered
environment. We ran three virtual machines (VMs) on a private
cloud powered by Apache VCL; each VM was running
64-bit Ubuntu Linux 14.04 on Intel Xeon 2.5 GHz CPUs with
8 GB of RAM. We used Oracle 64-bit Java Virtual Machines
(version 1.7.0_72). We integrated our monitoring system into
the benchmark environment in order to record telemetry data at
each execution stage of the benchmark. The benchmark results
were automatically processed by the R scripts provided by the
Train Benchmark framework.</p>
      </sec>
      <sec id="sec-3-9">
        <title>B. Results and Analysis</title>
      </sec>
      <sec id="sec-3-10">
        <title>1) Experiment 1: Memory optimization</title>
        <p>As can be seen
in Figure 7, both the “default heap size” variant and the
“maximum heap size” variant failed to execute successfully
for the largest instance model sizes, both reporting out-of-heap-space
errors in the JVMs running one of the Rete nodes.</p>
        <p>As the “default heap size” variant uses 1 GB heap limits, the
JVM may get into a thrashing state under such loads and cause
a timeout during measurement which the Train Benchmark
framework registers as a failed run. Similarly, in the case of the
“maximum heap size” variant, due to the lack of a reasonable
upper limit, the JVMs may interfere with each other’s memory
allocations resulting in a runtime error.</p>
        <p>Fig. 7: Batch validation time, RouteSensor (x, y: logscale).</p>
        <p>2) Experiment 2: Network optimization: Figure 7 compares
the optimized variant to a “non-optimized” one (shown as
unoptimized). We may observe that while the overall characteristics
are similar, the optimized variant shows a constant-multiplier
advantage, running approximately 15–20% faster than the
unoptimized variant. Figure 8 shows the reevaluation time
of a query after changes were applied to the model. In
the reevaluation phase, the amount of data transmitted is
significantly less than in the first evaluation phase. As the
numbers are comparatively small (which is consistent with the
Train Benchmark specification), the advantage of the optimized
variant is within the measurement error range.</p>
        <p>These observations are explained by telemetry data recorded
by a monitoring system during the measurement. The network
traffic measurements are summarized in Figure 9, which
compares the network traffic volume recorded for the
“optimized” and “unoptimized” variants. It can be seen that while
the overall volume is practically equivalent, its distribution
between local and remote links is characteristically different
(i.e. in the “optimized” case, the overall remote volume is
approximately 15% lower than in the “unoptimized” case).</p>
        <p>Fig. 9: Network traffic statistics.</p>
      </sec>
      <sec id="sec-3-11">
        <title>C. Threats to Validity</title>
        <p>For these experiments, we considered the following internal
and external threats to validity. As transient and unknown
background load in the cloud environment can introduce noise,
as a countermeasure we performed several execution runs and considered their
minimum values for the result plots. We
strived to avoid systematic errors in the experiments (e.g.
incorrect queries or workloads) by cross-checking all results
with the specification of the Train Benchmark. The validity of
the analysis and especially the generalizability of the results
to real-world workloads has been thoroughly investigated in
several academic papers on the Train Benchmark (e.g. [1],
[6]). We believe that our measurements are faithful extensions
of the Train Benchmark and thus the results of these previous
works apply to our contributions as well.</p>
        <p>V. RELATED WORK</p>
        <p>Based on an idea originating in mathematical economics, [7]
proposes a characteristically different performance model
compared to our domain. In these cases, the cost of
communication comes from the cost of reaching data on other nodes,
which is not applicable to data-driven distributed systems
such as INCQUERY-D. Furthermore, global allocation resource
constraints are not considered. There are other models [8],
where the cost can include the execution costs of tasks on
particular processors besides the communication cost, which
is a planned future direction of our work.</p>
        <p>Apers et al. developed a scheduling-based optimization
approach [9] for distributed queries over relational databases. In
their model, relational operations are scheduled on computers.
The model uses the estimated time of different operations and
data transmission, as it is required for scheduling. However,
as relational database queries are not incremental, memory
constraints for computers are not considered.</p>
        <p>Many different mathematical models exist for the problem,
according to the varying requirements of systems, but to
the best of our knowledge, the currently proposed model of allocation
has never been formalized and investigated for distributed systems
before, especially in the context of incremental model queries.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>VI. CONCLUSION</title>
      <p>We presented an approach for allocation optimization in the
context of distributed systems, focusing on two different
optimization targets: reducing resource usage through minimizing
remote network traffic and cost reduction through minimizing
the amount of computation resources required to evaluate
an incremental query. Our approach is based on a mapping
of the allocation optimization problems to the combinatorial
optimization domain, solved with the state-of-the-art CPLEX
Optimizer. We applied and evaluated this concept to the
optimization of distributed incremental model queries in the
context of INCQUERY-D.</p>
      <p>Our primary aim for the future is a technological
adaptation of our allocation techniques to the YARN resource
management framework [10]. This is directly motivated by
the fact that INCQUERY-D itself is moving towards a
Hadoop-based implementation. In this context, experience has shown
that while YARN has several built-in schedulers for resource
allocation, those are not well-suited to long-running,
memory-intensive jobs that are inter-related by complex structural
constraints (such as the Rete network). Our long-term goal
is to generalize the CPLEX-based optimization approach to a
wider range of Hadoop/YARN applications.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] Szárnyas, G. et al.:
          <article-title>IncQuery-D: A Distributed Incremental Model Query Framework in the Cloud</article-title>
          .
          <source>In: ACM/IEEE 17th International Conference on Model Driven Engineering Languages and Systems</source>
          , Valencia, Spain, Springer (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Forgy</surname>
            ,
            <given-names>C.L.</given-names>
          </string-name>
          :
          <article-title>Rete: A fast algorithm for the many pattern/many object pattern match problem</article-title>
          .
          <source>Artificial intelligence 19(1)</source>
          (
          <year>1982</year>
          )
          <fpage>17</fpage>
          -
          <lpage>37</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Makai</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Optimization of Incremental Queries in the Cloud</article-title>
          .
          <source>Report</source>
          . https://tdk.bme.hu/VIK/DownloadPaper/Inkrementalis-lekerdezesekoptimalizacioja-a (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] Van Hentenryck, P., Simonis, H., Dincbas, M.:
          <article-title>Constraint satisfaction using constraint logic programming</article-title>
          .
          <source>Artificial intelligence 58(1)</source>
          (
          <year>1992</year>
          )
          <fpage>113</fpage>
          -
          <lpage>159</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5] Szárnyas, G., Semeráth, O., Ráth, I., Varró, D.
          :
          <article-title>The TTC 2015 Train Benchmark Case for Incremental Model Validation</article-title>
          . Transformation Tool Contest,
          <year>2015</year>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6] Izsó, B., Szatmári, Z., Bergmann, G., Horváth, Á., Ráth, I.:
          <article-title>Towards Precise Metrics for Predicting Graph Query Performance</article-title>
          .
          <source>In 2013 IEEE/ACM 28th International Conference on Automated Software Engineering (ASE)</source>
          , pages
          <fpage>412</fpage>
          -
          <lpage>431</lpage>
          , Silicon Valley, CA, USA,
          <year>2013</year>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Ferguson</surname>
            ,
            <given-names>D.F.</given-names>
          </string-name>
          et al.:
          <article-title>Economic Models for Allocating Resources in Computer Systems</article-title>
          . (
          <year>1996</year>
          )
          <fpage>156</fpage>
          -
          <lpage>183</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8] Fernández-Baca, D.
          :
          <article-title>Allocating modules to processors in a distributed system</article-title>
          .
          <source>Software Engineering</source>
          , IEEE Transactions on
          <volume>15</volume>
          (
          <issue>11</issue>
          ) (
          <year>1989</year>
          )
          <fpage>1427</fpage>
          -
          <lpage>1436</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Apers</surname>
            ,
            <given-names>P.M.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hevner</surname>
            ,
            <given-names>A.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yao</surname>
            ,
            <given-names>S.B.</given-names>
          </string-name>
          :
          <article-title>Optimization algorithms for distributed queries</article-title>
          .
          <source>Software Engineering, IEEE Transactions on (1)</source>
          (
          <year>1983</year>
          )
          <fpage>57</fpage>
          -
          <lpage>68</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Vavilapalli</surname>
            ,
            <given-names>Vinod</given-names>
          </string-name>
          <string-name>
            <surname>Kumar</surname>
          </string-name>
          , et al.:
          <article-title>Apache Hadoop YARN: Yet Another Resource Negotiator</article-title>
          .
          <source>Proceedings of the 4th Annual Symposium on Cloud Computing. ACM</source>
          ,
          <year>2013</year>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>