=Paper=
{{Paper
|id=Vol-1563/paper1
|storemode=property
|title=Optimization of Incremental Queries in the Cloud
|pdfUrl=https://ceur-ws.org/Vol-1563/paper1.pdf
|volume=Vol-1563
|authors=Jozsef Makai,Gabor Szarnyas,Istvan Rath,Akos Horvath,Daniel Varro
|dblpUrl=https://dblp.org/rec/conf/models/MakaiSRHV15
}}
==Optimization of Incremental Queries in the Cloud==
József Makai, Gábor Szárnyas, Ákos Horváth, István Ráth, Dániel Varró
Fault Tolerant Systems Research Group
Department of Measurement and Information Systems
Budapest University of Technology and Economics
H-1117, Magyar Tudósok krt. 2.
Budapest, Hungary
jozsef.makai@inf.mit.bme.hu,{szarnyas,ahorvath,rath,varro}@mit.bme.hu
Abstract—Database and model queries are the foundations of data-driven applications. Their performance is of primary importance in model-driven software engineering (MDE), especially with the ever-growing complexity of software modeling projects. To address the scalability issues of traditional MDE tools, distributed frameworks are being developed – however, query optimization in a distributed context brings a whole new set of challenges, including capacity limits of individual nodes and network communication. In this paper, we aim to address the allocation optimization challenge in the context of distributed incremental query evaluation. Our methods are based on a combination of heuristics-based resource consumption estimations and constraint satisfaction programming. We evaluate the impact of the optimization techniques by conducting benchmark measurements.

I. INTRODUCTION

Model-driven software engineering (MDE) plays an important role in the development processes of critical embedded systems. With the dramatic increase in complexity that has also affected critical embedded systems in recent years, modeling toolchains are facing scalability challenges as the size of design models constantly increases and automated tool features become more sophisticated.

Many scalability issues can be addressed by improving query performance. Incremental model queries aim to reduce query response time by limiting the impact of model modifications on query result calculation. This technique has been proven to improve performance dramatically in several cases (e.g. on-the-fly well-formedness validation or model synchronization), at the cost of increased memory consumption.

IncQuery-D [1] aims to address the memory consumption problem by offloading memory-intensive components to a distributed architecture. IncQuery-D is a system based on a distributed Rete network [2] that can scale up to handle very large models (typically stored in distributed NoSQL or graph databases) and complex queries efficiently.

However, the introduction of a cloud and grid-like architecture brings a whole new set of query optimization challenges. In this paper, we focus on the (static) allocation optimization problem, which in our context means the assignment of memory-intensive distributed components to computation resources according to a strategy that is optimal in terms of overall cost, or favourable in terms of query evaluation performance. We propose to use combinatorial optimization methods based on constraint satisfaction programming, aided by heuristics to estimate i) the resource usage of individual system components and ii) the network traffic between distributed computation nodes. Our aim is to provide a fully automated allocation optimization mechanism for IncQuery-D.

This work was partially supported by the MONDO (EU ICT-611125) project and the Lendület programme (MTA-BME Lendület Cyber-Physical Systems Research Group).

II. BACKGROUND

A. The Allocation Problem

In the context of distributed systems, allocation means the assignment of computation resources to computation tasks. The input of the allocation consists of the computation tasks and the edges between them (which represent data flow between two tasks), the available machines with their capacities, and possibly the quality of the network links between machines. Computation tasks have to be assigned to processes first, as we run processes (Java Virtual Machines in the case of IncQuery-D) on machines, and the resource consumption of a process consists of the collective resource consumption of the tasks assigned to it. The output of the allocation is a possible mapping of processes to machines.

1) Allocation constraints: Allocation constraints are derived from the observation that certain subsets of computation tasks use so many resources together that allocating them to a set of machines without sufficient resources would cause a bottleneck in the query evaluation, or even cause the failure of the computation. We call an allocation valid if the constraints are satisfied. We assume that the tasks of query evaluation are memory-intensive computations; therefore, the allocation constraints consist of memory constraints in our case. The goal of allocation optimization is to provide a valid allocation optimized for certain optimization targets.

B. Optimization Targets

In this paper, we consider two possible optimization targets.
1) Communication Minimization: This objective is supposed to increase the performance of query evaluation by reducing the overhead of network communication originating from data transmission in the distributed system. It is mainly useful when the time spent with network communication is significant compared to the time of the local computations performed by the tasks, which is true for distributed query evaluation.

2) Cost Minimization: Cost Minimization aims to reduce the monetary cost of the computation infrastructure the system is operated on. This technique is useful for most distributed systems, as large systems usually require a lot of resources and are thus expensive to operate, especially when the system runs in a public cloud infrastructure.

C. IncQuery-D and the Rete Algorithm

IncQuery-D [1] is a distributed, incremental model query engine that aims to address the scalability issues of incremental queries over large models. The high-level architecture of the system is shown in Figure 1. IncQuery-D stores the models in distributed databases, as the models can be arbitrarily large. A typical IncQuery-D installation uses a distributed graph database such as 4store¹.

Fig. 1: Architecture of IncQuery-D. (The architecture comprises a database cluster of shards with input nodes, a distributed indexer with a model access adapter, a distributed query evaluation network of worker and production nodes running on a set of servers, and the IncQuery-D runtime with the allocation optimizer.)

1) The Rete Algorithm: The theoretical foundations of IncQuery-D are provided by the Rete algorithm [2]. For each query, an asynchronous dataflow network of communicating nodes is constructed, consisting of three main types of Rete nodes: i) the indexer layer of IncQuery-D consists of input nodes that are responsible for caching model element identifiers by type and for propagating change notifications; ii) worker nodes implement the relational data operations (such as joins and projections) that make up a query, and they also store the partial results to be propagated to their children nodes; iii) production nodes are responsible for storing and maintaining the final results of queries, as they reside at the endpoints of the network. Furthermore, they provide an interface for accessing query results and result deltas.

¹http://4store.org/

III. ALLOCATION OPTIMIZATION IN INCQUERY-D

A. Formalization of the Allocation Problems

1) Communication Minimization: For this problem, we need to represent the communication intensity between processes and the quality of the network connections between machines in a mathematical model that approximates data transmission time. For communication intensity, we defined our own logical data unit, the normalized tuple, which is based on the simplification that all data inside the Rete network is represented by tuples made up of identical scalar elements (strings).

Definition 1 (Normalized tuple): Given a set of model elements organized in tuples (with the same arity), the normalized tuple is defined as the product of the tuple arity (number of fields) and the number of tuples.

Definition 2 (Overhead multiplier): To describe the quality of network connections, we use a simplification called the overhead multiplier. This value represents all important parameters of the network connection as a scalar that allows us to compare the transmission times associated with identical (logical) volumes of data across various connections.

The use of the normalized tuple and the overhead multiplier has been validated by empirical measurements, as shown in Figure 2. We can also observe that transmission time within the same host is characteristically smaller than between different hosts.

Fig. 2: Measurements of large volume data traffic. (Transmission time grows linearly with the number of normalized tuples sent; the fitted regressions are y = 0.0017x with R² = 0.9835 for the same host and y = 0.0089x with R² = 0.9993 for a remote host.)

With these notions, the Communication Minimization problem can be formalized as follows. The input:

• Memory requirements of processes: vector with the predicted memory consumption of the processes. The ith element belongs to the ith process.

    S = [s1, s2, ..., sn]    (1)

• Memory capacity of machines: the memory capacity of the machines that can be used by the processes. The ith element belongs to the ith machine.

    C = [c1, c2, ..., cm]    (2)

• Communication edges between processes: the matrix E contains the communication intensity between each pair of processes, measured in normalized tuples. For the ith process, the ith row describes these values.
    E(n,n) = [ e1,1  e1,2  ...  e1,n
               e2,1  e2,2  ...  e2,n
               ...
               en,1  en,2  ...  en,n ]    (3)

• Communication overheads between machines: the matrix O contains the overhead multipliers between machines. For the ith machine, the ith row describes these values.

    O(m,m) = [ o1,1  o1,2  ...  o1,m
               o2,1  o2,2  ...  o2,m
               ...
               om,1  om,2  ...  om,m ]    (4)

The matrix W contains the calculated communication weights between processes for valid allocations.

    W(n,n) = [ w1,1  w1,2  ...  w1,n
               w2,1  w2,2  ...  w2,n
               ...
               wn,1  wn,2  ...  wn,n ]    (5)

Valid allocations have to satisfy: i) for each machine, the memory capacity cannot be exceeded by the processes allocated to it (here xi,j = 1 means that the jth process is allocated to the ith machine):

    s1 · x1,1 + s2 · x1,2 + ... + sn · x1,n ≤ c1
    s1 · x2,1 + s2 · x2,2 + ... + sn · x2,n ≤ c2
    ...
    s1 · xm,1 + s2 · xm,2 + ... + sn · xm,n ≤ cm    (6)

ii) each process must be allocated to exactly one machine:

    ∀j: Σ(i=1..m) xi,j = 1,  xi,j ∈ {0, 1}    (7)

If two processes are placed on particular machines, their communication intensity has to be multiplied by the overhead value between those machines:

    ∀i, j, k, l (k ≠ l): xi,k + xj,l ≥ 2 → wk,l = ek,l · oi,j    (8)

As the objective, we minimize the sum of the elements in the W matrix:

    weight = min Σ(k,l) wk,l    (9)

2) Cost Minimization: For this problem, the vector w describes the cost incurred by each machine in a given allocation, where bi denotes the cost of operating the ith machine:

    w = [w1, w2, ..., wm]    (11)

Valid allocations satisfy the same constraints as for Communication Minimization: i) (6) and ii) (7). We include the cost of a machine if and only if at least one process is placed on the machine:

    ∀i ∈ {1, ..., m}: Σ(j=1..n) xi,j ≥ 1 ↔ wi = bi    (12)

As the objective, we minimize the sum of the elements in the w vector:

    cost = min Σ(i=1..m) wi    (13)

B. Heuristics for Approximating Resource Consumption

The lack of exact a priori knowledge of the optimization parameters is a well-known challenge in query optimization, which can be addressed by heuristics-based estimations that yield reasonable approximations in an efficient way. In our context, the memory usage of processes is a primary target for such estimations, for both problems. Furthermore, the communication intensity (i.e. network traffic) between processes is necessary for the Communication Minimization problem.

1) Importance of Precise Estimation: If we overestimate the memory requirement of a process and reserve much more memory than necessary, then we cannot allocate as many processes to a machine as would otherwise be possible, and thus we cannot reach possibly better allocations. On the other hand, if we underestimate the memory consumption of processes, then we risk either that the processes fail because of an insufficient amount of allocated memory, or that the allocation constraints are violated and resources are overused. As a consequence, accurate memory estimation is required for good resource utilization and for the stable operation of the system.

For Communication Minimization, the precise estimation of the communication intensity is also of key importance, as an imprecise estimate could misdirect the allocation.

For the Rete algorithm, which operates with collections storing static structures (tuples), the memory usage and the communication intensity of a component can be estimated using simple linear regression methods. The independent variable of these linear functions is the estimated number of model elements, measured in normalized tuples, stored by the Rete nodes.
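As an illustration of this heuristic, a least-squares fit of memory consumption against stored normalized tuples might look as follows. The sample points are made up for the sketch and do not come from the paper's measurements.

```python
# Sketch of the linear-regression heuristic of Section III-B: the memory use
# of a Rete node is modelled as a linear function of the normalized tuples
# it stores. Sample points are illustrative, not measured values.

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# (normalized tuples stored, measured memory in MB) for one node type
samples_x = [50_000, 100_000, 200_000, 400_000]
samples_y = [110, 210, 410, 810]     # roughly 0.002 MB/tuple plus 10 MB base

a, b = fit_line(samples_x, samples_y)
estimate = a * 300_000 + b           # predicted memory for 300k normalized tuples
print(round(a, 4), round(b, 1), round(estimate, 1))  # → 0.002 10.0 610.0
```

The fitted slope and intercept can then be reused to predict the memory need of a node for any estimated model size before the allocation is computed.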
We implemented the optimization problems with the IBM CPLEX Optimizer² and expressed the constraints in its modeling language³. We also illustrate the constraints by an example.

1) Input representation: Figure 4 shows the available machines with their memory capacities and with the overhead multipliers between each pair of machines. Figure 5 shows the sample Rete network with the estimations given and already assigned to processes.

We have an auxiliary matrix X to describe the allocation constraints (6) and (7): if xi,j = 1, then the jth process will be allocated to the ith machine, and 0 means the opposite. Constraint (8) and objective (9) take roughly the following form in the modeling language:

    // constraint (8): the communication weight of two Rete nodes is their
    // intensity multiplied by the overhead between their host machines
    (x[i][k] + x[j][l] >= 2) => w[k][l] == edges[k][l]
                                           * overheads[i][j];
    // objective (9): minimize the total communication weight
    minimize sum(i, j in 1..n) w[i][j];

Constraints for the communication cost between processes 1 and 3, with the possibility of being placed on any pair of machines:

    x1,1 + x1,3 ≥ 2 → w1,3 = 200000 · 1
    x1,1 + x2,3 ≥ 2 → w1,3 = 200000 · 3
    x2,1 + x1,3 ≥ 2 → w1,3 = 200000 · 3
    x2,1 + x2,3 ≥ 2 → w1,3 = 200000 · 1    (16)

²http://www-01.ibm.com/software/commerce/optimization/cplex-optimizer
³http://www-01.ibm.com/software/commerce/optimization/modeling
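The Communication Minimization formalization can also be checked without CPLEX: on a toy instance, a brute-force enumeration of all process-to-machine assignments implementing constraint (6), rule (8), and objective (9) suffices, with constraint (7) implicit in the enumeration. The instance below is illustrative and is not the example of Figures 4 and 5.

```python
# Brute-force sketch of the Communication Minimization problem of
# Section III-A. Enumerating assignments makes constraint (7) implicit:
# every process is mapped to exactly one machine. Numbers are illustrative.
from itertools import product

s = [600, 3200, 500]           # memory needs of processes (MB), cf. vector S (1)
c = [2400, 6000]               # machine capacities (MB), cf. vector C (2)
e = [[0, 200_000, 0],          # communication intensity e[k][l] in normalized
     [200_000, 0, 200_000],    # tuples, cf. matrix E (3)
     [0, 200_000, 0]]
o = [[1, 3],                   # overhead multipliers o[i][j], cf. matrix O (4)
     [3, 1]]

n, m = len(s), len(c)
best = None
for assign in product(range(m), repeat=n):   # assign[k] = machine of process k
    # constraint (6): no machine's memory capacity may be exceeded
    if any(sum(s[k] for k in range(n) if assign[k] == i) > c[i]
           for i in range(m)):
        continue
    # rule (8): edge weight = intensity times overhead of the host machines;
    # objective (9): total weight over all ordered process pairs
    weight = sum(e[k][l] * o[assign[k]][assign[l]]
                 for k in range(n) for l in range(n) if k != l)
    if best is None or weight < best[0]:
        best = (weight, assign)

print(best)  # → (800000, (1, 1, 1)): all three processes fit on the second machine
```

The large second process cannot fit on the 2400 MB machine, and co-locating all processes on the 6000 MB machine keeps every edge at overhead 1, which is why the enumeration settles on that allocation.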
5) Solution: Given this input, the CPLEX Optimizer determines the overall communication intensity of the optimal solution to be 2,200,000; the resulting allocation is shown in Figure 6.

Fig. 6: Optimal allocation for the example input. (The input, worker, and production nodes of the example Rete network are distributed over four machines with 2400 MB, 6000 MB, 3200 MB, and 5000 MB of memory; the total communication intensity of the allocation is 2,200,000.)

IV. PERFORMANCE IMPACT OF OPTIMIZATION

To justify the beneficial effects of Communication Minimization on query performance and stability, we conducted measurements using a model validation benchmark, the Train Benchmark [5], a suite designed for evaluating the performance of graph pattern matching tools.

A. Experiments

We designed two experiments. In the first experiment, the impact of respecting allocation constraints is assessed by comparing an optimized configuration to a naïve setup that a) uses the default heap size for the Java Virtual Machines, and b) assigns close to the maximum available RAM to the JVMs. In the second experiment, we assess the impact of network traffic optimization by comparing the optimized configuration to a non-optimized setup. To emphasize the effect of network communication on overall performance, we configured the cloud environment to use connections with lower bandwidth (10 Mbit/s) that simulate a system under load. In both experiments, we ran a complex query of the Train Benchmark on instance models of increasing sizes, up to 2 million nodes and 11 million edges.

1) Hardware and software setup: The benchmark software configuration consisted of an extended Train Benchmark setup using the 4store (version 1.1.5) database in a clustered environment. We ran three virtual machines (VMs) on a private cloud powered by Apache VCL; each VM ran 64-bit Ubuntu Linux 14.04 on Intel Xeon 2.5 GHz CPUs and 8 GB of RAM. We used Oracle 64-bit Java Virtual Machines (version 1.7.0_72). We integrated our monitoring system into the benchmark environment in order to record telemetry data at each execution stage of the benchmark. The benchmark results were automatically processed by the R scripts provided by the Train Benchmark framework.

B. Results and Analysis

1) Experiment 1: Memory optimization: As can be seen in Figure 7, both the "default heap size" and the "maximum heap size" variants failed to execute successfully for the largest instance model sizes, both reporting out-of-heap-space errors in the JVMs running one of the Rete nodes. As the "default heap size" variant uses 1 GB heap limits, the JVM may get into a thrashing state under such loads and cause a timeout during measurement, which the Train Benchmark framework registers as a failed run. Similarly, in the case of the "maximum heap size" variant, due to the lack of a reasonable upper limit, the JVMs may interfere with each other's memory allocations, resulting in a runtime error.

Fig. 7: Runtime of the first evaluation of a query on a slow (10 Mbit/s) network. (Batch validation time of the RouteSensor query on a log-log scale, for the default heap, maximum heap, optimized allocation, and unoptimized allocation variants, on models ranging from 6k to 2M nodes and 24k to 11M edges.)

Fig. 8: Runtime of the transformation phase on a slow (10 Mbit/s) network. (Sum of edit times for the RouteSensor query on a log-log scale, for the same four variants and model sizes.)

2) Experiment 2: Network optimization: Figure 7 compares the optimized variant to a non-optimized one (shown as "unoptimized"). We may observe that while the overall characteristics are similar, the optimized variant shows a constant-multiplier advantage, running approximately 15–20% faster than the unoptimized variant. Figure 8 shows the reevaluation time
of a query after changes were applied to the model. In the reevaluation phase, the amount of data transmitted is significantly less than in the first evaluation phase. As the numbers are comparatively small (which is consistent with the Train Benchmark specification), the advantage of the optimized variant is within the measurement error range.

These observations are explained by the telemetry data recorded by our monitoring system during the measurement. The network traffic measurements are summarized in Figure 9, which compares the network traffic volume recorded for the "optimized" and "unoptimized" variants. It can be seen that while the overall volume is practically equivalent, its distribution between local and remote links is characteristically different (i.e. in the "optimized" case, the overall remote volume is approximately 15% lower than in the "unoptimized" case).

    Traffic [MBytes]  |    Unoptimized     |     Optimized
                      |  vm0   vm1   vm2   |  vm0   vm1   vm2
    Remote RX+TX      |  300   349   371   |  248   280   347
    Local RX+TX       |   14     2    74   |   24    20   190
    Sum remote        |        1020        |         875
    Sum local         |          90        |         234
    Total traffic     |        1110        |        1109

Fig. 9: Network traffic statistics.

C. Threats to Validity

For these experiments, we considered the following internal and external threats to validity. As transient and unknown background load in the cloud environment can introduce noise, as a countermeasure we performed several execution runs and considered their minimum values for the result plots. We strived to avoid systematic errors in the experiments (e.g. incorrect queries or workloads) by cross-checking all results with the specification of the Train Benchmark. The validity of the analysis, and especially the generalizability of the results to real-world workloads, has been thoroughly investigated in several academic papers on the Train Benchmark (e.g. [1], [6]). We believe that our measurements are faithful extensions of the Train Benchmark, and thus the results of these previous works apply to our contributions as well.

V. RELATED WORK

Based on an idea originating in mathematical economics, [7] proposes a characteristically different performance model compared to our domain. In these cases, the cost of communication comes from the costs of reaching data on other nodes, which is not applicable to data-driven distributed systems such as IncQuery-D. Furthermore, global allocation resource constraints are not considered. There are other models [8], where the cost can include the execution costs of tasks on particular processors besides the communication cost, which is a planned future direction of our work.

Apers et al. developed a scheduling-based optimization approach [9] for distributed queries over relational databases. In their model, relational operations are scheduled on computers. The model uses the estimated times of the different operations and of data transmission, as these are required for scheduling. However, as relational database queries are not incremental, memory constraints for the computers are not considered.

Many different mathematical models exist for the problem, according to the varying requirements of systems, but to the best of our knowledge, the model of allocation proposed here has not been formalized and investigated for distributed systems before, especially not in the context of incremental model queries.

VI. CONCLUSION

We presented an approach for allocation optimization in the context of distributed systems, focusing on two different optimization targets: reducing resource usage through minimizing remote network traffic, and cost reduction through minimizing the amount of computation resources required to evaluate an incremental query. Our approach is based on a mapping of the allocation optimization problems to the combinatorial optimization domain, solved with the state-of-the-art CPLEX Optimizer. We applied and evaluated this concept for the optimization of distributed incremental model queries in the context of IncQuery-D.

Our primary aim for the future is a technological adaptation of our allocation techniques to the YARN resource management framework [10]. This is directly motivated by the fact that IncQuery-D itself is moving towards a Hadoop-based implementation. In this context, experience has shown that while YARN has several built-in schedulers for resource allocation, those are not well suited to long-running, memory-intensive jobs that are inter-related by complex structural constraints (such as the Rete network). Our long-term goal is to generalize the CPLEX-based optimization approach to a wider range of Hadoop/YARN applications.

REFERENCES

[1] Szárnyas, G., et al.: IncQuery-D: A Distributed Incremental Model Query Framework in the Cloud. In: ACM/IEEE 17th International Conference on Model Driven Engineering Languages and Systems, Valencia, Spain, Springer (2014)
[2] Forgy, C.L.: Rete: A fast algorithm for the many pattern/many object pattern match problem. Artificial Intelligence 19(1) (1982) 17–37
[3] Makai, J.: Optimization of Incremental Queries in the Cloud. Report. https://tdk.bme.hu/VIK/DownloadPaper/Inkrementalis-lekerdezesek-optimalizacioja-a (2014)
[4] Van Hentenryck, P., Simonis, H., Dincbas, M.: Constraint satisfaction using constraint logic programming. Artificial Intelligence 58(1) (1992) 113–159
[5] Szárnyas, G., Semeráth, O., Ráth, I., Varró, D.: The TTC 2015 Train Benchmark Case for Incremental Model Validation. Transformation Tool Contest, 2015
[6] Izsó, B., Szatmári, Z., Bergmann, G., Horváth, Á., Ráth, I.: Towards Precise Metrics for Predicting Graph Query Performance. In: 2013 IEEE/ACM 28th International Conference on Automated Software Engineering (ASE), pages 412–431, Silicon Valley, CA, USA, 2013. IEEE.
[7] Ferguson, D.F., et al.: Economic Models for Allocating Resources in Computer Systems. Chapter 7. (1996) 156–183
[8] Fernández-Baca, D.: Allocating modules to processors in a distributed system. IEEE Transactions on Software Engineering 15(11) (1989) 1427–1436
[9] Apers, P.M.G., Hevner, A.R., Yao, S.B.: Optimization algorithms for distributed queries. IEEE Transactions on Software Engineering (1) (1983) 57–68
[10] Vavilapalli, V.K., et al.: Apache Hadoop YARN: Yet Another Resource Negotiator. In: Proceedings of the 4th Annual Symposium on Cloud Computing. ACM, 2013