Optimization of Incremental Queries in the Cloud

József Makai, Gábor Szárnyas, Ákos Horváth, István Ráth, Dániel Varró
Fault Tolerant Systems Research Group
Department of Measurement and Information Systems
Budapest University of Technology and Economics
H-1117, Magyar Tudósok krt. 2., Budapest, Hungary
jozsef.makai@inf.mit.bme.hu, {szarnyas,ahorvath,rath,varro}@mit.bme.hu

Abstract—Database and model queries are the foundations of data-driven applications. Their performance is of primary importance in model-driven software engineering (MDE), especially with the ever-growing complexity of software modeling projects. To address the scalability issues of traditional MDE tools, distributed frameworks are being developed – however, query optimization in a distributed context brings a whole new set of challenges, including capacity limits of individual nodes and network communication. In this paper, we aim to address the allocation optimization challenge in the context of distributed incremental query evaluation. Our methods are based on a combination of heuristics-based resource consumption estimations and constraint satisfaction programming. We evaluate the impact of the optimization techniques by conducting benchmark measurements.

I. INTRODUCTION

Model-driven software engineering (MDE) plays an important role in the development processes of critical embedded systems. With the dramatic increase in complexity that is also affecting critical embedded systems in recent years, modeling toolchains are facing scalability challenges as the size of design models constantly increases and automated tool features become more sophisticated.

Many scalability issues can be addressed by improving query performance. Incremental model queries aim to reduce query response time by limiting the impact of model modifications on query result calculation. This technique has been proven to improve performance dramatically in several cases (e.g. on-the-fly well-formedness validation or model synchronization), at the cost of increased memory consumption. IncQuery-D [1] aims to address the memory consumption problem by offloading memory-intensive components to a distributed architecture. IncQuery-D is a system based on a distributed Rete network [2] that can scale up to handle very large models (typically stored in distributed NoSQL or graph databases) and complex queries efficiently.

However, the introduction of a cloud and grid-like architecture brings a whole new set of query optimization challenges. In this paper, we focus on the (static) allocation optimization problem, which in our context means the assignment of memory-intensive distributed components to computation resources according to a strategy that is optimal in terms of overall cost, or favourable in terms of query evaluation performance. We propose to use combinatorial optimization methods based on constraint satisfaction programming, aided by heuristics to estimate i) the resource usage of individual system components and ii) the network traffic between distributed computation nodes. Our aim is to provide a fully automated allocation optimization mechanism for IncQuery-D.

This work was partially supported by the MONDO (EU ICT-611125) project and the Lendület programme (MTA-BME Lendület Cyber-Physical Systems Research Group).

II. BACKGROUND

A. The Allocation Problem

In the context of distributed systems, allocation means the assignment of computation resources to computation tasks. The input of the allocation consists of the computation tasks and the edges between them (which represent data-flow between two tasks), the available machines with their capacities, and possibly the quality of the network links between machines. Computation tasks have to be assigned to processes first, as we run processes (Java Virtual Machines, specifically for IncQuery-D) on machines, and the resource consumption of a process consists of the collective resource consumption of the tasks assigned to this particular process. The output of the allocation is a possible mapping of processes to machines.

1) Allocation constraints: Allocation constraints are derived from the observation that certain subsets of computation tasks use so much resources together that allocating them to a set of machines without sufficient resources would cause a bottleneck in the query evaluation or even cause the failure of the computation. We call an allocation valid if the constraints are satisfied. We assume that the tasks of query evaluation are memory-intensive computations, therefore the allocation constraints consist of memory constraints in our case. The goal of allocation optimization is to provide a valid allocation optimized for certain optimization targets.

B. Optimization Targets

In this paper, we consider two possible optimization targets.

1) Communication Minimization: This objective is supposed to increase the performance of query evaluation by reducing the overhead of the network communication originating from data transmission in the distributed system. This is mainly useful when the time spent with network communication is significant compared to the time of the local computations performed by tasks, which is true for distributed query evaluation.
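The validity notion of Section II-A reduces to a simple capacity check once tasks have been grouped into processes. The following Python sketch illustrates it; all identifiers and MB figures are made up for illustration and are not part of IncQuery-D:

```python
def is_valid(mem_need, capacity, assign):
    """Validity check: no machine's memory capacity may be exceeded.

    mem_need[j] -- predicted memory need of process j (MB)
    capacity[i] -- memory capacity of machine i (MB)
    assign[j]   -- machine hosting process j (each process gets exactly one)
    """
    used = [0] * len(capacity)
    for proc, machine in enumerate(assign):
        used[machine] += mem_need[proc]
    return all(u <= c for u, c in zip(used, capacity))

# Illustrative data: three processes, two machines.
print(is_valid([600, 2400, 5000], [6000, 3200], [0, 1, 0]))  # True
print(is_valid([600, 2400, 5000], [6000, 3200], [0, 0, 0]))  # False: 8000 MB > 6000 MB
```

An optimizer searches among exactly these valid mappings; the optimization targets below determine which valid mapping it should prefer.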
2) Cost Minimization: Cost Minimization aims to reduce the monetary cost of the computation infrastructure the system is operated on. This technique is useful for most distributed systems, as large systems usually require a lot of resources and are thus expensive to operate, especially when the system runs in a public cloud infrastructure.

C. IncQuery-D and the Rete Algorithm

IncQuery-D [1] is a distributed, incremental model query engine that aims to address the scalability issues of incremental queries over large models. The high-level architecture of the system is shown in Figure 1. IncQuery-D stores the models in distributed databases, as the models can be arbitrarily large. A typical IncQuery-D installation uses a distributed graph database such as 4store (http://4store.org/).

Fig. 1: Architecture of IncQuery-D.

1) The Rete Algorithm: The theoretical foundations of IncQuery-D are provided by the Rete algorithm [2]. For each query, an asynchronous dataflow network of communicating nodes is constructed, consisting of three main types of Rete nodes: i) the indexer layer of IncQuery-D consists of input nodes that are responsible for caching model element identifiers by type and for propagating change notifications; ii) worker nodes implement the relational data operations (such as joins and projections) that make up a query, and they also store the partial results to be propagated to their children nodes; iii) production nodes are responsible for storing and maintaining the final results of queries, as they reside at the endpoints of the network. Furthermore, they provide an interface for accessing query results and result deltas.

III. ALLOCATION OPTIMIZATION IN INCQUERY-D

A. Formalization of the Allocation Problems

1) Communication Minimization: For this problem, we need to represent the communication intensity between processes and the quality of the network connections between machines in a mathematical model that approximates data transmission time. For communication intensity, we defined our own logical data unit, the normalized tuple, which is based on the simplification that all data inside the Rete network is represented by tuples made up of identical scalar elements (strings).

Definition 1 (Normalized tuple): Given a set of model elements organized in tuples (with the same arity), the normalized tuple is defined as the product of the tuple arity (number of fields) and the number of tuples.

Definition 2 (Overhead multiplier): To describe the quality of network connections, we use a simplification called the overhead multiplier. This value represents all important parameters of the network connection as a scalar that allows us to compare the transmission times associated with identical (logical) volumes of data across various connections.

The use of the normalized tuple and the overhead multiplier has been validated by empirical measurements, as shown in Figure 2. We can also observe that the transmission time within the same host is characteristically smaller than between different hosts.

Fig. 2: Measurements of large volume data traffic (linear regressions: same host y = 0.0017x, R² = 0.9835; remote host y = 0.0089x, R² = 0.9993).
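These two notions already support a back-of-the-envelope transmission-time estimate. The sketch below uses the regression slopes read off Figure 2 (roughly 0.0017 ms and 0.0089 ms per normalized tuple for same-host and remote links); wrapping them in a helper function is our illustration, not part of IncQuery-D:

```python
def normalized_tuples(tuple_count, arity):
    """Definition 1: tuple arity (number of fields) times the number of tuples."""
    return tuple_count * arity

# Regression slopes from Figure 2, in milliseconds per normalized tuple.
MS_PER_NTUPLE_SAME_HOST = 0.0017
MS_PER_NTUPLE_REMOTE = 0.0089

def transmission_ms(ntuples, same_host):
    """Estimated time to push a given data volume over one link."""
    slope = MS_PER_NTUPLE_SAME_HOST if same_host else MS_PER_NTUPLE_REMOTE
    return ntuples * slope

# Sending 50,000 three-field tuples: 150,000 normalized tuples.
volume = normalized_tuples(50_000, 3)
print(round(transmission_ms(volume, same_host=False)))  # ~1335 ms remotely
print(round(transmission_ms(volume, same_host=True)))   # ~255 ms locally
```

Note that the remote/same-host slope ratio (about 5.2 here) plays exactly the role of an overhead multiplier between two different hosts.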
With these notions, the Communication Minimization problem can be formalized as follows. The input:

• Memory requirements of processes: a vector with the predicted memory consumption of the processes; the ith element belongs to the ith process.

    S = [s1, s2, · · · , sn]    (1)

• Memory capacity of machines: the memory capacity of the machines that can be used by the processes; the ith element belongs to the ith machine.

    C = [c1, c2, · · · , cm]    (2)

• Communication edges between processes: the matrix E contains the communication intensity between each pair of processes, measured in normalized tuples; the ith row describes the values of the ith process.

    En,n = (ek,l), 1 ≤ k, l ≤ n    (3)

• Communication overheads between machines: the matrix O contains the overhead multipliers between machines; the ith row describes the values of the ith machine.

    Om,m = (oi,j), 1 ≤ i, j ≤ m    (4)

The matrix W contains the calculated communication weights between processes for valid allocations:

    Wn,n = (wk,l), 1 ≤ k, l ≤ n    (5)

The allocation itself is described by the binary auxiliary matrix X: xi,j = 1 means that the jth process is allocated to the ith machine, and 0 means the opposite. Valid allocations have to satisfy: i) for each machine, its memory capacity cannot be exceeded by the processes placed on it,

    s1 · x1,1 + s2 · x1,2 + · · · + sn · x1,n ≤ c1
    s1 · x2,1 + s2 · x2,2 + · · · + sn · x2,n ≤ c2
    ...                                              (6)
    s1 · xm,1 + s2 · xm,2 + · · · + sn · xm,n ≤ cm

and ii) each process must be allocated to exactly one machine:

    ∀j: Σ_{i=1..m} xi,j = 1,  xi,j ∈ {0, 1}    (7)

If two processes are placed on particular machines, their communication intensity has to be multiplied by the overhead value between those machines:

    ∀i, j, k, l, k ≠ l: xi,k + xj,l ≥ 2 → wk,l = ek,l · oi,j    (8)

As the objective, we minimize the sum of the elements of the W matrix:

    weight = min { Σ_{k,l} wk,l }    (9)

2) Cost Minimization: For this problem, each machine has an associated usage cost; the ith element of the cost vector belongs to the ith machine:

    B = [b1, b2, · · · , bm]    (10)

The vector w holds the cost incurred on each machine:

    w = [w1, w2, · · · , wm]    (11)

Valid allocations satisfy the same constraints as for Communication Minimization: i) (6) and ii) (7). We include the cost of a machine if and only if at least one process is placed on that machine:

    ∀i ∈ {1, · · · , m}: Σ_{j=1..n} xi,j ≥ 1 ↔ wi = bi    (12)

As the objective, we minimize the sum of the elements of the w vector:

    cost = min { Σ_{i=1..m} wi }    (13)

B. Heuristics for Approximating Resource Consumption

The lack of exact a-priori knowledge of the optimization parameters is a well-known challenge in query optimization, which can be addressed by heuristics-based estimations that yield reasonable approximations in an efficient way. In our context, the memory usage of processes is a primary target for such estimations, for both problems. Furthermore, the communication intensity (i.e. network traffic) between processes is necessary for the Communication Minimization problem.

1) Importance of Precise Estimation: If we overestimate the memory requirement of a process and reserve much more memory than necessary, then we cannot allocate as many processes to a machine as would otherwise be possible, and thus we cannot reach possibly better allocations. On the other hand, if we underestimate the memory consumption of processes, then we risk either that the processes fail because of an insufficient amount of allocated memory, or that the allocation constraints are violated and resources are overused. As a consequence, accurate memory estimation is required for good resource utilization and for the stable operation of the system. For Communication Minimization, the precise estimation of the communication intensity is of key importance, as an imprecise estimate could misdirect the allocation.

For the Rete algorithm, which operates with collections storing static structures (tuples), the memory usage and the communication intensity of a component can be estimated using simple linear regression methods. The independent variable of these linear functions is the estimated number of model elements, measured in normalized tuples, stored by the Rete nodes.
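Such an estimator can be prototyped in a few lines with ordinary least squares. The calibration points below (stored volume in normalized tuples versus observed heap use) are invented for illustration, not measured IncQuery-D values:

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y ~ a*x + b for the linear estimator."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b

# Hypothetical calibration runs for one Rete node type:
# stored volume (normalized tuples) -> measured memory use (MB)
volumes = [100_000, 200_000, 400_000, 800_000]
memory = [90, 160, 300, 580]

a, b = fit_line(volumes, memory)
predict = lambda v: a * v + b
print(round(predict(600_000)))  # estimated MB for an unseen volume: 440
```

The same one-variable fit applies to communication intensity: the predicted number of normalized tuples stored by a node also bounds what it can emit towards its children.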
The formalized problems are solved with the CPLEX Optimizer (http://www-01.ibm.com/software/commerce/optimization/cplex-optimizer), with the constraints and the objective expressed in its modeling language (http://www-01.ibm.com/software/commerce/optimization/modeling). Constraint (8) and objective (9) take the following form:

    forall(i, j, k, l in 1..n : k != l)
      x[i][k] + x[j][l] >= 2 => w[k][l] == edges[k][l] * overheads[i][j];

    minimize sum(i, j in 1..n) w[i][j];

We also illustrate the constraints with an example.

1) Input representation: Figure 4 shows the available machines with their memory capacities and the overhead multipliers between each pair of machines. Figure 5 shows the sample Rete network with the estimations given, already assigned to processes. We use the auxiliary matrix X to describe the allocation constraints (6) and (7): if xi,j = 1, then the jth process will be allocated to the ith machine, and 0 means the opposite.

The constraints for the communication cost between processes 1 and 3, which may be placed on any pair of machines, are:

    x1,1 + x1,3 ≥ 2 → w1,3 = 200,000 · 1
    x1,1 + x2,3 ≥ 2 → w1,3 = 200,000 · 3
    x2,1 + x1,3 ≥ 2 → w1,3 = 200,000 · 3    (16)
    x2,1 + x2,3 ≥ 2 → w1,3 = 200,000 · 1

5) Solution: Given this input, the CPLEX Optimizer determines the overall communication intensity of the optimal solution to be 2,200,000, as shown in Figure 6.
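On instances of this size, the CSP solution can be cross-checked by exhaustive search. The sketch below enumerates all process-to-machine mappings, filters them through constraints (6) and (7), applies rule (8) to weight each communication edge, and minimizes objective (9); the memory, edge and overhead values are illustrative stand-ins, not the exact data of Figures 4 and 5:

```python
from itertools import product

def best_allocation(mem_need, capacity, edges, overheads):
    """Exhaustively minimize the summed communication weight (objective (9))."""
    n, m = len(mem_need), len(capacity)
    best_weight, best_assign = float("inf"), None
    # constraint (7): every process goes to exactly one machine
    for assign in product(range(m), repeat=n):
        used = [0] * m
        for proc, mach in enumerate(assign):
            used[mach] += mem_need[proc]
        # constraint (6): no machine over its memory capacity
        if any(used[i] > capacity[i] for i in range(m)):
            continue
        # rule (8): edge weight = intensity * overhead of the hosting machine pair
        weight = sum(edges[k][l] * overheads[assign[k]][assign[l]]
                     for k in range(n) for l in range(n) if k != l)
        if weight < best_weight:
            best_weight, best_assign = weight, assign
    return best_weight, best_assign

# Illustrative instance: four processes, two machines, remote overhead 3.
mem_need  = [600, 2400, 3000, 2000]   # MB
capacity  = [6000, 3200]              # MB
edges     = [[0, 200_000, 0, 0],      # normalized tuples per edge
             [0, 0, 1_000_000, 0],
             [0, 0, 0, 200_000],
             [0, 0, 0, 0]]
overheads = [[1, 3],
             [3, 1]]
weight, assign = best_allocation(mem_need, capacity, edges, overheads)
print(weight, assign)  # 1800000 (0, 0, 0, 1)
```

The search colocates the heavy 1,000,000-tuple edge and pays the overhead of 3 only on the lightest edge, mirroring what the CSP solver does at scale.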
Fig. 6: Optimal allocation for the example input (overall communication = 2,200,000).

IV. PERFORMANCE IMPACT OF OPTIMIZATION

To justify the beneficial effects of Communication Minimization on query performance and stability, we conducted measurements using a model validation benchmark, the Train Benchmark [5], a suite designed for evaluating the performance of graph pattern matching tools.

A. Experiments

We designed two experiments. In the first experiment, the impact of respecting the allocation constraints is assessed by comparing an optimized configuration to a naïve setup that a) uses the default heap size for the Java Virtual Machines, and b) assigns close to the maximum available RAM to the JVMs. In the second experiment, we assess the impact of network traffic optimization by comparing the optimized configuration to a non-optimized setup. To emphasize the effect of network communication on the overall performance, we configured the cloud environment to use connections with a lower bandwidth (10 Mbit/s) that simulate a system under load. In both experiments, we ran a complex query of the Train Benchmark on instance models of increasing sizes, up to 2 million nodes and 11 million edges.

1) Hardware and software setup: The benchmark software configuration consisted of an extended Train Benchmark setup using the 4store (version 1.1.5) database in a clustered environment. We ran three virtual machines (VMs) on a private cloud powered by Apache VCL; each VM was running 64-bit Ubuntu Linux 14.04 on Intel Xeon 2.5 GHz CPUs and 8 GB of RAM. We used Oracle 64-bit Java Virtual Machines (version 1.7.0_72). We integrated our monitoring system into the benchmark environment in order to record telemetry data at each execution stage of the benchmark. The benchmark results were automatically processed by the R scripts provided by the Train Benchmark framework.

B. Results and Analysis

1) Experiment 1: Memory optimization: As can be seen in Figure 7, both the "default heap size" and the "maximum heap size" variants failed to execute successfully for the largest instance model sizes, both reporting out-of-heap-space errors in the JVMs running one of the Rete nodes. As the "default heap size" variant uses 1 GB heap limits, the JVM may get into a thrashing state under such loads and cause a timeout during the measurement, which the Train Benchmark framework registers as a failed run. Similarly, in the case of the "maximum heap size" variant, due to the lack of a reasonable upper limit, the JVMs may interfere with each other's memory allocations, resulting in a runtime error.

Fig. 7: Runtime of the first evaluation of a query on a slow (10 Mbit/s) network.

2) Experiment 2: Network optimization: Figure 7 also compares the optimized variant to a non-optimized one (shown as "unoptimized"). We may observe that while the overall characteristics are similar, the optimized variant shows a constant-multiplier advantage, running approximately 15–20% faster than the unoptimized variant. Figure 8 shows the reevaluation time of the query after changes were applied to the model. In the reevaluation phase, the amount of data transmitted is significantly less than in the first evaluation phase. As the numbers are comparatively small (which is consistent with the Train Benchmark specification), the advantage of the optimized variant is within the measurement error range.

Fig. 8: Runtime of the transformation phase on a slow (10 Mbit/s) network.

These observations are explained by the telemetry data recorded by the monitoring system during the measurement.
The network traffic measurements are summarized in Figure 9, which compares the network traffic volume recorded for the "optimized" and "unoptimized" variants. It can be seen that while the overall volume is practically equivalent, its distribution between local and remote links is characteristically different (i.e. in the "optimized" case, the overall remote volume is approximately 15% lower than in the "unoptimized" case).

    Traffic [MBytes]    Unoptimized (vm0 / vm1 / vm2)    Optimized (vm0 / vm1 / vm2)
    Remote RX+TX        300 / 349 / 371                  248 / 280 / 347
    Local RX+TX          14 /   2 /  74                   24 /  20 / 190
    Sum remote          1020                              875
    Sum local             90                              234
    Total traffic       1110                             1109

Fig. 9: Network traffic statistics.

C. Threats to Validity

For these experiments, we considered the following internal and external threats to validity. As transient and unknown background load in the cloud environment can introduce noise, we performed several execution runs and considered their minimum value for the result plots as a countermeasure. We strived to avoid systematic errors in the experiments (e.g. incorrect queries or workloads) by cross-checking all results with the specification of the Train Benchmark. The validity of the analysis, and especially the generalizability of the results to real-world workloads, has been thoroughly investigated in several academic papers on the Train Benchmark (e.g. [1], [6]). We believe that our measurements are faithful extensions of the Train Benchmark and thus the results of these previous works apply to our contributions as well.

V. RELATED WORK

Based on an idea originating in mathematical economics, [7] proposes a characteristically different performance model compared to our domain. In these cases, the cost of communication comes from the costs of reaching data on other nodes, which is not applicable to data-driven distributed systems such as IncQuery-D. Furthermore, global allocation resource constraints are not considered. There are other models [8] where the cost can include the execution costs of tasks on particular processors besides the communication cost, which is a planned future direction of our work.

Apers et al. developed a scheduling-based optimization approach [9] for distributed queries over relational databases. In their model, relational operations are scheduled on computers. The model uses the estimated time of the different operations and of data transmission, as this is required for scheduling. However, as relational database queries are not incremental, memory constraints for the computers are not considered.

Many different mathematical models exist for the problem, according to the varying requirements of systems, but to the best of our knowledge the currently proposed model of allocation has not been formalized and investigated for distributed systems before, especially in the context of incremental model queries.

VI. CONCLUSION

We presented an approach for allocation optimization in the context of distributed systems, focusing on two different optimization targets: reducing resource usage through minimizing remote network traffic, and cost reduction through minimizing the amount of computation resources required to evaluate an incremental query. Our approach is based on a mapping of the allocation optimization problems to the combinatorial optimization domain, solved with the state-of-the-art CPLEX Optimizer. We applied and evaluated this concept for the optimization of distributed incremental model queries in the context of IncQuery-D.

Our primary aim for the future is a technological adaptation of our allocation techniques to the YARN resource management framework [10]. This is directly motivated by the fact that IncQuery-D itself is moving towards a Hadoop-based implementation. In this context, experience has shown that while YARN has several built-in schedulers for resource allocation, those are not well suited to long-running, memory-intensive jobs that are inter-related by complex structural constraints (such as the Rete network). Our long-term goal is to generalize the CPLEX-based optimization approach to a wider range of Hadoop/YARN applications.

REFERENCES

[1] Szárnyas, G. et al.: IncQuery-D: A Distributed Incremental Model Query Framework in the Cloud. In: ACM/IEEE 17th International Conference on Model Driven Engineering Languages and Systems, Valencia, Spain, Springer (2014)
[2] Forgy, C.L.: Rete: A fast algorithm for the many pattern/many object pattern match problem. Artificial Intelligence 19(1) (1982) 17–37
[3] Makai, J.: Optimization of Incremental Queries in the Cloud. Report. https://tdk.bme.hu/VIK/DownloadPaper/Inkrementalis-lekerdezesek-optimalizacioja-a (2014)
[4] Van Hentenryck, P., Simonis, H., Dincbas, M.: Constraint satisfaction using constraint logic programming. Artificial Intelligence 58(1) (1992) 113–159
[5] Szárnyas, G., Semeráth, O., Ráth, I., Varró, D.: The TTC 2015 Train Benchmark Case for Incremental Model Validation. Transformation Tool Contest, 2015
[6] Izsó, B., Szatmári, Z., Bergmann, G., Horváth, Á., Ráth, I.: Towards Precise Metrics for Predicting Graph Query Performance. In: 2013 IEEE/ACM 28th International Conference on Automated Software Engineering (ASE), pages 412–431, Silicon Valley, CA, USA, 2013. IEEE.
[7] Ferguson, D.F. et al.: Economic models for allocating resources in computer systems. (1996) 156–183
[8] Fernández-Baca, D.: Allocating modules to processors in a distributed system. IEEE Transactions on Software Engineering 15(11) (1989) 1427–1436
[9] Apers, P.M.G., Hevner, A.R., Yao, S.B.: Optimization algorithms for distributed queries. IEEE Transactions on Software Engineering (1) (1983) 57–68
[10] Vavilapalli, V.K. et al.: Apache Hadoop YARN: Yet Another Resource Negotiator. In: Proceedings of the 4th Annual Symposium on Cloud Computing. ACM, 2013