Towards Decentralized Auto-Scaling Policies for Data Stream Processing Applications

Gabriele Russo Russo
Department of Civil Engineering and Computer Science Engineering, University of Rome Tor Vergata, Italy
russo.russo@ing.uniroma2.it

Abstract. Data Stream Processing applications can process large data volumes in near real-time. In order to face varying workloads in a scalable and cost-effective manner, it is critical to adjust the application parallelism at run-time. We formulate the elasticity problem as a Markov Decision Process (MDP). As solving the MDP requires full knowledge of the system dynamics, which is rarely available, we rely on model-based Reinforcement Learning to improve the scaling policy at run-time. We show promising results even for a decentralized approach, compared to the optimal MDP solution.

Keywords: Data Stream Processing, Elasticity, Reinforcement Learning

1 Introduction

Emerging application scenarios (e.g., social analytics, fraud detection, Smart City) leverage Data Stream Processing (DSP) to process data streams in near real-time. A DSP application is usually represented as a directed acyclic graph (DAG), with data sources and operators as vertices, and streams as edges [5]. Each operator continuously receives data (e.g., tuples), applies a transformation, and generates new outgoing streams. A commonly adopted DSP optimization is data parallelism, which consists in scaling in/out the parallel instances of an operator, so that each instance processes a portion of the incoming data (at the cost of using more resources) [5]. Due to the unpredictable and variable rate at which the sources produce the streams, a key feature for DSP systems is the capability of elastically adjusting the parallelism at run-time. Most existing DSP frameworks allow allocating more than one replica per operator, but their support for run-time reconfiguration is quite limited, as regards both the mechanisms and the policies.

In this paper, we focus on the auto-scaling policies. We formalize the elasticity problem for a DSP application as a Markov Decision Process (MDP), presenting both centralized and decentralized formulations. Unfortunately, in practice the optimal MDP policy cannot be determined, because several system dynamics may be unknown. To cope with the model uncertainty, we rely on Reinforcement Learning (RL) approaches, which learn the optimal MDP policy on-line by interacting with the system. Specifically, we present a model-based RL solution that leverages the partial knowledge of the model to speed up the learning process.

Elasticity for DSP is attracting many research efforts [1], with most approaches relying on heuristics to determine the scaling decisions. An optimization model that also considers the operator placement problem has been presented in [2], but it cannot be easily solved in a decentralized manner. Here we describe a simpler model, for which we can derive a decentralized formulation. The application of RL techniques to DSP elasticity is still quite limited. Heinze et al. [4] propose a simple RL approach to control the system utilization, but they focus on infrastructure-level elasticity. Lombardi et al. [6] exploit RL in their elasticity framework as well, but the learning algorithm is only used for threshold tuning.
In [3], different RL algorithms have been compared for solving the elasticity problem for a single DSP operator in isolation, while in this work we consider whole applications.

In the rest of this paper, we first formulate the elasticity problem as an MDP in Sect. 2, and present in Sect. 3 the RL-based algorithm for learning the scaling policy; we evaluate the proposed solutions in Sect. 4, and conclude in Sect. 5.

2 Problem Formulation

In this paper, we consider the elasticity problem for a DSP application composed of N operators. Each operator is possibly replicated into a number of instances and, without loss of generality, we assume even distribution of the incoming data among the parallel instances. For each operator, an Operator Manager monitors the operator functionality, while an Application Manager supervises the whole application. The number of parallel instances used by each operator is adjusted either by its Operator Manager (decentralized adaptation) or by the Application Manager (centralized adaptation). At each decision step, for each operator we can add an instance, terminate one, or keep the current parallelism. Following a scaling decision, the operator is subject to a reconfiguration process; as the integrity of the streams and of the operator internal state must be preserved, the whole application is usually paused during the process, leading to downtime [2].

Our goal is to take reconfiguration decisions so as to minimize a long-term cost function which accounts for the downtime and for the monetary cost of running the application. The latter comprises (i) the cost of the instances allocated for the next decision period, and (ii) a penalty in case of a Service Level Agreement (SLA) violation. In particular, we consider a constraint on the application response time, defined as the maximal source-to-sink processing latency over the application DAG, so that a penalty is paid every time the response time exceeds a given threshold. In order to keep the system model simple, we consider a deployment scenario with (i) homogeneous computing resources on which the operator instances are executed, and (ii) negligible communication latency between them. We defer to future work the extension of the model to a more realistic distributed setting.

System Model. In the considered system, reconfiguration decisions are taken periodically. Therefore, we consider a slotted-time system with fixed-length intervals of length Δt, the i-th slot corresponding to the time interval [iΔt, (i+1)Δt]. We denote by k_op,i ∈ [1, K_max] the number of parallel instances of operator op at the beginning of slot i, and by λ_op,i its average input rate measured during the previous slot. Additionally, we use Λ_i to denote the overall application input rate (i.e., the total data sources emission rate). At the beginning of slot i, a decision a_i is made on whether to reconfigure each operator. We first consider a centralized model, in which the reconfiguration decisions are taken by the Application Manager; then, at the end of the section, we consider the case in which the responsibility of making scaling decisions is decentralized, and each Operator Manager acts as an independent agent. In both cases, we formalize the resulting problem as a discrete-time Markov Decision Process (MDP).
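As a purely illustrative sketch of this slotted-time interaction, the centralized variant might be structured as follows; the monitoring and reconfiguration hooks (monitor, apply_action) are hypothetical placeholders, not part of any specific DSP framework:

```python
import time

DELTA_T = 60  # slot length Delta_t, in seconds (illustrative value)

def control_loop(operators, policy, monitor, apply_action):
    """Slotted-time control loop (centralized variant).

    operators:    list of operator identifiers
    policy:       maps the observed state to a joint scaling action,
                  one of {-1, 0, +1} per operator
    monitor:      hypothetical hook returning, for the previous slot, the
                  average input rate and current parallelism of each operator
    apply_action: hypothetical hook triggering the actual scale-in/out
                  (may pause the application, i.e., cause downtime)
    """
    while True:
        rates, parallelism = monitor()                    # lambda_{op,i}, k_{op,i}
        total_rate = sum(rates[op] for op in operators)   # Lambda_i
        state = (total_rate, tuple(parallelism[op] for op in operators))
        action = policy(state)                            # decision a_i for slot i
        apply_action(action)
        time.sleep(DELTA_T)                               # wait for the next decision step
```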
Centralized Elasticity Problem. An MDP is defined by a 5-tuple ⟨S, A, p, c, γ⟩, where S is a finite set of states, A(s) a finite set of actions for each state s, p(s'|s, a) are the state transition probabilities, c(s, a) is the cost incurred when action a is executed in state s, and γ ∈ [0, 1] is a future cost discounting factor.

We define the state of the system at time i as s_i = (Λ_i, k_1,i, k_2,i, ..., k_N,i). For the sake of analysis, we discretize the arrival rate by assuming Λ_i ∈ {0, Λ̄, ..., LΛ̄}, where Λ̄ is a suitable quantum. For each state s, the action set is A(s) = A_1(s) × ··· × A_N(s), where, for each operator op, A_op(s) = {+1, −1, 0} (except for the boundary cases with minimum or maximum replication). System state transitions occur as a consequence of scaling decisions and arrival rate variations. It is easy to realize that the system dynamics comprise a stochastic component, due to the exogenous rate variation, and a deterministic component, due to the fact that, given action a and the current number of instances, we can readily determine the next number of instances. An example of a system state transition is illustrated in Fig. 1.

To each state-action pair we associate a cost c(s, a) that captures the cost of operating the system in state s and carrying out action a, including:
1. the resource cost c_res(s, a), required for running (k_op + a_op) instances for each operator op, assuming a fixed cost per instance;
2. the reconfiguration cost c_rcf(a), which accounts for the application downtime, assuming a constant reconfiguration penalty;
3. the SLA violation cost c_SLA(s, a), which captures the penalty incurred whenever the response time T(s, a) violates the threshold T_SLA.

We define the cost function c(s, a) as the weighted sum of the normalized terms:

$c(s,a) = w_{res} \frac{\sum_{op=1}^{N} (k_{op} + a_{op})}{N K_{max}} + w_{rcf} \mathbb{1}_{\{\exists op:\, a_{op} \neq 0\}} + w_{SLA} \mathbb{1}_{\{T(s,a) > T_{SLA}\}}$   (1)

where w_res, w_rcf and w_SLA, with w_res + w_rcf + w_SLA = 1, are non-negative weights.

Figure 1: Example of a state transition in the centralized MDP model. At time i, the application input rate is Λ_i and the operators run a single instance each, except for the second one, which runs two. The Application Manager picks action (0, +1, +1, 0) at time i, thus adding an instance of the second and the third operator. The resulting parallelism degrees of the operators at time i+1 are, respectively, 1, 3, 2, and 1. The input rate at time i+1 is Λ_{i+1}, which obviously does not depend on a_i.
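To make Eq. (1) concrete, a minimal sketch of the cost computation might look as follows; the response time T(s, a) is assumed to be provided by some external estimator (e.g., a queueing model) and is passed in as a plain number:

```python
def centralized_cost(k, a, T, w_res, w_rcf, w_sla, K_max, T_sla):
    """Cost c(s, a) of Eq. (1): normalized resource usage, a reconfiguration
    penalty, and an SLA-violation penalty (w_res + w_rcf + w_sla = 1).

    k, a: per-operator parallelism and scaling decisions (+1, -1, 0)
    T:    estimated response time T(s, a) of the post-action configuration
    """
    N = len(k)
    new_k = [k_op + a_op for k_op, a_op in zip(k, a)]
    c_res = sum(new_k) / (N * K_max)                      # normalized resource cost
    c_rcf = 1.0 if any(a_op != 0 for a_op in a) else 0.0  # downtime indicator
    c_sla = 1.0 if T > T_sla else 0.0                     # SLA violation indicator
    return w_res * c_res + w_rcf * c_rcf + w_sla * c_sla
```

The local cost used in the decentralized formulation below has the same structure, with a single operator and a utilization threshold replacing the response-time constraint.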
Decentralized Elasticity Problem. In the decentralized adaptation scenario, we assume that each Operator Manager independently acts on its associated operator, having only a local view of the system. We again rely on an MDP to formalize the cost minimization problem for each agent (i.e., each Operator Manager). Omitting the reference to the specific operator, we define the state at time i as the pair s_i = (λ_i, k_i), where λ_i is discretized using a suitable quantum for each operator. The action set is simply A(s) = {+1, −1, 0} (except for the boundary cases with minimum or maximum replication). Because the agents do not have a global view of the application, they can only optimize local metrics, and thus we have to formulate a new local cost function c'(s, a). We replace the SLA violation penalty with one based on the operator utilization U(s, a) and a target utilization Ū. We get:

$c'(s,a) = w_{res} \frac{k + a}{K_{max}} + w_{rcf} \mathbb{1}_{\{a \neq 0\}} + w_{util} \mathbb{1}_{\{U(s,a) > \bar{U}\}}$   (2)

where w_res, w_rcf and w_util, with w_res + w_rcf + w_util = 1, are non-negative weights.

3 Learning an Optimal Policy

A policy is a function π that associates each state s with the action a to choose. For a given policy π, let V^π(s) be the value function, i.e., the expected infinite-horizon discounted cost starting from s. It is also convenient to define the action-value function Q^π : S × A → ℝ, which is the expected discounted cost achieved by taking action a in state s and then following the policy π:

$Q^{\pi}(s,a) = c(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s'|s,a)\, V^{\pi}(s'), \quad \forall s \in \mathcal{S}$   (3)

It is easy to realize that the value function V and the Q function are closely related, in that V^π(s) = min_{a∈A(s)} Q^π(s, a), ∀s ∈ S. More importantly, the knowledge of the Q function is fundamental in that it directly provides the associated policy: for a given function Q, the corresponding policy is π(s) = arg min_{a∈A(s)} Q(s, a), ∀s ∈ S. We search for the optimal MDP policy π*, which satisfies the Bellman optimality equation:

$V^{\pi^*}(s) = \min_{a \in \mathcal{A}(s)} \left\{ c(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s'|s,a)\, V^{\pi^*}(s') \right\}, \quad \forall s \in \mathcal{S}$   (4)

In the ideal situation, we have full knowledge of the system, and we can directly compute π* using the Value Iteration algorithm [7]. In more realistic cases, we have only partial knowledge of the underlying system model (e.g., the workload distribution is usually unknown). We can then resort to Reinforcement Learning (RL) approaches, which are characterized by the basic principle of learning the optimal policy by direct interaction with the system. In particular, we consider a model-based RL algorithm that, at each time step, improves its estimates of the unknown system parameters and performs an iteration of the Value Iteration algorithm (see Algorithm 1). Simpler model-free RL algorithms like Q-learning have been shown to achieve bad performance even on smaller tasks [3].
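For reference, the ideal full-knowledge case mentioned above can be solved with a few lines of tabular Value Iteration; the sketch below assumes enumerable states and actions, with the transition probabilities p and costs c available as lookup tables (illustrative data structures, not tied to a specific implementation):

```python
def value_iteration(states, actions, p, c, gamma=0.99, eps=1e-6):
    """Tabular Value Iteration for the discounted-cost MDP of Eq. (4).

    states:  iterable of states s
    actions: actions[s] -> admissible actions A(s)
    p:       p[(s, a)] -> dict {s_next: probability}
    c:       c[(s, a)] -> immediate cost
    """
    def q_value(s, a, V):
        # expected discounted cost of taking a in s and acting optimally afterwards
        return c[(s, a)] + gamma * sum(prob * V[s2] for s2, prob in p[(s, a)].items())

    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = min(q_value(s, a, V) for a in actions[s])  # Bellman backup (Eq. 4)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < eps:      # stop once the value function has (approximately) converged
            break
    # extract the greedy (cost-minimizing) policy
    policy = {s: min(actions[s], key=lambda a: q_value(s, a, V)) for s in states}
    return V, policy
```

Algorithm 1 reuses the same Bellman backup, but replaces p and c with on-line estimates.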
Algorithm 1: RL-based Elastic Control Algorithm
  Initialize the action-value function Q
  loop
    choose an action a_i (based on the current estimates of Q)
    observe the next state s_{i+1} and the incurred cost c_i
    update the estimates of the unknown system parameters
    for all s ∈ S do
      for all a ∈ A(s) do
        Q_i(s, a) ← ĉ_i(s, a) + γ Σ_{s'∈S} p̂(s'|s, a) min_{a'∈A(s')} Q_{i−1}(s', a')
      end for
    end for
  end loop

We first consider the case in which the operator response time model is known, and let the algorithm learn the state transition probabilities. In order to estimate p(s'|s, a), it suffices to estimate the input rate transition probabilities P[λ_{i+1} = λ' | λ_i = λ] (to simplify notation, we use λ to denote the input rate; in the centralized model, we use the same estimates for the total application input rate Λ), since the dynamics related to the number of instances are known and deterministic. Hereafter, since λ takes values in a discrete set, we write for short P_{j,j'} = P[λ_{i+1} = j'λ̄ | λ_i = jλ̄], with j, j' ∈ {0, ..., L}. Let n_{i,jj'} be the number of times the arrival rate changes from jλ̄ to j'λ̄ in the interval {1, ..., i}, j, j' ∈ {0, ..., L}. At time i, the transition probability estimates are

$\hat{P}_{j,j'} = \frac{n_{i,jj'}}{\sum_{l=0}^{L} n_{i,jl}}$   (5)

If we remove the assumption of a known response time model, we have to estimate the cost c(s, a) as well, because we can no longer predict the SLA/utilization violations. Thus, we split c(s, a) and c'(s, a), respectively defined in (1) and (2), into known and unknown terms: the known term c_k(s, a) accounts for the reconfiguration cost and the resource cost, whereas the unknown term c_u(s, a) represents the SLA (or utilization) violation penalty. We use a simple exponentially weighted average to estimate the unknown cost:

$\hat{c}_{u,i}(s,a) \leftarrow (1 - \alpha)\, \hat{c}_{u,i-1}(s,a) + \alpha\, c_{u,i}$   (6)

where c_{u,i} = w_SLA (or w_util) if a violation occurred at time i, and 0 otherwise.

As regards the complexity of the algorithm, the size of the state-action space is critical, since each learning iteration requires O(|S|²|A|²) operations. We observe that in the centralized model |S| and |A| grow exponentially with the number of operators N, whereas they are not influenced by N in the decentralized model.
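As an illustration of the estimation steps of Algorithm 1, a minimal sketch of the updates of Eqs. (5) and (6) might look as follows (the class is a hypothetical helper, not taken from any existing framework; the uniform fallback before the first observation is an assumption of this sketch):

```python
class ModelEstimator:
    """On-line estimates of the unknown MDP parameters used in Algorithm 1."""

    def __init__(self, L, alpha=0.1):
        self.L = L                      # number of discretization levels for the rate
        self.alpha = alpha              # smoothing factor of Eq. (6)
        # transition counts n_{jj'} between discretized rate levels
        self.counts = [[0] * (L + 1) for _ in range(L + 1)]
        self.c_unknown = {}             # estimated violation penalty, c_u(s, a)

    def observe_rate_transition(self, j, j_next):
        """Record a transition of the discretized input rate from level j to j_next."""
        self.counts[j][j_next] += 1

    def rate_transition_prob(self, j, j_next):
        """Estimate P_{j,j'} as in Eq. (5); uniform fallback before any observation."""
        total = sum(self.counts[j])
        return self.counts[j][j_next] / total if total > 0 else 1.0 / (self.L + 1)

    def update_unknown_cost(self, s, a, violation, w_penalty):
        """EWMA update of Eq. (6); the violation penalty is w_SLA (or w_util)."""
        c_now = w_penalty if violation else 0.0
        old = self.c_unknown.get((s, a), 0.0)
        self.c_unknown[(s, a)] = (1 - self.alpha) * old + self.alpha * c_now
```

At each decision step, these estimates are combined with the known cost terms and a single sweep of the Bellman backup of Algorithm 1 is performed.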
4 Evaluation

We evaluate the presented models by simulation, and compare the policies learned through RL to the optimal MDP one. In order to explicitly solve the MDP, we need a state transition probability matrix, which is not available in practical scenarios. Thus, for evaluation, we consider a dataset made available by Chris Whong (http://chriswhong.com/open-data/foil_nyc_taxi/) that contains information about taxi activity, and extract the state transition probabilities by applying (5). We then evaluate the proposed solutions on a new workload, generated according to those probabilities. For simplicity, we consider a pipeline application, composed of a data source and up to 4 operators. Each operator runs at most K_max = 5 instances, each behaving as an M/D/1 queue with service rate μ_op. For evaluation, we consider a scenario with slightly different service rates, and set μ_1 = 3.7, μ_2 = μ_4 = 3.3, and μ_3 = 2.7 tuple/s. Because of space limitations, we defer the evaluation of real-world topologies to future work.

We consider Δt = 1 min, and aggregate the events in the dataset over one-minute windows. We assume Λ_i = λ_o,i, ∀o, discretized with λ̄ = 20 tuple/min. For the cost function, we set w_SLA = w_util = w_rcf = 0.4, w_res = 0.2, T_SLA = 650 ms, and Ū ∈ {0.6, 0.7, 0.8}. As regards the learning algorithm, we set γ = 0.99 and α = 0.1. We compare the results obtained by solving the MDP to those achieved by the centralized RL algorithm (with and without the known response time model) and by the decentralized solution.

Table 1: Results for a 3-operator application. The "+" sign denotes knowledge of the response time model.

Scaling Policy             Avg. Cost   SLA Violations   Reconf.   Avg. Instances
Centralized MDP            0.163       8903             13882     10.95
Centralized RL+            0.164       9505             13684     10.94
Centralized RL             0.167       15681            14579     10.79
Decentralized (Ū = 0.6)    0.178       3639             30104     11.46
Decentralized (Ū = 0.7)    0.173       17111            30404     10.30
Decentralized (Ū = 0.8)    0.205       79670            29681     9.15

In Table 1 we report the results for a 3-operator topology. As expected, the minimum average cost is achieved by solving the MDP; interestingly, the centralized RL solution incurs an almost negligible performance degradation, and the gap with the decentralized approach is limited as well. However, we note that the performance of the decentralized solution depends on the target utilization Ū, which still has to be set manually in our work. Setting too high (or too low) a value results in a different trade-off between SLA violations and used instances, with negative effects on the overall cost as well. The decentralized solution shows a higher number of reconfigurations, due to the lack of coordination between the agents.

Figure 2: Average cost during one simulated year for a 3-operator application (a), and for a varying number of operators (b). The "+" sign denotes knowledge of the response time model. (Panel (a): average cost vs. time, in days; panel (b): average cost vs. number of operators.)

As illustrated in Fig. 2a, the convergence speed of the different solutions is similar, except for the centralized RL algorithm: in the absence of the response time model, it is indeed significantly slower to learn than the other solutions. When the response time model is known, the algorithm converges much faster, despite the large state-action space. We also compare the decentralized approach to the MDP while varying the number of operators in the application. As shown in Fig. 2b, the cost gap between the two solutions slightly increases as the application gets more complex. However, we observe that the decentralized algorithm does not suffer from scalability issues as the number of operators increases, whereas solving the centralized problem quickly becomes impractical.
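As a closing note on the simulation setup, one possible instantiation of the response-time model exploited above, consistent with the M/D/1 assumption, is sketched below; the specific mean-value formula and the summation of per-operator latencies along the pipeline are assumptions of this sketch:

```python
def mean_response_time(arrival_rate, parallelism, service_rates):
    """Estimate the end-to-end response time of a pipeline of replicated operators.

    Each instance is modeled as an M/D/1 queue (mean waiting time
    W = rho / (2 * mu * (1 - rho))), with the operator input evenly split
    among its instances; per-operator latencies are summed, which matches
    the source-to-sink definition of response time for a linear topology.

    arrival_rate:  tuples/s offered to each operator of the pipeline
    parallelism:   number of instances k_op of each operator
    service_rates: service rate mu_op (tuples/s) of a single instance
    """
    total = 0.0
    for k, mu in zip(parallelism, service_rates):
        lam = arrival_rate / k                       # even split among the k instances
        rho = lam / mu
        if rho >= 1.0:
            return float("inf")                      # unstable configuration
        waiting = rho / (2.0 * mu * (1.0 - rho))     # M/D/1 mean waiting time
        total += waiting + 1.0 / mu                  # queueing + deterministic service
    return total

# Hypothetical usage with the service rates of Sect. 4 (3.7, 3.3, 2.7, 3.3 tuple/s):
# mean_response_time(5.0, [2, 2, 3, 2], [3.7, 3.3, 2.7, 3.3])
```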
5 Conclusion

In this paper we have formalized the elasticity problem for DSP applications as a Markov Decision Process, and proposed a Reinforcement Learning based solution to cope with the limited knowledge of the system model. Our numerical evaluation shows promising results even for a fully decentralized solution which, leveraging the available knowledge about the system, does not suffer from the extremely slow convergence of model-free RL algorithms. In practical scenarios, we could also combine the proposed solution with a simple threshold-based policy to be used at the beginning, while the agents learn a good policy to be adopted afterwards.

For future work, our goal is twofold. We plan to improve the decentralized learning algorithm by exploring RL techniques specifically targeted at multi-agent systems. At the same time, we will extend the model to cope with a more complex and realistic scenario, considering, e.g., resource heterogeneity and network latency in distributed deployments.

References

1. Assuncao, M.D., da Silva Veith, A., Buyya, R.: Distributed data stream processing and edge computing: A survey on resource elasticity and future directions. Journal of Network and Computer Applications 103, 1–17 (2018)
2. Cardellini, V., Lo Presti, F., Nardelli, M., Russo Russo, G.: Optimal operator deployment and replication for elastic distributed data stream processing. Concurrency and Computation: Practice and Experience (2017), http://dx.doi.org/10.1002/cpe.4334
3. Cardellini, V., Lo Presti, F., Nardelli, M., Russo Russo, G.: Auto-scaling in data stream processing applications: A model based reinforcement learning approach. In: Proc. of InfQ '17 (in conjunction with VALUETOOLS '17) (2018, to appear)
4. Heinze, T., Pappalardo, V., Jerzak, Z., Fetzer, C.: Auto-scaling techniques for elastic data stream processing. In: Proc. IEEE ICDEW '14. pp. 296–302 (2014)
5. Hirzel, M., Soulé, R., Schneider, S., Gedik, B., Grimm, R.: A catalog of stream processing optimizations. ACM Comput. Surv. 46(4) (Mar 2014)
6. Lombardi, F., Aniello, L., Bonomi, S., Querzoni, L.: Elastic symbiotic scaling of operators and resources in stream processing systems. IEEE Transactions on Parallel and Distributed Systems PP(99), 1–1 (2017)
7. Sutton, R.S., Barto, A.G.: Reinforcement learning: An introduction. MIT Press, Cambridge (1998)