Towards Decentralized Auto-Scaling Policies for Data Stream Processing Applications

Gabriele Russo Russo
Department of Civil Engineering and Computer Science Engineering, University of Rome Tor Vergata, Italy
russo.russo@ing.uniroma2.it

Abstract. Data Stream Processing applications can process large data volumes in near real-time. In order to face varying workloads in a scalable and cost-effective manner, it is critical to adjust the application parallelism at run-time. We formulate the elasticity problem as a Markov Decision Process (MDP). As solving the MDP requires full knowledge of the system dynamics, which is rarely available, we rely on model-based Reinforcement Learning to improve the scaling policy at run-time. We show promising results even for a decentralized approach, compared to the optimal MDP solution.

Keywords: Data Stream Processing, Elasticity, Reinforcement Learning

1 Introduction

Emerging application scenarios (e.g., social analytics, fraud detection, Smart City) leverage Data Stream Processing (DSP) to process data streams in near real-time. A DSP application is usually represented as a directed acyclic graph (DAG), with data sources and operators as vertices, and streams as edges [5]. Each operator continuously receives data (e.g., tuples), applies a transformation, and generates new outgoing streams. A commonly adopted DSP optimization is data parallelism, which consists in scaling in/out the parallel instances of an operator, so that each instance processes a portion of the incoming data (at the cost of using more resources) [5]. Due to the unpredictable and variable rate at which the sources produce the streams, a key feature for DSP systems is the capability of elastically adjusting the parallelism at run-time. Most existing DSP frameworks allow allocating more than one replica per operator, but their support for run-time reconfiguration is quite limited, as regards both the mechanisms and the policies.

In this paper, we focus on the auto-scaling policies. We formalize the elasticity problem for a DSP application as a Markov Decision Process (MDP), presenting both centralized and decentralized formulations. Unfortunately, in practice the optimal MDP policy cannot be determined, because several system dynamics may be unknown. To cope with the model uncertainty, we rely on Reinforcement Learning (RL) approaches, which learn the optimal MDP policy on-line by interacting with the system. Specifically, we present a model-based RL solution that leverages the partial knowledge of the model to speed up the learning process.

Elasticity for DSP is attracting many research efforts [1], with most approaches relying on heuristics to determine the scaling decisions. An optimization model that also considers the operator placement problem has been presented in [2], but it cannot be easily solved in a decentralized manner. Here we describe a simpler model, for which we can derive a decentralized formulation. The application of RL techniques to DSP elasticity is still quite limited. Heinze et al. [4] propose a simple RL approach to control the system utilization, but they focus on infrastructure-level elasticity. Lombardi et al. [6] exploit RL in their elasticity framework as well, but the learning algorithm is only used for threshold tuning.
In [3], different RL algorithms have been compared for solving the elasticity problem for a single DSP operator in isolation, while in this work we consider whole applications.

In the rest of this paper, we first formulate the elasticity problem as an MDP in Sect. 2, and present in Sect. 3 the RL-based algorithm for learning the scaling policy; we evaluate the proposed solutions in Sect. 4, and conclude in Sect. 5.

2 Problem Formulation

In this paper, we consider the elasticity problem for a DSP application composed of N operators. Each operator is possibly replicated into a number of instances and, without loss of generality, we assume even distribution of the incoming data among the parallel instances. For each operator, an Operator Manager monitors the operator functionality, while an Application Manager supervises the whole application. The number of parallel instances used by each operator is adjusted either by its Operator Manager (decentralized adaptation) or by the Application Manager (centralized adaptation). At each decision step, for each operator we can add an instance, terminate one, or keep the current parallelism. Following a scaling decision, the operator is subject to a reconfiguration process; as the integrity of the streams and of the operator internal state must be preserved, the whole application is usually paused during the process, leading to downtime [2].

Our goal is to take reconfiguration decisions so as to minimize a long-term cost function which accounts for the downtime and for the monetary cost of running the application. The latter comprises (i) the cost of the instances allocated for the next decision period, and (ii) a penalty in case of a Service Level Agreement (SLA) violation. In particular, we consider a constraint on the application response time, defined as the maximal source-to-sink processing latency over the application DAG, so that a penalty is paid every time the response time exceeds a given threshold. In order to keep the system model simple, we consider a deployment scenario with (i) homogeneous computing resources on which the operator instances are executed, and (ii) negligible communication latency between them. We defer to future work the extension of the model to a more realistic distributed setting.

System Model. In the considered system, reconfiguration decisions are taken periodically. Therefore, we consider a slotted-time system with fixed-length intervals of length Δt, the i-th slot corresponding to the time interval [iΔt, (i+1)Δt]. We denote by k_op,i ∈ [1, K_max] the number of parallel instances of operator op at the beginning of slot i, and by λ_op,i its average input rate measured during the previous slot. Additionally, we use Λ_i to denote the overall application input rate (i.e., the total data sources emission rate). At the beginning of slot i, a decision a_i is made on whether to reconfigure each operator. We first consider a centralized model, in which the reconfiguration decisions are taken by the Application Manager; then, at the end of the section, we consider the case in which the responsibility of making scaling decisions is decentralized, and each Operator Manager acts as an independent agent. In both cases, we formalize the resulting problem as a discrete-time Markov Decision Process (MDP).
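As a purely illustrative sketch of this slotted-time interaction, the centralized variant might be structured as follows; the monitoring and reconfiguration hooks (monitor, apply_action) are hypothetical placeholders, not part of any specific DSP framework:

```python
import time

DELTA_T = 60  # slot length Delta_t, in seconds (illustrative value)

def control_loop(operators, policy, monitor, apply_action):
    """Slotted-time control loop (centralized variant).

    operators:    list of operator identifiers
    policy:       maps the observed state to a joint scaling action,
                  one of {-1, 0, +1} per operator
    monitor:      hypothetical hook returning, for the previous slot, the
                  average input rate and current parallelism of each operator
    apply_action: hypothetical hook triggering the actual scale-in/out
                  (may pause the application, i.e., cause downtime)
    """
    while True:
        rates, parallelism = monitor()                    # lambda_{op,i}, k_{op,i}
        total_rate = sum(rates[op] for op in operators)   # Lambda_i
        state = (total_rate, tuple(parallelism[op] for op in operators))
        action = policy(state)                            # decision a_i for slot i
        apply_action(action)
        time.sleep(DELTA_T)                               # wait for the next decision step
```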
Centralized Elasticity Problem. An MDP is defined by a 5-tuple ⟨S, A, p, c, γ⟩, where S is a finite set of states, A(s) a finite set of actions for each state s, p(s'|s, a) are the state transition probabilities, c(s, a) is the cost incurred when action a is executed in state s, and γ ∈ [0, 1] is a future cost discounting factor.

We define the state of the system at time i as s_i = (Λ_i, k_1,i, k_2,i, ..., k_N,i). For the sake of analysis, we discretize the arrival rate by assuming Λ_i ∈ {0, Λ̄, ..., LΛ̄}, where Λ̄ is a suitable quantum. For each state s, the action set is A(s) = A_1(s) × ··· × A_N(s), where, for each operator op, A_op(s) = {+1, −1, 0} (except for the boundary cases with minimum or maximum replication). System state transitions occur as a consequence of scaling decisions and arrival rate variations. It is easy to realize that the system dynamics comprise a stochastic component, due to the exogenous rate variation, and a deterministic component, due to the fact that, given action a and the current number of instances, we can readily determine the next number of instances. An example of a system state transition is illustrated in Fig. 1.

To each state-action pair we associate a cost c(s, a) that captures the cost of operating the system in state s and carrying out action a, including:
1. the resource cost c_res(s, a), required for running (k_op + a_op) instances for each operator op, assuming a fixed cost per instance;
2. the reconfiguration cost c_rcf(a), which accounts for the application downtime, assuming a constant reconfiguration penalty;
3. the SLA violation cost c_SLA(s, a), which captures the penalty incurred whenever the response time T(s, a) violates the threshold T_SLA.

We define the cost function c(s, a) as the weighted sum of the normalized terms:

$c(s,a) = w_{res} \frac{\sum_{op=1}^{N} (k_{op} + a_{op})}{N K_{max}} + w_{rcf} \mathbb{1}_{\{\exists op:\, a_{op} \neq 0\}} + w_{SLA} \mathbb{1}_{\{T(s,a) > T_{SLA}\}}$   (1)

where w_res, w_rcf and w_SLA, with w_res + w_rcf + w_SLA = 1, are non-negative weights.

Figure 1: Example of a state transition in the centralized MDP model. At time i, the application input rate is Λ_i and the operators run a single instance each, except for the second one, which runs two. The Application Manager picks action (0, +1, +1, 0) at time i, thus adding an instance of the second and the third operator. The resulting parallelism degrees of the operators at time i+1 are, respectively, 1, 3, 2, and 1. The input rate at time i+1 is Λ_{i+1}, which obviously does not depend on a_i.
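To make Eq. (1) concrete, a minimal sketch of the cost computation might look as follows; the response time T(s, a) is assumed to be provided by some external estimator (e.g., a queueing model) and is passed in as a plain number:

```python
def centralized_cost(k, a, T, w_res, w_rcf, w_sla, K_max, T_sla):
    """Cost c(s, a) of Eq. (1): normalized resource usage, a reconfiguration
    penalty, and an SLA-violation penalty (w_res + w_rcf + w_sla = 1).

    k, a: per-operator parallelism and scaling decisions (+1, -1, 0)
    T:    estimated response time T(s, a) of the post-action configuration
    """
    N = len(k)
    new_k = [k_op + a_op for k_op, a_op in zip(k, a)]
    c_res = sum(new_k) / (N * K_max)                      # normalized resource cost
    c_rcf = 1.0 if any(a_op != 0 for a_op in a) else 0.0  # downtime indicator
    c_sla = 1.0 if T > T_sla else 0.0                     # SLA violation indicator
    return w_res * c_res + w_rcf * c_rcf + w_sla * c_sla
```

The local cost used in the decentralized formulation below has the same structure, with a single operator and a utilization threshold replacing the response-time constraint.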
Decentralized Elasticity Problem. In the decentralized adaptation scenario, we assume that each Operator Manager independently acts on its associated operator, having only a local view of the system. We again rely on an MDP to formalize the cost minimization problem for each agent (i.e., each Operator Manager). Omitting the reference to the specific operator, we define the state at time i as the pair s_i = (λ_i, k_i), where λ_i is discretized using a suitable quantum for each operator. The action set is simply A(s) = {+1, −1, 0} (except for the boundary cases with minimum or maximum replication). Because the agents do not have a global view of the application, they can only optimize local metrics, and thus we have to formulate a new local cost function c'(s, a). We replace the SLA violation penalty with one based on the operator utilization U(s, a) and a target utilization Ū. We get:

$c'(s,a) = w_{res} \frac{k + a}{K_{max}} + w_{rcf} \mathbb{1}_{\{a \neq 0\}} + w_{util} \mathbb{1}_{\{U(s,a) > \bar{U}\}}$   (2)

where w_res, w_rcf and w_util, with w_res + w_rcf + w_util = 1, are non-negative weights.

3 Learning an Optimal Policy

A policy is a function π that associates each state s with the action a to choose. For a given policy π, let V^π(s) be the value function, i.e., the expected infinite-horizon discounted cost starting from s. It is also convenient to define the action-value function Q^π : S × A → ℝ, which is the expected discounted cost achieved by taking action a in state s and then following the policy π:

$Q^{\pi}(s,a) = c(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s'|s,a)\, V^{\pi}(s'), \quad \forall s \in \mathcal{S}$   (3)

It is easy to realize that the value function V and the Q function are closely related, in that V^π(s) = min_{a∈A(s)} Q^π(s, a), ∀s ∈ S. More importantly, the knowledge of the Q function is fundamental in that it directly provides the associated policy: for a given function Q, the corresponding policy is π(s) = arg min_{a∈A(s)} Q(s, a), ∀s ∈ S. We search for the optimal MDP policy π*, which satisfies the Bellman optimality equation:

$V^{\pi^*}(s) = \min_{a \in \mathcal{A}(s)} \left\{ c(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s'|s,a)\, V^{\pi^*}(s') \right\}, \quad \forall s \in \mathcal{S}$   (4)

In the ideal situation, we have full knowledge of the system, and we can directly compute π* using the Value Iteration algorithm [7]. In more realistic cases, we have only partial knowledge of the underlying system model (e.g., the workload distribution is usually unknown). We can then resort to Reinforcement Learning (RL) approaches, which are characterized by the basic principle of learning the optimal policy by direct interaction with the system. In particular, we consider a model-based RL algorithm that, at each time step, improves its estimates of the unknown system parameters and performs an iteration of the Value Iteration algorithm (see Algorithm 1). Simpler model-free RL algorithms like Q-learning have been shown to achieve bad performance even on smaller tasks [3].
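For reference, the ideal full-knowledge case mentioned above can be solved with a few lines of tabular Value Iteration; the sketch below assumes enumerable states and actions, with the transition probabilities p and costs c available as lookup tables (illustrative data structures, not tied to a specific implementation):

```python
def value_iteration(states, actions, p, c, gamma=0.99, eps=1e-6):
    """Tabular Value Iteration for the discounted-cost MDP of Eq. (4).

    states:  iterable of states s
    actions: actions[s] -> admissible actions A(s)
    p:       p[(s, a)] -> dict {s_next: probability}
    c:       c[(s, a)] -> immediate cost
    """
    def q_value(s, a, V):
        # expected discounted cost of taking a in s and acting optimally afterwards
        return c[(s, a)] + gamma * sum(prob * V[s2] for s2, prob in p[(s, a)].items())

    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = min(q_value(s, a, V) for a in actions[s])  # Bellman backup (Eq. 4)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < eps:      # stop once the value function has (approximately) converged
            break
    # extract the greedy (cost-minimizing) policy
    policy = {s: min(actions[s], key=lambda a: q_value(s, a, V)) for s in states}
    return V, policy
```

Algorithm 1 reuses the same Bellman backup, but replaces p and c with on-line estimates.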
Algorithm 1: RL-based Elastic Control Algorithm
  Initialize the action-value function Q
  loop
    choose an action a_i (based on the current estimates of Q)
    observe the next state s_{i+1} and the incurred cost c_i
    update the estimates of the unknown system parameters
    for all s ∈ S do
      for all a ∈ A(s) do
        Q_i(s, a) ← ĉ_i(s, a) + γ Σ_{s'∈S} p̂(s'|s, a) min_{a'∈A(s')} Q_{i−1}(s', a')
      end for
    end for
  end loop

We first consider the case in which the operator response time model is known, and let the algorithm learn the state transition probabilities. In order to estimate p(s'|s, a), it suffices to estimate the input rate transition probabilities P[λ_{i+1} = λ' | λ_i = λ] (to simplify notation, we use λ to denote the input rate; in the centralized model, we use the same estimates for the total application input rate Λ), since the dynamics related to the number of instances are known and deterministic. Hereafter, since λ takes values in a discrete set, we write for short P_{j,j'} = P[λ_{i+1} = j'λ̄ | λ_i = jλ̄], with j, j' ∈ {0, ..., L}. Let n_{i,jj'} be the number of times the arrival rate changes from jλ̄ to j'λ̄ in the interval {1, ..., i}, j, j' ∈ {0, ..., L}. At time i, the transition probability estimates are

$\hat{P}_{j,j'} = \frac{n_{i,jj'}}{\sum_{l=0}^{L} n_{i,jl}}$   (5)

If we remove the assumption of a known response time model, we have to estimate the cost c(s, a) as well, because we can no longer predict the SLA/utilization violations. Thus, we split c(s, a) and c'(s, a), respectively defined in (1) and (2), into known and unknown terms: the known term c_k(s, a) accounts for the reconfiguration cost and the resource cost, whereas the unknown term c_u(s, a) represents the SLA (or utilization) violation penalty. We use a simple exponentially weighted average to estimate the unknown cost:

$\hat{c}_{u,i}(s,a) \leftarrow (1 - \alpha)\, \hat{c}_{u,i-1}(s,a) + \alpha\, c_{u,i}$   (6)

where c_{u,i} = w_SLA (or w_util) if a violation occurred at time i, and 0 otherwise.

As regards the complexity of the algorithm, the size of the state-action space is critical, since each learning iteration requires O(|S|²|A|²) operations. We observe that in the centralized model |S| and |A| grow exponentially with the number of operators N, whereas they are not influenced by N in the decentralized model.
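As an illustration of the estimation steps of Algorithm 1, a minimal sketch of the updates of Eqs. (5) and (6) might look as follows (the class is a hypothetical helper, not taken from any existing framework; the uniform fallback before the first observation is an assumption of this sketch):

```python
class ModelEstimator:
    """On-line estimates of the unknown MDP parameters used in Algorithm 1."""

    def __init__(self, L, alpha=0.1):
        self.L = L                      # number of discretization levels for the rate
        self.alpha = alpha              # smoothing factor of Eq. (6)
        # transition counts n_{jj'} between discretized rate levels
        self.counts = [[0] * (L + 1) for _ in range(L + 1)]
        self.c_unknown = {}             # estimated violation penalty, c_u(s, a)

    def observe_rate_transition(self, j, j_next):
        """Record a transition of the discretized input rate from level j to j_next."""
        self.counts[j][j_next] += 1

    def rate_transition_prob(self, j, j_next):
        """Estimate P_{j,j'} as in Eq. (5); uniform fallback before any observation."""
        total = sum(self.counts[j])
        return self.counts[j][j_next] / total if total > 0 else 1.0 / (self.L + 1)

    def update_unknown_cost(self, s, a, violation, w_penalty):
        """EWMA update of Eq. (6); the violation penalty is w_SLA (or w_util)."""
        c_now = w_penalty if violation else 0.0
        old = self.c_unknown.get((s, a), 0.0)
        self.c_unknown[(s, a)] = (1 - self.alpha) * old + self.alpha * c_now
```

At each decision step, these estimates are combined with the known cost terms and a single sweep of the Bellman backup of Algorithm 1 is performed.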
4 Evaluation

We evaluate the presented models by simulation, and compare the policies learned through RL to the optimal MDP one. In order to explicitly solve the MDP, we need a state transition probability matrix, which is not available in practical scenarios. Thus, for evaluation, we consider a dataset made available by Chris Whong (http://chriswhong.com/open-data/foil_nyc_taxi/) that contains information about taxi activity, and extract the state transition probabilities by applying (5). We then evaluate the proposed solutions on a new workload, generated according to those probabilities. For simplicity, we consider a pipeline application, composed of a data source and up to 4 operators. Each operator runs at most K_max = 5 instances, each behaving as an M/D/1 queue with service rate μ_op. For evaluation, we consider a scenario with slightly different service rates, and set μ_1 = 3.7, μ_2 = μ_4 = 3.3, and μ_3 = 2.7 tuple/s. Because of space limitations, we defer the evaluation of real-world topologies to future work.

We consider Δt = 1 min, and aggregate the events in the dataset over one-minute windows. We assume Λ_i = λ_o,i, ∀o, discretized with λ̄ = 20 tuple/min. For the cost function, we set w_SLA = w_util = w_rcf = 0.4, w_res = 0.2, T_SLA = 650 ms, and Ū ∈ {0.6, 0.7, 0.8}. As regards the learning algorithm, we set γ = 0.99 and α = 0.1. We compare the results obtained by solving the MDP to those achieved by the centralized RL algorithm (with and without the known response time model) and by the decentralized solution.

Table 1: Results for a 3-operator application. The "+" sign denotes knowledge of the response time model.

Scaling Policy             Avg. Cost   SLA Violations   Reconf.   Avg. Instances
Centralized MDP            0.163       8903             13882     10.95
Centralized RL+            0.164       9505             13684     10.94
Centralized RL             0.167       15681            14579     10.79
Decentralized (Ū = 0.6)    0.178       3639             30104     11.46
Decentralized (Ū = 0.7)    0.173       17111            30404     10.30
Decentralized (Ū = 0.8)    0.205       79670            29681     9.15

In Table 1 we report the results for a 3-operator topology. As expected, the minimum average cost is achieved by solving the MDP; interestingly, the centralized RL solution incurs an almost negligible performance degradation, and the gap with the decentralized approach is limited as well. However, we note that the performance of the decentralized solution depends on the target utilization Ū, which still has to be set manually in our work. Setting too high (or too low) a value results in a different trade-off between SLA violations and used instances, with negative effects on the overall cost as well. The decentralized solution shows a higher number of reconfigurations, due to the lack of coordination between the agents.

Figure 2: Average cost during one simulated year for a 3-operator application (a), and for a varying number of operators (b). The "+" sign denotes knowledge of the response time model. (Panel (a): average cost vs. time, in days; panel (b): average cost vs. number of operators.)

As illustrated in Fig. 2a, the convergence speed of the different solutions is similar, except for the centralized RL algorithm: in the absence of the response time model, it is indeed significantly slower to learn than the other solutions. When the response time model is known, the algorithm converges much faster, despite the large state-action space. We also compare the decentralized approach to the MDP while varying the number of operators in the application. As shown in Fig. 2b, the cost gap between the two solutions slightly increases as the application gets more complex. However, we observe that the decentralized algorithm does not suffer from scalability issues as the number of operators increases, whereas solving the centralized problem quickly becomes impractical.
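As a closing note on the simulation setup, one possible instantiation of the response-time model exploited above, consistent with the M/D/1 assumption, is sketched below; the specific mean-value formula and the summation of per-operator latencies along the pipeline are assumptions of this sketch:

```python
def mean_response_time(arrival_rate, parallelism, service_rates):
    """Estimate the end-to-end response time of a pipeline of replicated operators.

    Each instance is modeled as an M/D/1 queue (mean waiting time
    W = rho / (2 * mu * (1 - rho))), with the operator input evenly split
    among its instances; per-operator latencies are summed, which matches
    the source-to-sink definition of response time for a linear topology.

    arrival_rate:  tuples/s offered to each operator of the pipeline
    parallelism:   number of instances k_op of each operator
    service_rates: service rate mu_op (tuples/s) of a single instance
    """
    total = 0.0
    for k, mu in zip(parallelism, service_rates):
        lam = arrival_rate / k                       # even split among the k instances
        rho = lam / mu
        if rho >= 1.0:
            return float("inf")                      # unstable configuration
        waiting = rho / (2.0 * mu * (1.0 - rho))     # M/D/1 mean waiting time
        total += waiting + 1.0 / mu                  # queueing + deterministic service
    return total

# Hypothetical usage with the service rates of Sect. 4 (3.7, 3.3, 2.7, 3.3 tuple/s):
# mean_response_time(5.0, [2, 2, 3, 2], [3.7, 3.3, 2.7, 3.3])
```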
5 Conclusion

In this paper we have formalized the elasticity problem for DSP applications as a Markov Decision Process, and proposed a Reinforcement Learning based solution to cope with the limited knowledge of the system model. Our numerical evaluation shows promising results even for a fully decentralized solution which, leveraging the available knowledge about the system, does not suffer from the extremely slow convergence of model-free RL algorithms. In practical scenarios, we could also combine the proposed solution with a simple threshold-based policy to be used at the beginning, while the agents learn a good policy to be adopted afterwards.

For future work, our goal is twofold. We plan to improve the decentralized learning algorithm by exploring RL techniques specifically targeted at multi-agent systems. At the same time, we will extend the model to cope with a more complex and realistic scenario, considering, e.g., resource heterogeneity and network latency in distributed deployments.

References

1. Assuncao, M.D., da Silva Veith, A., Buyya, R.: Distributed data stream processing and edge computing: A survey on resource elasticity and future directions. Journal of Network and Computer Applications 103, 1–17 (2018)
2. Cardellini, V., Lo Presti, F., Nardelli, M., Russo Russo, G.: Optimal operator deployment and replication for elastic distributed data stream processing. Concurrency and Computation: Practice and Experience (2017), http://dx.doi.org/10.1002/cpe.4334
3. Cardellini, V., Lo Presti, F., Nardelli, M., Russo Russo, G.: Auto-scaling in data stream processing applications: A model based reinforcement learning approach. In: Proc. of InfQ '17 (in conjunction with VALUETOOLS '17) (2018, to appear)
4. Heinze, T., Pappalardo, V., Jerzak, Z., Fetzer, C.: Auto-scaling techniques for elastic data stream processing. In: Proc. IEEE ICDEW '14. pp. 296–302 (2014)
5. Hirzel, M., Soulé, R., Schneider, S., Gedik, B., Grimm, R.: A catalog of stream processing optimizations. ACM Comput. Surv. 46(4) (Mar 2014)
6. Lombardi, F., Aniello, L., Bonomi, S., Querzoni, L.: Elastic symbiotic scaling of operators and resources in stream processing systems. IEEE Transactions on Parallel and Distributed Systems PP(99), 1–1 (2017)
7. Sutton, R.S., Barto, A.G.: Reinforcement learning: An introduction. MIT Press, Cambridge (1998)