A Deep Q-learning Scaling Policy for Elastic Application Deployment

Fabiana Rossi
Department of Civil Engineering and Computer Science Engineering, University of Rome Tor Vergata, Italy
f.rossi@ing.uniroma2.it

Abstract. The ability of cloud computing to provide resources on demand encourages the development of elastic applications. Differently from the popular threshold-based solutions used to drive elasticity, we aim to design a flexible approach that can customize the adaptation policy without the need to manually tune various configuration knobs. In this paper, we propose two Reinforcement Learning (RL) solutions, Q-learning and Deep Q-learning, for controlling the application elasticity. Although Q-learning is the most popular approach, it may suffer from a long learning phase. To improve scalability and identify better adaptation policies, we propose Deep Q-learning, a model-free RL solution that uses a deep neural network to approximate the system dynamics. Using simulations, we show the benefits and flexibility of Deep Q-learning with respect to Q-learning in scaling applications.

Keywords: Deep Q-learning · Elasticity · Reinforcement Learning · Self-adaptive systems

1 Introduction

The dynamism of working conditions (e.g., the incoming workload) calls for an elastic application deployment that can be adapted at run-time so as to meet stringent Quality of Service (QoS) requirements. Cloud providers that support multi-component applications (e.g., Amazon, Google) usually rely on static thresholds on system-oriented metrics to carry out the adaptation of each application component. As shown in [11,12], manually tuning such scaling thresholds is challenging, especially when critical values must be specified on system-oriented metrics while the application expresses its requirements in terms of user-oriented metrics (e.g., response time, throughput, cost). Differently from the popular static threshold-based approaches, we aim to design a flexible policy that adapts the application deployment according to user-defined goals, without the need to manually tune various configuration knobs.

In this paper, we use Reinforcement Learning (RL) to adapt the application deployment. RL allows the user to express what they aim to obtain, instead of how it should be obtained (as required by threshold-based policies). Most of the existing works rely on model-free RL algorithms, e.g., Q-learning, to drive the application deployment (e.g., [2,5]). However, one of the main issues with Q-learning is its possibly long learning phase, which is especially experienced when the number of system states increases. An approach to speed up the learning process is to provide the agent with system knowledge; by exploiting it, the agent can more easily find a suitable trade-off between system- and user-oriented metrics (i.e., resource utilization and target response time). Therefore, together with Q-learning, we propose Deep Q-learning [8], which uses a deep neural network to approximate the system dynamics and to drive the application elasticity.
Using simulation, we demonstrate the advantages of the Deep Q-learning solution, which learns a suitable scaling policy while meeting QoS requirements expressed in terms of a target application response time (i.e., Rmax).

2 Related Work

The existing elasticity solutions rely on a wide set of methodologies, which we classify into the following categories: mathematical programming, control theory, queuing theory, threshold-based, and machine learning solutions. The mathematical programming approaches exploit methods from operational research to solve the application deployment problem (e.g., [13,21]). The main drawback of these solutions is scalability: since the deployment problem is NP-hard, more efficient solutions are needed. As surveyed in [15], few solutions use control theory to change the replication degree of application containers (e.g., [3]). The critical point of the control-theoretic approaches is the requirement of a good system model, which can be difficult to formulate. The key idea of queuing theory is to model the application as a queuing system with inter-arrival and service times having general statistical distributions (e.g., [4,10]). Nevertheless, queuing theory often requires approximating the system behavior, discouraging its adoption in a real environment. Most of the existing solutions exploit best-effort threshold-based policies, which rely on static thresholds to adapt the application deployment at run-time (e.g., [7,16]). Although a threshold-based scaling policy can be easily implemented, it is a best-effort approach that moves the complexity from determining the reconfiguration strategy to selecting the critical values that act as thresholds.

In the last few years, machine learning has become a widespread approach to solve the application deployment problem at run-time. RL is a machine learning technique by which an agent learns how to make good (scaling) decisions through a sequence of interactions with the environment [17]. After executing an adaptation action in the monitored system state, the RL agent experiences a cost that contributes to learning how good the performed action was. The obtained cost leads to an update of a lookup table that stores the estimate of the long-term cost for each state-action pair (i.e., the Q-function). Many solutions consider the classic model-free RL algorithms (e.g., [9,18]), which however suffer from a slow convergence rate. To tackle this issue, different model-based RL approaches have been proposed, e.g., [14,19]; they use a model of the system to drive the action exploration and speed up the learning phase. Although model-based RL approaches can overcome the slow convergence rate of model-free solutions, they can suffer from poor scalability in systems with a large state space, because the lookup table has to store a separate value for each state-action pair. An approach to overcome this issue consists in approximating the system behavior, so that the agent can explore a reduced number of system configurations. By integrating deep neural networks into Q-learning, Deep Q-learning has been widely applied to approximate the system dynamics in a variety of domains, e.g., traffic offloading [1]. However, to the best of our knowledge, it has so far been poorly applied in the context of application elasticity.
Differently from all the above contributions, in this paper we propose an application scaling policy based on Deep Q-learning and compare it against Q-learning.

3 RL-based Scaling Policy

RL is a branch of machine learning that refers to a collection of trial-and-error methods by which an agent should prefer actions that it found to be effective in the past (exploitation); however, to discover such actions, it also has to try new actions (exploration). At each discrete time step i, according to the monitored metrics, the RL agent determines the application state and updates the expected long-term cost (i.e., the Q-function). We define the application state as s = (k, u), where k is the number of application instances and u is the monitored CPU utilization. We denote by S the set of all the application states. We assume that k ∈ {1, 2, ..., Kmax}; since the CPU utilization u is a real number, we discretize it as u ∈ {0, ū, ..., Lū}, where ū is a suitable quantum and L ∈ N is such that L · ū = 1. For each state s ∈ S, we define the set of possible adaptation actions as A(s) ⊆ {−1, 0, 1}, where +1 is the scale-out action, −1 the scale-in action, and 0 the do-nothing decision. We observe that not all the actions are available in every application state, due to the lower and upper bounds on the number of application instances (i.e., 1 and Kmax, respectively). According to an action selection policy, the RL agent identifies the scaling action a to be performed in state s. The execution of a in s leads to the transition to a new application state s′ and to the payment of an immediate cost. We define the immediate cost c(s, a, s′) as the weighted sum of different normalized terms, such as the performance penalty cperf and the resource cost cres. Formally, we have:

c(s, a, s′) = wperf · cperf + wres · cres    (1)

where wperf and wres, with wperf + wres = 1, are non-negative weights that allow us to express the relative importance of each cost term. The performance penalty is paid whenever the average application response time exceeds the target value Rmax. The resource cost is proportional to the number of application instances. We observe that the formulation of the immediate cost function c(s, a, s′) is general enough to be easily customized with other QoS requirements.

The received immediate cost contributes to updating the Q-function. The Q-function consists of the terms Q(s, a), which represent the expected long-term cost that follows the execution of action a in state s. The existing RL policies differ in how they update the Q-function and how they select the adaptation action to be performed (i.e., the action selection policy) [17]. In [14], for example, we proposed a model-based RL approach that enriches the RL agent with a model of the system to drive the exploration actions and speed up the learning phase. Since determining the system model can be a non-trivial task, in this paper we consider two model-free RL solutions to adapt the application deployment. First, we consider the simple model-free Q-learning (QL) algorithm, which uses a table (i.e., the Q-table) to store the Q-value of each state-action pair. The Q-table allows the agent to store the real experience without approximation. However, this approach may suffer from a slow convergence rate when the number of state-action pairs increases. Then, to tackle this issue, we present a Deep Q-learning (DQL) approach that combines Q-learning with deep neural networks.
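Before detailing the two agents, the following minimal Python sketch makes the formulation above concrete: it encodes the state discretization, the admissible scaling actions, and the immediate cost of Eq. 1. All names, as well as the specific shapes of cperf and cres (a binary Rmax-violation penalty and a replica count normalized by Kmax), are illustrative assumptions and not the implementation used in the paper.

```python
# Minimal sketch of the MDP formulation of Section 3 (illustrative names,
# not the implementation used in the experiments).

K_MAX = 10        # maximum number of application instances (Kmax)
U_QUANTUM = 0.1   # CPU utilization quantum (u-bar)

def discretize_state(k, cpu_utilization):
    """Map the raw observation to the discrete state s = (k, u)."""
    u = round(min(max(cpu_utilization, 0.0), 1.0) / U_QUANTUM) * U_QUANTUM
    return (k, round(u, 2))

def admissible_actions(state):
    """A(s) is a subset of {-1, 0, +1}: scale-in, do nothing, scale-out."""
    k, _ = state
    actions = [0]
    if k > 1:
        actions.append(-1)      # scale-in only if more than one instance
    if k < K_MAX:
        actions.append(+1)      # scale-out only below Kmax
    return actions

def immediate_cost(next_state, response_time, r_max=0.015,
                   w_perf=0.5, w_res=0.5):
    """Weighted immediate cost of Eq. 1: performance penalty plus resource cost."""
    k, _ = next_state
    c_perf = 1.0 if response_time > r_max else 0.0   # penalty paid on an Rmax violation
    c_res = k / K_MAX                                # proportional to the number of replicas
    return w_perf * c_perf + w_res * c_res
```

Both QL and DQL operate on exactly this state, action, and cost interface; they differ only in how they estimate Q(s, a).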
In DQL, the neural network approximates the Q-function with a non-linear function; in such a way, the agent can directly compute Q(s, a) from s and a, instead of performing a Q-table lookup. By using a Q-function approximation, the RL agent does not need to explore all the state-action pairs before learning a good adaptation policy.

Q-learning. At time i, the QL agent selects the action a to perform in state s using an ε-greedy policy on Q(s, a); the application transitions to s′ and experiences an immediate cost c. The ε-greedy policy selects the best known action for a particular state (i.e., a = argmin_{a∈A(s)} Q(s, a)) with probability 1 − ε, whereas it favors the exploration of sub-optimal actions with (low) probability ε. At the end of time slot i, Q(s, a) is updated using a simple weighted average:

Q(s, a) ← (1 − α) Q(s, a) + α [ c + γ min_{a′∈A(s′)} Q(s′, a′) ]    (2)

where α, γ ∈ [0, 1] are the learning rate and the discount factor, respectively.

Deep Q-learning. DQL uses a multi-layered neural network, called Q-network, to approximate the Q-function. For each time slot i, the DQL agent observes the application state and selects an adaptation action using an ε-greedy policy on the estimated Q-values, as Q-learning does. Note that DQL is model-free: it solves the RL task directly using samples, without explicitly modeling the system dynamics. In a given state s, the Q-network outputs a vector of action values Q(s, ·, φ), where φ are the network parameters. By approximating the Q-function, the RL agent can explore a reduced number of system configurations before learning a good adaptation policy. At the end of each time slot i, the Q-network is updated by performing a gradient-descent step on (yi − Q(s, a, φi))² with respect to the network parameters φi, where yi is the estimated long-term cost, defined as yi = c + γ · min_{a′} Q(s′, a′, φi), and γ is the discount factor. When only the current experience (s, a, c, s′) is considered, this approach is too slow for practical scenarios; moreover, it is unstable due to the correlations existing in the sequence of observations. To overcome these issues, we consider a revised DQL algorithm that uses a replay buffer and a separate target network to compute yi [8]. To perform experience replay, the agent stores its experience in a buffer with finite capacity; a mini-batch of experience is drawn uniformly at random from the replay buffer to remove correlations in the observation sequence and to smooth over changes in the data distribution. In the classic DQL, the same Q-network is used both to select and to evaluate an action. This can lead to selecting overestimated values, resulting in overoptimistic estimates. To prevent this, two networks are used (i.e., an on-line and a target network) and two value functions are learned: the on-line network is used to determine the greedy policy, and the target network to determine its value. The target network parameters are updated to the on-line network values only every τ steps and are held fixed between individual updates.

4 Results

We evaluate the proposed deployment adaptation solutions using simulations. Without loss of generality, at each discrete time step i, we model the application as an M/M/ki queue, where ki is the number of application replicas. We set the service rate μ to 120 requests/s. As shown in Fig. 1, the application receives a varying number of requests; it follows the workload of a real distributed application [6].
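Under this M/M/ki assumption, the average response time observed at each time step follows in closed form from the arrival rate, the service rate, and the current number of replicas. The sketch below shows one standard way to compute it (the Erlang-C formula); it is meant only to illustrate the queueing model, with an assumed helper name, and is not the simulator code used for the experiments.

```python
# Illustrative computation of the mean response time of an M/M/k queue
# (Erlang-C formula); not the simulator used in the paper.
from math import factorial

def mmk_response_time(arrival_rate, service_rate=120.0, k=1):
    """Mean response time (s) of an M/M/k queue with k replicas."""
    a = arrival_rate / service_rate          # offered load
    rho = a / k                              # per-server utilization
    if rho >= 1.0:
        return float("inf")                  # unstable: queue grows without bound
    # Erlang-C: probability that an arriving request has to wait
    last = (a ** k) / (factorial(k) * (1.0 - rho))
    p_wait = last / (sum(a ** n / factorial(n) for n in range(k)) + last)
    waiting_time = p_wait / (k * service_rate - arrival_rate)
    return waiting_time + 1.0 / service_rate # queueing delay + service time

# Example: 500 requests/s served by 5 replicas at 120 requests/s each
print(mmk_response_time(500.0, k=5) * 1000, "ms")   # roughly 14.5 ms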
The application expresses its QoS requirement in terms of a target response time Rmax = 15 ms. The RL algorithms use the following parameters: learning rate α = 0.1 and discount factor γ = 0.99. We discretize the application state with Kmax = 10 and ū = 0.1. To update the application deployment, QL and DQL use an ε-greedy action selection policy, with ε = 0.1. DQL uses a replay memory with a capacity of 50 observations and a batch size of 30; the target Q-network update frequency is τ = 5 time units. We use the Deeplearning4j library (https://deeplearning4j.org/) to implement the neural networks. Correctly configuring the Q-network is an empirical task, which requires some effort and several preliminary evaluations. In particular, we use ReLU as the neuron activation function; due to its non-linear behavior, it is one of the most commonly used activation functions. To initialize the Q-network weights, we use the Xavier method [20]. To prevent the weights from vanishing or exploding during network propagation, this method scales the weight distribution on a layer-by-layer basis: it uses a normal distribution with zero mean and a standard deviation scaled according to the number of the layer's input and output neurons. The Q-network architecture is fully-connected with 4 layers having {2, 15, 15, 3} neurons (i.e., there are 2 hidden layers).

Figure 1: Workload used for the reference application (data rate, in requests/s, over the simulation time).

Table 1: Application performance under scaling policies.

Elasticity Policy | Configuration | Rmax violations (%) | Average CPU utilization (%) | Average number of replicas | Median of response time (ms)
QL | wperf = 1, wres = 0 | 3.15 | 56.20 | 4.17 | 9.51
QL | wperf = 0.5, wres = 0.5 | 13.26 | 51.27 | 5.13 | 8.81
QL | wperf = 0, wres = 1 | 42.27 | 65.56 | 3.90 | 11.88
DQL | wperf = 1, wres = 0 | 1.05 | 36.85 | 6.63 | 8.38
DQL | wperf = 0.5, wres = 0.5 | 39.23 | 66.64 | 3.58 | 11.11
DQL | wperf = 0, wres = 1 | 88.51 | 89.18 | 1.58 | +∞
DQL with pre-trained network | wperf = 1, wres = 0 | 1.39 | 35.54 | 7.41 | 8.35
DQL with pre-trained network | wperf = 0.5, wres = 0.5 | 31.91 | 58.53 | 4.30 | 9.06
DQL with pre-trained network | wperf = 0, wres = 1 | 83.48 | 86.12 | 1.71 | +∞

Table 1 summarizes the experimental results, including also the application performance obtained with a pre-trained DQL. We can see that the application performance differs when different weights are used in the cost function (Eq. 1). We first consider the set of weights wperf = 1 and wres = 0: in this case, optimizing the application response time is more important than saving resources. As shown in Fig. 2b, the QL solution often changes the application deployment by performing scaling operations; moreover, the application response time exceeds Rmax for 3.15% of the time. Conversely, taking advantage of the approximated system knowledge, the DQL solution learns a better elasticity policy that successfully controls the application deployment (Fig. 2a): it registers 1.05% of Rmax violations and a median application response time lower than the target (i.e., 8.38 ms).

Figure 2: Application performance using the weights wres = 0 and wperf = 1: (a) Deep Q-learning, (b) Q-learning. Each panel reports the response time (ms), the CPU utilization (%), and the number of replicas over the simulation time.
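Before moving to the other weight configurations, the DQL agent of Section 3 and the configuration above (replay capacity 50, batch size 30, τ = 5, ε = 0.1, γ = 0.99, a fully-connected 2-15-15-3 ReLU network with Xavier initialization) can be put together as in the following sketch. It is a plain-Python stand-in for the Deeplearning4j implementation used in the experiments: the learning rate, the per-sample SGD update over the sampled mini-batch, and all function names are assumptions.

```python
# Sketch of the DQL update: epsilon-greedy action selection, a replay buffer,
# and a target network synced every tau steps. Illustrative only.
import random
import numpy as np

GAMMA, EPSILON, LR = 0.99, 0.1, 0.01     # LR is an assumed value
CAPACITY, BATCH, TAU = 50, 30, 5

def xavier(n_in, n_out, rng):
    # Normal distribution, zero mean, std scaled by the layer's fan-in and fan-out
    return rng.normal(0.0, np.sqrt(2.0 / (n_in + n_out)), size=(n_out, n_in))

class QNetwork:
    def __init__(self, sizes=(2, 15, 15, 3), seed=0):
        rng = np.random.default_rng(seed)
        self.W = [xavier(i, o, rng) for i, o in zip(sizes[:-1], sizes[1:])]
        self.b = [np.zeros(o) for o in sizes[1:]]

    def forward(self, x):
        h = [np.asarray(x, dtype=float)]
        for l, (W, b) in enumerate(zip(self.W, self.b)):
            z = W @ h[-1] + b
            h.append(np.maximum(z, 0.0) if l < len(self.W) - 1 else z)  # ReLU on hidden layers
        return h                                       # h[-1] holds Q(s, .)

    def sgd_step(self, x, action, target):
        h = self.forward(x)
        delta = np.zeros_like(h[-1])
        delta[action] = 2.0 * (h[-1][action] - target)  # gradient of (Q(s,a) - y)^2
        for l in reversed(range(len(self.W))):
            gW, gb = np.outer(delta, h[l]), delta
            delta = self.W[l].T @ delta
            if l > 0:
                delta *= (h[l] > 0)                     # ReLU derivative
            self.W[l] -= LR * gW
            self.b[l] -= LR * gb

online, target_net = QNetwork(), QNetwork()
replay, step = [], 0

def select_action(state, admissible):
    # epsilon-greedy on the estimated long-term costs (lower is better);
    # outputs {0,1,2} correspond to the actions {-1, 0, +1}
    if random.random() < EPSILON:
        return random.choice(admissible)
    q = online.forward(state)[-1]
    return min(admissible, key=lambda a: q[a + 1])

def observe(state, action, cost, next_state, next_admissible):
    global step
    replay.append((state, action, cost, next_state, next_admissible))
    if len(replay) > CAPACITY:
        replay.pop(0)                                   # finite-capacity replay buffer
    for s, a, c, s1, adm1 in random.sample(replay, min(BATCH, len(replay))):
        q_next = target_net.forward(s1)[-1]             # target network computes y
        y = c + GAMMA * min(q_next[a1 + 1] for a1 in adm1)
        online.sgd_step(s, a + 1, y)
    step += 1
    if step % TAU == 0:                                 # sync target network every tau steps
        target_net.W = [W.copy() for W in online.W]
        target_net.b = [b.copy() for b in online.b]
```

The QL agent can be obtained from the same skeleton by replacing the Q-network with a lookup table indexed by (s, a) and applying the update of Eq. 2.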
We now consider the case when saving resources is more important than meeting the Rmax bound, i.e., wres = 1 and wperf = 0. Intuitively, the RL agent should learn how to improve resource utilization at the expense of a higher application response time (i.e., one that exceeds Rmax). Table 1 and Fig. 3 show that, in general, DQL performs better than QL in terms of resource usage. DQL registers 89.18% of resource utilization while running the application with 1.58 replicas on average, which is very close to the lowest amount of resources assignable to the application. Run-time adaptations are also avoided. As a consequence, the application is overloaded and the resulting median response time is unbounded. Conversely, the QL solution struggles to find a stable configuration: it identifies an adaptation policy that runs the application using, on average, 3.90 instances. Its average resource usage is lower than that of DQL (65.56% and 89.18%, respectively), as is the percentage of Rmax violations.

Figure 3: Application performance using the weights wres = 1 and wperf = 0: (a) Deep Q-learning, (b) Q-learning. Each panel reports the response time (ms), the CPU utilization (%), and the number of replicas over the simulation time.

Figure 4: Application performance using the weights wres = 0.5 and wperf = 0.5: (a) Deep Q-learning, (b) Q-learning. Each panel reports the response time (ms), the CPU utilization (%), and the number of replicas over the simulation time.

Besides the weight configurations at the opposite ends, we can obtain a wide set of adaptation strategies that differ in the relative importance of the two deployment goals. In Table 1 we report a simple case, where we set wperf = wres = 0.5. To visualize how the two RL policies update the application deployment, we report in Fig. 4 the application behavior during the whole experiment, when we want to optimize the performance while avoiding resource wastage (wperf = wres = 0.5). Intuitively, the RL agent has to find a trade-off between the resource usage and the number of Rmax violations. Since the neural network approximates the Q-function with a non-linear function, DQL can explore a reduced number of system configurations before learning a good adaptation policy. On average, it runs the application with 3.58 replicas, registering 66.64% of CPU utilization. The median application response time is 11.11 ms, with about 39% of Rmax violations. Although we could pre-train the Q-network to further improve the learned DQL policy (also mitigating the initial exploration phase), we observe that the obtained results are already remarkable, considering that DQL is model-free. Conversely, when QL updates the application deployment, it continuously performs scaling actions, meaning that QL is still exploring the best actions to perform. This behavior is also reflected in the number of application instances, whose average value is greater than that used in the wperf = 1 configuration.
Being model-free and storing experience without approximation, QL cannot quickly learn a suitable adaptation strategy for the intermediate cost weight configurations during the experiment.

Discussion. In this paper, we evaluated QL and DQL to adapt the application deployment at run-time. First, we showed the flexibility provided by an RL-based solution for updating the application deployment. By correctly defining the relative importance of the deployment objectives through the cost function weights in Eq. 1, the RL agent can learn a suitable application deployment strategy accordingly. Very different application behaviors can be obtained when we aim to optimize the application response time, the resource saving, or a combination thereof. Second, we showed that a DQL approach takes advantage of the approximate system knowledge and outperforms QL, especially when we pre-train the Q-network. We observe that, although we do not need to define the system model as in a model-based approach, DQL introduces the effort of defining a suitable Q-network architecture; this is an empirical process that may require a large number of preliminary experiments and trial-and-error repetitions.

5 Conclusion

Most policies for scaling applications resort to threshold-based heuristics, which require expressing how specific goals should be achieved. In this paper, aiming to design a more flexible solution, we have proposed Q-learning and Deep Q-learning policies for controlling the application elasticity. Relying on a simulation-based evaluation, we have shown the benefits of the proposed RL-based approaches. Deep Q-learning exploits deep neural networks to approximate the system dynamics, estimated through system interactions. The deep neural network speeds up the learning phase, improving the application performance; however, modeling the neural network architecture can be challenging. As future work, we plan to further investigate RL approaches for elasticity. We will investigate more sophisticated techniques for improving the convergence speed of the learning process (e.g., by leveraging Bayesian decision trees or function approximation). Moreover, we plan to extend our model by explicitly considering multiple system-oriented metrics within the adaptation policies.

References

1. Alam, M.G.R., Hassan, M.M., Uddin, M.Z., Almogren, A., Fortino, G.: Autonomic computation offloading in mobile edge for IoT applications. Future Gener. Comput. Syst. 90, 149–157 (2019)
2. Arabnejad, H., Pahl, C., Jamshidi, P., Estrada, G.: A comparison of reinforcement learning techniques for fuzzy cloud auto-scaling. In: Proc. of IEEE/ACM CCGrid '17. pp. 64–73 (2017)
3. Baresi, L., Guinea, S., Leva, A., Quattrocchi, G.: A discrete-time feedback controller for containerized cloud applications. In: Proc. of ACM SIGSOFT FSE '16. pp. 217–228. ACM (2016)
4. Gias, A.U., Casale, G., Woodside, M.: Atom: Model-driven autoscaling for microservices. In: Proc. of IEEE ICDCS '19. pp. 1994–2004 (2019)
5. Horovitz, S., Arian, Y.: Efficient cloud auto-scaling with SLA objective using Q-learning. In: Proc. of IEEE FiCloud '18. pp. 85–92 (2018)
6. Jerzak, Z., Ziekow, H.: The DEBS 2015 grand challenge. In: Proc. of ACM DEBS '15. pp. 266–268 (2015)
7. Kwan, A., Wong, J., Jacobsen, H., Muthusamy, V.: Hyscale: Hybrid and network scaling of dockerized microservices in cloud data centres. In: Proc. of IEEE ICDCS '19. pp. 80–90 (2019)
8. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., et al.: Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015)
9. Nouri, S.M.R., Li, H., Venugopal, S., Guo, W., He, M., Tian, W.: Autonomic decentralized elasticity based on a reinforcement learning controller for cloud applications. Future Gener. Comput. Syst. 94, 765–780 (2019)
10. Rossi, F., Cardellini, V., Lo Presti, F.: Hierarchical scaling of microservices in Kubernetes. In: Proc. of IEEE ACSOS '20. pp. 28–37 (2020)
11. Rossi, F., Cardellini, V., Lo Presti, F.: Self-adaptive threshold-based policy for microservices elasticity. In: Proc. of IEEE MASCOTS '20. pp. 1–8 (2020)
12. Rossi, F.: Auto-scaling policies to adapt the application deployment in Kubernetes. In: CEUR Workshop Proc., vol. 2575, pp. 30–38 (2020)
13. Rossi, F., Cardellini, V., Lo Presti, F., Nardelli, M.: Geo-distributed efficient deployment of containers with Kubernetes. Comput. Commun. 159, 161–174 (2020)
14. Rossi, F., Nardelli, M., Cardellini, V.: Horizontal and vertical scaling of container-based applications using Reinforcement Learning. In: Proc. of IEEE CLOUD '19. pp. 329–338 (2019)
15. Shevtsov, S., Weyns, D., Maggio, M.: Self-adaptation of software using automatically generated control-theoretical solutions. In: Engineering Adaptive Software Systems. pp. 35–55 (2019)
16. Srirama, S.N., Adhikari, M., Paul, S.: Application deployment using containers with auto-scaling for microservices in cloud environment. J. Netw. Comput. Appl. 160 (2020)
17. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. 2nd edn. MIT Press, Cambridge, MA (2018)
18. Tang, Z., Zhou, X., Zhang, F., Jia, W., Zhao, W.: Migration modeling and learning algorithms for containers in fog computing. IEEE Trans. Serv. Comput. 12(5), 712–725 (2019)
19. Tesauro, G., Jong, N.K., Das, R., Bennani, M.N.: A hybrid Reinforcement Learning approach to autonomic resource allocation. In: Proc. of IEEE ICAC '06. pp. 65–73 (2006)
20. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Proc. of AISTATS '10. vol. 9, pp. 249–256 (2010)
21. Zhao, D., Mohamed, M., Ludwig, H.: Locality-aware scheduling for containers in cloud computing. IEEE Trans. Cloud Comput. 8(2), 635–646 (2018)