Auto-scaling Policies to Adapt the Application Deployment in Kubernetes

Fabiana Rossi
Department of Civil Engineering and Computer Science Engineering
University of Rome Tor Vergata, Italy
f.rossi@ing.uniroma2.it

Abstract. The ever increasing diffusion of computing devices enables a new generation of containerized applications that operate in a distributed cloud environment. Moreover, the dynamism of working conditions calls for an elastic application deployment, which can adapt to changing workloads. Despite this, most of the existing orchestration tools, such as Kubernetes, include best-effort threshold-based scaling policies whose tuning can be cumbersome and application-dependent. In this paper, we compare the default threshold-based scaling policy of Kubernetes against our model-based reinforcement learning policy. Our solution learns a suitable scaling policy from experience, so as to meet Quality of Service requirements expressed in terms of average response time. Using prototype-based experiments, we show the benefits and flexibility of our reinforcement learning policy with respect to the default Kubernetes scaling solution.

Keywords: Kubernetes · Elasticity · Reinforcement Learning · Self-adaptive systems.

1 Introduction

Elasticity allows the application deployment to be adapted at run-time in the face of changing working conditions (e.g., the incoming workload) and to meet stringent Quality of Service (QoS) requirements. By exploiting operating system-level virtualization, software containers simplify the deployment and management of applications, while also offering a reduced computational overhead with respect to virtual machines. The most popular container management system is Docker, which simplifies the creation, distribution, and execution of applications inside containers. Although a container management system can be used to deploy individual containers, managing a complex application (or multiple applications) at run-time requires an orchestration tool. The latter automates container provisioning, management, communication, and fault-tolerance. Although several orchestration tools exist [5,8], Kubernetes (https://kubernetes.io), an open-source platform introduced by Google in 2014, is the most popular solution. Kubernetes includes a Horizontal Pod Autoscaler that automatically scales the application deployment using a threshold-based policy driven by cluster-level metrics (i.e., CPU utilization). However, this threshold-based scaling policy is not well suited to satisfy the QoS requirements of latency-sensitive applications. Determining a suitable threshold is cumbersome: it requires identifying the relation between a system metric (i.e., utilization) and an application metric (i.e., response time), as well as knowing the application bottleneck (e.g., in terms of CPU or memory). In this paper, we compare the default threshold-based scaling policy of Kubernetes against model-free and model-based reinforcement learning policies [14]. Our model-based solution automatically learns a suitable scaling policy from experience, so as to meet QoS requirements expressed in terms of average response time.
To perform such a comparison, we use our extension of Kubernetes, which includes a more flexible autoscaler that can be easily equipped with new scaling policies.

The remainder of the paper is organized as follows. In Section 2, we discuss related work. In Section 3, we describe the main features of Kubernetes. Then, we propose a reinforcement learning-based scaling policy to adapt at run-time the deployment of containerized applications (Section 4). In Section 5, we evaluate the proposed solutions using prototype-based experiments, showing the flexibility and efficacy of a reinforcement learning solution compared to the default Kubernetes scaling policy. In Section 6, we outline ongoing and future research directions.

2 Related Work

Container elasticity is exploited to achieve different objectives: improving application performance (e.g., [4]), load balancing and resource utilization (e.g., [1,11]), or energy efficiency (e.g., [3]), and reducing the deployment cost (e.g., [6,2]). A few works also consider a combination of deployment goals (e.g., [18]). Threshold-based policies are the most popular approach to scale containers at run-time (e.g., [4,10]). The most noteworthy orchestration tools (e.g., Kubernetes, Docker Swarm, Amazon ECS, and Apache Hadoop YARN) also usually rely on best-effort threshold-based scaling policies driven by cluster-level metrics (e.g., CPU utilization). However, all these approaches require a non-trivial manual tuning of the thresholds, which can also be application-dependent. To overcome this issue, solutions in the literature propose container deployment methods ranging from mathematical programming to machine learning. The mathematical programming approaches exploit methods from operational research to solve the application deployment problem (e.g., [12,13,18]). Since such a problem is NP-hard, more efficient solutions are needed. In the last few years, reinforcement learning (RL) has become a widespread approach to solve the application deployment problem at run-time. RL is a machine learning technique by which an agent learns how to make (scaling) decisions through a sequence of interactions with the environment [15]. Most of the existing solutions consider classic model-free RL algorithms (e.g., [7,16,17]), which however suffer from a slow convergence rate. To tackle this issue, in [14] we propose a novel model-based RL solution that exploits what is known (or can be estimated) about the system dynamics to adapt the application deployment at run-time. Experimental results based on Docker Swarm have shown the flexibility of our approach, which can learn different adaptation strategies according to the optimized deployment objectives (e.g., meeting QoS requirements in terms of average response time). Moreover, we have shown that the model-based RL agent learns a better adaptation policy than other model-free RL solutions. Encouraged by these promising results, in this paper we integrate the model-based RL solution in Kubernetes, one of the most popular container orchestration tools in both academia and industry. Experimental results in [8] demonstrate that Kubernetes performs better than other existing orchestration tools, such as Docker Swarm, Apache Mesos, and Cattle. However, Kubernetes is not well suited to manage latency-sensitive applications in an extremely dynamic environment.
It is equipped with a static best-effort deployment policy that relies on system-oriented metrics to scale applications in the face of workload variations. In this paper, we first extend Kubernetes so as to easily introduce self-adaptation capabilities. Then, we integrate RL policies into Kubernetes and compare them against the default Kubernetes auto-scaling solution.

3 Kubernetes

Kubernetes is an open-source orchestration platform that simplifies the deployment, management, and execution of containerized applications. Based on a master-worker pattern, it can replicate containers to improve resource usage, load distribution, and fault-tolerance. The master node maintains the desired state at run-time by orchestrating applications (using pods). A worker is a computing node that offers its computational capability to enable the execution of pods in a distributed manner. A pod is the smallest deployment unit in Kubernetes. When multiple containers run within a pod, they are co-located and scaled as an atomic entity. To simplify the deployment of applications, Kubernetes introduces Deployment Controllers that can dynamically create and destroy pods, so as to ensure that the desired state (described in the deployment file) is preserved at run-time. Kubernetes also includes a Horizontal Pod Autoscaler (https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/) to automatically scale the number of pods in a Deployment based on the ratio between the target value and the observed value of the pods' CPU utilization. Setting the CPU utilization threshold is a cumbersome and error-prone task, and may require knowledge of the application resource usage to be effective. To address this limitation, we equip Kubernetes with a decentralized control loop. In a single loop iteration, it monitors the environment and the containerized applications, analyzes application-level (i.e., response time) and cluster-level (i.e., CPU utilization) metrics, and plans and executes the corresponding scaling actions. The modularity of the control loop allows us to easily equip it with different QoS-aware scaling policies, as sketched below. To dynamically adapt the application deployment according to workload variations, we consider RL policies.
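To make the structure of such a control loop concrete, the following minimal Python sketch outlines an autoscaler with a pluggable scaling policy; the interface names (ScalingPolicy, ThresholdPolicy) and the monitor/execute callables are hypothetical and do not come from our prototype or from the Kubernetes code base. The threshold policy implements the proportional rule documented for the Horizontal Pod Autoscaler, i.e., desired = ceil(current pods · observed utilization / target utilization).

```python
import math
from abc import ABC, abstractmethod

class ScalingPolicy(ABC):
    """Pluggable scaling policy interface (hypothetical, for illustration only)."""

    @abstractmethod
    def plan(self, current_pods: int, cpu_util: float, resp_time_ms: float) -> int:
        """Return the desired number of pods for the next control interval."""

class ThresholdPolicy(ScalingPolicy):
    """HPA-style rule: desired = ceil(current_pods * observed_util / target_util)."""

    def __init__(self, target_util: float, min_pods: int = 1, max_pods: int = 10):
        self.target_util = target_util
        self.min_pods, self.max_pods = min_pods, max_pods

    def plan(self, current_pods, cpu_util, resp_time_ms):
        desired = math.ceil(current_pods * cpu_util / self.target_util)
        return max(self.min_pods, min(self.max_pods, desired))

def control_loop_iteration(policy, monitor, execute):
    """One loop iteration: monitor metrics, plan with the policy, execute the scaling."""
    current_pods, cpu_util, resp_time_ms = monitor()             # monitor + analyze
    desired = policy.plan(current_pods, cpu_util, resp_time_ms)  # plan
    if desired != current_pods:
        execute(desired)                                         # e.g., patch the Deployment replicas
```

An RL-based policy can then be plugged in by implementing the same plan() interface, mirroring the modularity of the control loop described above.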
4 Reinforcement Learning Scaling Policy

Differently from the Kubernetes scaling policy, we aim to design a flexible solution that can be easily customized according to the desired deployment objectives, without requiring a cumbersome manual tuning of configuration parameters. In this paper, we customize the RL solution proposed in [14] to scale the number of application instances (i.e., pods) at run-time. RL refers to a collection of trial-and-error methods by which an agent must prefer actions that it found to be effective in the past (exploitation); however, to discover such actions, it has to try new ones (exploration). In a single control loop iteration, the RL agent selects the adaptation action to be performed. As a first step, according to the received application- and cluster-oriented metrics, the RL agent determines the Deployment Controller state and updates the expected long-term cost (i.e., the Q-function). We define the application state as s = (k, u), where k is the number of application instances (i.e., pods) and u is the monitored CPU utilization. We denote by S the set of all application states. We assume that k ∈ {1, 2, ..., Kmax}; since the CPU utilization u is a real number, we discretize it so that u ∈ {0, ū, ..., Lū}, where ū is a suitable quantum. For each state s ∈ S, we define the set of possible adaptation actions as A(s) ⊆ {−1, 0, 1}, where ±1 denotes a scaling action (i.e., +1 to scale out and −1 to scale in) and 0 is the do-nothing decision. Obviously, not all actions are available in every application state, due to the upper and lower bounds on the number of pods per application (i.e., Kmax and 1, respectively). Then, according to an action selection policy, the RL agent identifies the scaling action a to be performed in state s. The execution of a in s leads to a transition to a new application state s' and to the payment of an immediate cost. We define the immediate cost c(s, a, s') as the weighted sum of different terms, namely the performance penalty cperf, the resource cost cres, and the adaptation cost cadp. We normalize them in the interval [0, 1], where 0 represents the best value (no cost) and 1 the worst value (highest cost). Formally, we have: c(s, a, s') = wperf · cperf + wres · cres + wadp · cadp, where wperf, wres, and wadp are non-negative weights, with wperf + wres + wadp = 1, that allow us to express the relative importance of each cost term. We can observe that the formulation of the immediate cost function c(s, a, s') is general enough to be easily customized with other QoS requirements. The performance penalty is paid whenever the average application response time exceeds the target value Rmax. The resource cost is proportional to the number of application instances (i.e., pods). The adaptation cost captures the cost introduced by Kubernetes to perform a scaling operation. The traffic routing strategy used in Kubernetes forwards application requests to a newly added pod, even if not all containers in the pod are running yet. We observe that, for this reason, we prefer horizontal scaling to vertical scaling operations. When a vertical scaling action changes a pod configuration (e.g., to update its CPU limit), Kubernetes spawns new pods as a replacement for those with the old configuration. During this phase, the application availability decreases and only a subset of the incoming requests is processed. Conversely, a scale-out action introduces a reduced adaptation cost, inversely proportional to the number of application instances.
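To illustrate the state and cost definitions, the following Python sketch discretizes the monitored metrics into a state s = (k, u) and computes the immediate cost. The helper names and the concrete per-term formulas (a binary performance penalty, a linear resource cost, and an adaptation cost charged only on scale-out) are simplifying assumptions for illustration, not the exact definitions used in [14]; the default parameter values anticipate the configuration used in the evaluation of Section 5.

```python
def discretize_state(num_pods: int, cpu_util: float, u_quantum: float = 0.1):
    """Map the monitored metrics to a discrete state s = (k, u)."""
    u = round(cpu_util / u_quantum) * u_quantum   # quantize utilization to multiples of u_quantum
    return (num_pods, round(u, 2))

def immediate_cost(resp_time_ms: float, num_pods: int, action: int,
                   r_max: float = 80.0, k_max: int = 10,
                   w_perf: float = 0.90, w_res: float = 0.09, w_adp: float = 0.01) -> float:
    """Weighted sum c = w_perf*c_perf + w_res*c_res + w_adp*c_adp, each term in [0, 1]."""
    c_perf = 1.0 if resp_time_ms > r_max else 0.0   # penalty paid when R_max is violated
    c_res = num_pods / k_max                        # proportional to the deployed pods
    # Adaptation cost: charged here only on scale-out and assumed inversely
    # proportional to the number of running instances (an illustrative choice).
    c_adp = 1.0 / num_pods if action == +1 else 0.0
    return w_perf * c_perf + w_res * c_res + w_adp * c_adp
```

The allowed actions in a state can then be restricted to honor the bounds 1 ≤ k ≤ Kmax, e.g., by excluding +1 when k = Kmax and −1 when k = 1.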
The received immediate cost contributes to updating the Q-function. The Q-function consists of Q(s, a) terms, which represent the expected long-term cost that follows the execution of action a in state s. The existing RL policies differ in how they update the Q-function and how they select the adaptation action to be performed (i.e., the action selection policy) [15]. To adapt the application deployment, we consider a model-based solution, which we have extensively evaluated in [14]. At any decision step, the proposed model-based RL solution does not use an exploration-oriented action selection policy (e.g., the ε-greedy policy) but always selects the best action in terms of Q-values, i.e., a = arg min_{a' ∈ A(s)} Q(s, a'). Moreover, to update the Q-function, the simple weighted average of the traditional RL solutions (e.g., Q-learning) is replaced by the Bellman equation [15]:

Q(s, a) = \sum_{s' \in S} p(s' \mid s, a) \left[ c(s, a, s') + \gamma \min_{a' \in A(s')} Q(s', a') \right], \quad \forall s \in S, \; \forall a \in A(s)   (1)

where γ ∈ [0, 1) is the discount factor, and p(s'|s, a) and c(s, a, s') are, respectively, the transition probabilities and the cost function, for all s, s' ∈ S and a ∈ A(s). Thanks to the gathered experience, the proposed model-based solution maintains an empirical model of the unknown external system dynamics (i.e., p(s'|s, a) and c(s, a, s')), speeding up the learning phase. Further details on our model-based RL solution can be found in [14].

5 Results

We show the self-adaptation capabilities of Kubernetes when equipped with model-free and model-based RL policies, as well as with the default threshold-based solution (i.e., the Horizontal Pod Autoscaler). The RL solutions scale pods using user-oriented QoS attributes (i.e., response time), whereas the Horizontal Pod Autoscaler uses a best-effort threshold-based policy driven by cluster-level metrics (i.e., CPU utilization). The evaluation uses a cluster of 4 virtual machines of the Google Cloud Platform; each virtual machine has 2 vCPUs and 7.5 GB of RAM (type: n1-standard-2). We consider a reference CPU-intensive application that computes the sum of the first n elements of the Fibonacci sequence. As shown in Figure 1, the application receives a varying number of requests. It follows the workload of a real distributed application [9], accordingly amplified and accelerated so as to further stress the application resource requirements.

Figure 1: Workload used for the reference application (data rate in reqs/s over time in minutes).

The application expresses its QoS in terms of a target response time Rmax = 80 ms. To meet Rmax, it is important to adapt the number of application instances accordingly. The Kubernetes autoscaler executes a control loop every 3 minutes. To learn an adaptation policy, we parameterize the model-based RL algorithm as in our previous work [14]. For the sake of comparison, we also consider the model-free Q-learning approach, which chooses a scaling action according to the ε-greedy selection policy: at any decision step, the Q-learning agent chooses a random action with probability ε, whereas, with probability 1 − ε, it chooses the best known action. For Q-learning, we set ε to 10%. To discretize the application state, we use Kmax = 10 and ū = 0.1. For the immediate cost function, we consider the set of weights wperf = 0.90, wres = 0.09, wadp = 0.01. This weight configuration allows us to optimize the application response time, considered to be more important than saving resources and reducing the adaptation costs.
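To make the difference between the two learning approaches concrete, the following Python sketch contrasts the ε-greedy Q-learning update with the model-based full backup of Equation (1), using the parameters reported above (ε = 0.1, Kmax = 10). The learning rate, the discount factor, and the dictionary-based model estimates are placeholders for illustration; this is not the code of our prototype.

```python
import random
from collections import defaultdict

Q = defaultdict(float)                 # Q-table, zero-initialized

def allowed_actions(state, k_max=10):
    k, _ = state
    actions = [0]
    if k < k_max: actions.append(+1)   # scale-out only below the upper bound
    if k > 1:     actions.append(-1)   # scale-in only above one pod
    return actions

def epsilon_greedy_action(state, epsilon=0.1):
    """Model-free Q-learning: explore with probability epsilon, otherwise exploit."""
    actions = allowed_actions(state)
    if random.random() < epsilon:
        return random.choice(actions)
    return min(actions, key=lambda a: Q[(state, a)])    # costs: lower is better

def q_learning_update(s, a, cost, s_next, alpha, gamma):
    """Temporal-difference update: weighted average of old estimate and new sample."""
    target = cost + gamma * min(Q[(s_next, b)] for b in allowed_actions(s_next))
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def model_based_update(P, C, states, gamma):
    """Full backup of Eq. (1) using estimated transition probabilities P[(s, a, s')]
    and expected costs C[(s, a, s')]; action selection is then greedy on Q."""
    for s in states:
        for a in allowed_actions(s):
            Q[(s, a)] = sum(P.get((s, a, s2), 0.0) *
                            (C.get((s, a, s2), 0.0) +
                             gamma * min(Q[(s2, b)] for b in allowed_actions(s2)))
                            for s2 in states)
```

In the model-based case, the estimates P and C are built online from the observed transitions and paid costs, which is what speeds up convergence with respect to the model-free update.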
Figure 2: Application performance (response time, CPU utilization, and number of pods over time) using the Horizontal Pod Autoscaler, with the scaling threshold at (a) 60%, (b) 70%, and (c) 80% of CPU utilization.

The default Kubernetes threshold-based scaling policy is application-unaware and not flexible, meaning that it is not easy to satisfy the QoS requirements of latency-sensitive applications by setting a threshold on CPU utilization (see Figures 2a–2c). From Table 1, we can observe that small changes in the threshold setting lead to a significant performance deterioration. Setting the scaling threshold is cumbersome: e.g., with the threshold at 80% of CPU utilization, we obtain a rather high number of Rmax violations. With the scaling threshold at 70% of CPU utilization, the application violates Rmax 21% of the time, with an average CPU utilization of 54%. With the scaling threshold at 60% of CPU utilization, the application has better performance (Rmax is exceeded only 9% of the time), even though a finer threshold tuning might further improve it.

Table 1: Application performance under the different scaling policies.

Elasticity policy   Rmax violations (%)   Avg. CPU utilization (%)   Avg. number of pods   Median response time (ms)   Adaptations (%)
Model-based         14.40                 48.51                      3.75                  16.38                       56.00
Q-learning          64.0                  76.94                      2.90                  201.71                      65.6
HPA thr = 60        9.20                  50.81                      3.54                  16.11                       6.38
HPA thr = 70        21.43                 54.43                      3.14                  34.61                       10.96
HPA thr = 80        40.12                 63.70                      3.18                  37.54                       12.89

Conversely, the RL approach is general and more flexible, requiring only the specification of the desired deployment objectives. It allows the user to indicate what to obtain (through the cost function weights), instead of how it should be obtained. In particular, an RL agent learns the scaling policy in an automatic manner. Figures 3a and 3b show the application performance when the model-free and model-based RL solutions are used. The RL agent starts with no knowledge of the adaptation policy, so it begins by exploring the cost of each adaptation action. When Q-learning is used, the RL agent only slowly learns how to adapt the application deployment. As we can see from Figure 3a and Table 1, the application deployment is continuously updated (i.e., 66% of the time) and the RL agent does not learn a good adaptation policy within the experiment duration. As a consequence, the application response time exceeds Rmax most of the time. Taking advantage of the system knowledge, the model-based solution has a very different behavior: it obtains better performance and reacts more quickly to workload variations. We can see that, in the first minutes of the experiment, the model-based solution does not always respect the target application response time. However, as soon as a suitable adaptation policy is learned, the model-based RL solution can successfully scale the application and meet the response time requirement most of the time. The learned adaptation policy deploys a number of pods that follows the application workload (see Figures 1 and 3b), maintaining a reduced number of Rmax violations (14.4%) and a good average resource utilization (49%).

Figure 3: Application performance (response time, CPU utilization, and number of pods over time) using RL policies: (a) Q-learning; (b) model-based RL.
We should observe that, even though a fine-grained threshold tuning can be performed (thus improving the performance of the default Kubernetes scaling policy), the RL-based approach automatically learns a suitable and satisfactory adaptation strategy. Moreover, by changing the cost function weights, the RL solution can easily learn different scaling policies, e.g., to improve resource utilization or to reduce deployment adaptations [14].

6 Conclusion

Kubernetes is one of the most popular orchestration tools to manage containers in a distributed environment. To react to workload variations, it includes a threshold-based scaling policy that changes the application deployment according to cluster-level metrics. However, this approach is not well suited to meet stringent QoS requirements. In this paper, we compare model-free and model-based RL scaling policies against the default threshold-based solution. The prototype-based results have shown the flexibility and benefits of RL solutions: while model-free Q-learning suffers from a slow convergence time, the model-based approach can successfully learn the best adaptation policy according to the user-defined deployment goals.

As future work, we plan to investigate the deployment of applications in a geo-distributed environment, including edge/fog computing resources located at the network edge. The default Kubernetes scheduler spreads containers over computing resources without taking into account the non-negligible network delays among them. This can negatively impact the performance of latency-sensitive applications. Therefore, alongside the elasticity problem, also the placement problem (or scheduling problem) should be efficiently solved at run-time. We want to extend the proposed heuristic so as to efficiently control the scaling and placement of multi-component applications (e.g., micro-services). When an application consists of multiple components that cooperate to accomplish a common task, adapting the deployment of one component impacts the performance of the other components. We are interested in considering the application as a whole, so as to develop policies that can proactively adapt the deployment of inter-connected components, avoiding performance penalties.

Acknowledgment

The author would like to thank her supervisor, Prof. Valeria Cardellini, and to acknowledge the support by Google through the GCP research credits program.

References

1. Abdelbaky, M., Diaz-Montes, J., Parashar, M., Unuvar, M., Steinder, M.: Docker containers across multiple clouds and data centers. In: Proc. of IEEE/ACM UCC 2015. pp. 368–371 (2015)
2. Al-Dhuraibi, Y., Paraiso, F., Djarallah, N., Merle, P.: Autonomic vertical elasticity of Docker containers with ElasticDocker. In: Proc. of IEEE CLOUD '17. pp. 472–479 (2017)
3. Asnaghi, A., Ferroni, M., Santambrogio, M.D.: DockerCap: A software-level power capping orchestrator for Docker containers. In: Proc. of IEEE EUC '16. pp. 90–97 (2016)
4. Barna, C., Khazaei, H., Fokaefs, M., Litoiu, M.: Delivering elastic containerized cloud applications to enable DevOps. In: Proc. of SEAMS '17. pp. 65–75 (2017)
5. Casalicchio, E.: Container orchestration: A survey. In: Systems Modeling: Methodologies and Tools, pp. 221–235. Springer International Publishing, Cham (2019)
6. Guan, X., Wan, X., Choi, B.Y., Song, S., Zhu, J.: Application oriented dynamic resource allocation for data centers using Docker containers. IEEE Commun. Lett. 21(3), 504–507 (2017)
7. Horovitz, S., Arian, Y.: Efficient cloud auto-scaling with SLA objective using Q-learning. In: Proc. of IEEE FiCloud '18. pp. 85–92 (2018)
8. Jawarneh, I.M.A., Bellavista, P., Bosi, F., Foschini, L., Martuscelli, G., Montanari, R., Palopoli, A.: Container orchestration engines: A thorough functional and performance comparison. In: Proc. of IEEE ICC 2019. pp. 1–6 (2019)
9. Jerzak, Z., Ziekow, H.: The DEBS 2015 grand challenge. In: Proc. of ACM DEBS 2015. pp. 266–268 (2015)
10. Khazaei, H., Ravichandiran, R., Park, B., Bannazadeh, H., Tizghadam, A., Leon-Garcia, A.: Elascale: Autoscaling and monitoring as a service. In: Proc. of CASCON '17. pp. 234–240 (2017)
11. Mao, Y., Oak, J., Pompili, A., Beer, D., Han, T., Hu, P.: DRAPS: Dynamic and resource-aware placement scheme for Docker containers in a heterogeneous cluster. In: Proc. of IEEE IPCCC '17. pp. 1–8 (2017)
12. Nardelli, M., Cardellini, V., Casalicchio, E.: Multi-level elastic deployment of containerized applications in geo-distributed environments. In: Proc. of IEEE FiCloud '18. pp. 1–8 (2018)
13. Rossi, F., Cardellini, V., Lo Presti, F.: Elastic deployment of software containers in geo-distributed computing environments. In: Proc. of IEEE ISCC '19. pp. 1–7 (2019)
14. Rossi, F., Nardelli, M., Cardellini, V.: Horizontal and vertical scaling of container-based applications using Reinforcement Learning. In: Proc. of IEEE CLOUD '19. pp. 329–338 (2019)
15. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 2nd edn. (2018)
16. Tang, Z., Zhou, X., Zhang, F., Jia, W., Zhao, W.: Migration modeling and learning algorithms for containers in fog computing. IEEE Trans. Serv. Comput. 12(5), 712–725 (2019)
17. Tesauro, G., Jong, N.K., Das, R., Bennani, M.N.: A hybrid Reinforcement Learning approach to autonomic resource allocation. In: Proc. of IEEE ICAC '06. pp. 65–73 (2006)
18. Zhao, D., Mohamed, M., Ludwig, H.: Locality-aware scheduling for containers in cloud computing. IEEE Trans. Cloud Comput. (2018)