ImpalaE: Towards an optimal policy for efficient resource management at the edge

Tania Lorido-Botran (a), Muhammad Khurram Bhatti (b)

(a) Bilbao, Spain
(b) Information Technology University, Arfa Software Technology Park, Ferozepur Road, Lahore, Pakistan

QuaInT 2021: Workshop on the Quantum Information Technologies, April 11, 2021, Zhytomyr, Ukraine
doors 2021: Edge Computing Workshop, April 11, 2021, Zhytomyr, Ukraine
tania.lorido@deusto.es (T. Lorido-Botran); khurram.bhatti@itu.edu.pk (M.K. Bhatti)

Abstract

Edge computing is an extension of cloud computing where physical servers are deployed closer to the users in order to reduce latency. Edge data centers face the challenge of serving a continuously increasing number of applications with a reduced capacity compared to traditional data centers. This paper introduces ImpalaE, an agent based on Deep Reinforcement Learning that aims at optimizing resource usage in edge data centers. First, it proposes modeling the problem as a Markov Decision Process with two optimization objectives: reducing the number of physical servers used and maximizing the number of applications placed in the data center. Second, it introduces an agent based on Proximal Policy Optimization for finding the optimal consolidation policy, together with an asynchronous architecture with multiple workers and a shared learner that enables faster convergence, even with a reduced amount of data. We show its potential in a simulated edge data center scenario with different VM sizes based on real Microsoft Azure traces, considering CPU, memory, disk and network requirements. Experiments show that ImpalaE effectively increases the number of VMs that can be placed per episode and that it quickly converges to an optimal policy.

Keywords

Edge Computing, Policy Gradient, Reinforcement Learning, Efficient Resource Management

1. Introduction

Cloud Computing providers have become popular and have quickly replaced private data centers. Many businesses, government organizations and research centers rely on external clouds to run their workloads. However, Cloud data centers are usually located far away from the end-user and the perceived latency might not be up to standard. In recent years, the Edge Computing paradigm has augmented Cloud capabilities by placing computing facilities and services close to end users. Thus, Edge data centers are able to provide low latency and mobility to delay-sensitive applications. According to a market growth study [1], Edge Computing was valued at USD 1.93 Billion in 2018 and is projected to reach USD 10.96 Billion by 2026. This growth in revenue clearly reflects the increased interest in these services. The Edge computing platform is expected to deliver consistent performance despite the rapid increase in application demand, especially from Internet-of-Things applications such as autonomous vehicles producing data from their various cameras, radars or accelerometers. The new challenge for edge service providers is to perform efficient resource management of their edge data centers with reduced computation and storage capabilities [2].
In particular, providers will look for automated solutions that can adapt to the varying demand and diverse workloads.

Reinforcement Learning (RL) is a family of self-adaptive algorithms that has been successfully applied to multiple domains, from the popular AlphaGo [3] for playing the game of Go, to autonomous driving, drug discovery, personalized recommendations and the optimization of chemical reactions. RL has also been applied to cloud resource optimization, both for horizontal and vertical scalability [4, 5, 6]. Similarly, RL has the potential to provide an efficient and automated solution to resource management at the Edge.

2. Related work

Edge computing has received increasing attention in recent years. A common use case scenario is the off-loading of certain requests to different Edge data centers. Liu et al. [7] focus on the task scheduling problem and propose an RL-based scheduling solution that successfully offloads certain tasks to other data centers. Some authors have proposed DRL-based solutions for the offloading of VMs [8]. However, computation offloading might lead to load-imbalance issues, as some edge data centers in the region could be overloaded while others sit idle [9].

Unbalanced data centers lead to performance degradation and wasted resources. One approach would be to spread the load equally among the different edge data centers. Puthal et al. [10] take this approach and propose a solution based on Breadth-First Search to keep the application load equally distributed. However, edge data centers are characterized by scarce resources compared to traditional servers, and a load-balancing approach will not maximize the number of applications that can be served.

There are clashing objectives between the end-user and the service provider. The end-user expects guaranteed application performance, while the provider wants to maximize its revenue by increasing the number of serviced applications. In order to meet both the end-user's and the provider's expectations, it seems reasonable to define the overall objective as a consolidation problem: placing as many requests as possible using the minimal capacity, always subject to resource constraints. With this goal in mind, some authors have focused on the execution of tasks on edge data centers [7, 11]. Zhu et al. [11] introduce two approximation scheduling algorithms focused on minimizing energy consumption and reducing the overall task execution delay.

As stated by Khan et al. [2], edge data centers can benefit from the use of Virtual Machines to co-allocate multiple applications on the same physical server. Tao et al. [12] gather a list of proposed solutions that handle VM placement in edge data centers. The proposed optimization methods range from Mixed-Integer Linear and Non-Linear Programming [13, 14] to Particle Swarm Optimization [15]. However, there seems to be a lack of solutions exploring the potential of RL for optimal VM placement in edge data centers, aiming at minimizing resource wastage.

To the best of our knowledge, this is the first attempt to explore the application of policy-gradient RL methods to achieve efficient resource management in edge data centers. This paper introduces an agent (named ImpalaE) that uses a policy-gradient method to find the optimal placement policy, together with a distributed architecture that enables fast training.
The resource management problem is formulated with a bi-objective function that tries to (1) reduce the number of physical servers utilized and (2) maximize the number of applications that can be placed in the edge data center.

3. Background: Policy-Based Reinforcement Learning

The basic elements in an RL problem are the agent and the environment. The agent continuously interacts with the environment, observes the current state and decides the best action to take. After some time, the agent observes the reward obtained after applying that action. The goal is to learn an optimal policy $\pi_\theta(a|s)$ that maps each state to its optimal action.

3.1. Vanilla Policy Gradient (PG)

There are different approaches to learning the optimal policy. As the name suggests, policy-based algorithms directly learn the policy without an intermediary function. The policy $\pi_\theta(a|s)$ is approximated with a deep neural network parameterized by a vector of policy parameters $\theta$. The goal is to adjust the values of these parameters such that the policy maximizes the reward obtained from the environment.

Policy gradient methods rely on applying stochastic gradient ascent as an iterative process. At each step, the algorithm estimates the gradient of some scalar performance objective $J(\theta_k)$ and updates the policy parameters $\theta$:

$$\theta_{k+1} = \theta_k + \alpha \nabla_\theta J(\theta_k) \qquad (1)$$

The gradient of $J(\pi_\theta)$ for the Vanilla Policy Gradient can be calculated as follows:

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \, A^{\pi_\theta}(s_t, a_t) \right], \qquad (2)$$

where $\tau$ is an episode, that is, a sequence of states and actions, e.g. a pre-defined sequence of requests and their corresponding placements in the edge data center, and $\mathbb{E}$ denotes the average over a batch of samples.

The main drawback of Vanilla PG is its high gradient variance, which hinders convergence to an optimal policy. The advantage function $A^{\pi_\theta}$ included in the gradient helps reduce this variance. Without going deep into the details, the advantage function evaluates how good an action is compared to the average action for a specific state.

3.2. Proximal Policy Optimization (PPO)

PPO [16] aims to optimize the gradient update taken at each step, ensuring that it improves the objective function while keeping the difference with respect to the previous policy relatively small. Too big an update might cause a divergence from the optimal policy. PPO imposes a constraint on the policy gradient updates as follows:

$$J(\theta) = L^{CLIP}(\theta) = \mathbb{E}_t \left[ \min\left( r_t(\theta) A_t, \; \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t \right) \right] \qquad (3)$$

There are two main modifications with respect to the vanilla PG method. The first one is the ratio $r_t(\theta) = \pi_\theta(a_t|s_t) / \pi_{\theta_{old}}(a_t|s_t)$, which compares the current policy (after the update) with the old policy (just before the update). Additionally, PPO relies on a clipping function $\mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)$ that keeps the value of $r_t$ within the range defined by $1-\epsilon$ and $1+\epsilon$. PPO with clipping is used as the core agent for ImpalaE.
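To make the clipped objective of Eq. (3) concrete, the snippet below sketches it in plain NumPy. The paper's models are built with TensorFlow, so this is only an illustrative sketch; the function and variable names are our own choices, not the authors' code.

```python
import numpy as np

def ppo_clip_objective(log_probs_new, log_probs_old, advantages, epsilon=0.4):
    """Illustrative PPO-Clip surrogate objective (Eq. 3).

    log_probs_new -- log pi_theta(a_t|s_t) under the current policy
    log_probs_old -- log pi_theta_old(a_t|s_t) under the policy that collected the data
    advantages    -- advantage estimates A_t
    epsilon       -- clipping parameter (0.4 is the value reported in Table 1)
    """
    # Probability ratio r_t(theta) between the new and the old policy.
    ratio = np.exp(log_probs_new - log_probs_old)
    # Unclipped and clipped surrogate terms.
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # Element-wise minimum, averaged over the batch.
    return float(np.mean(np.minimum(unclipped, clipped)))
```

In practice the learner maximizes this objective, typically by taking several minibatch gradient steps with Adam on its negative, as summarized in Algorithm 1 below.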
The full logic is depicted in Algorithm 1:

Algorithm 1: PPO with clipping
Input: initial policy parameters $\theta_0$, clipping threshold $\epsilon$
for k = 0, 1, 2, ... do
    Collect a set of partial trajectories (episodes) $\tau$ on policy $\pi = \pi(\theta_k)$
    Estimate the advantages $A_t$ using any advantage estimation algorithm
    Update the policy by maximizing the PPO-Clip objective, $\theta_{k+1} = \arg\max_\theta L^{CLIP}(\theta)$, typically by taking $K$ steps of minibatch stochastic gradient descent with the Adam optimizer
end

3.3. Importance Weighted Actor-Learner Architectures (IMPALA)

IMPALA [17] is a state-of-the-art algorithm developed by DeepMind. It uses the vanilla Policy Gradient at its core, but introduces two significant improvements: a distributed architecture and a correction algorithm, V-trace. First, it introduces a highly scalable architecture that relies on a single (or multiple) learner and multiple workers (see figure 1). In traditional RL approaches [18], each worker updates its local model parameters before each episode and communicates gradients to the main learner. IMPALA proposes a loosely coupled architecture where each worker focuses on collecting trajectories of experience (states, actions, rewards). The learner then asynchronously samples batches of experience from the workers, computes the policy gradients and updates the current model. This architecture enables the learner to be accelerated by a GPU and the workers to be distributed across different nodes, collecting experience from different domains (e.g. independent edge data centers).

The high scalability of the IMPALA architecture comes at a cost. Each worker interacts with its environment based on a policy that is slightly older than the main learner's policy, since the learner broadcasts the updated weights in a periodic and asynchronous manner. In order to address this divergence, Espeholt et al. [17] introduce a correction algorithm called V-trace that readjusts the value function $V(s)$ for each state and accounts for the lag in each action decision.
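As a rough illustration of this worker/learner decoupling, the sketch below uses Python threads and a shared queue: workers roll out trajectories with a possibly stale snapshot of the parameters, while the learner asynchronously consumes batches and broadcasts new weights. The environment, policy and update rule are deliberately toy placeholders and V-trace is omitted, so this is only a structural sketch, not the IMPALA (or ImpalaE) implementation.

```python
import queue
import threading
import numpy as np

experience_queue = queue.Queue()        # trajectories flow from workers to the learner
shared_weights = {"w": np.zeros(4)}     # toy stand-in for the policy parameters
weights_lock = threading.Lock()
stop_event = threading.Event()

def worker(worker_id):
    """Collect short trajectories using a local (possibly stale) copy of the weights."""
    while not stop_event.is_set():
        with weights_lock:
            local_w = shared_weights["w"].copy()      # snapshot of the last broadcast
        trajectory = []
        for _ in range(8):                            # short rollout in a dummy environment
            state = np.random.rand(4)
            action = int(np.argmax(state + local_w))  # toy "policy"
            reward = -float(np.sum(state))            # toy reward
            trajectory.append((state, action, reward))
        experience_queue.put((worker_id, trajectory))

def learner(num_updates=50, batch_size=4):
    """Asynchronously consume batches of trajectories and update the central weights."""
    for _ in range(num_updates):
        batch = [experience_queue.get() for _ in range(batch_size)]
        # Placeholder "gradient step": in IMPALA this is where the V-trace-corrected
        # policy-gradient update would happen.
        mean_reward = np.mean([r for _, traj in batch for (_, _, r) in traj])
        with weights_lock:
            shared_weights["w"] += 0.01 * mean_reward  # toy update, then broadcast
    stop_event.set()

threads = [threading.Thread(target=worker, args=(i,)) for i in range(2)]
threads.append(threading.Thread(target=learner))
for t in threads:
    t.start()
for t in threads:
    t.join()
```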
4. ImpalaE: efficient resource management at the Edge

This paper introduces ImpalaE, an agent designed to address the specific resource management needs of the Edge Computing paradigm. The agent specializes in edge data centers that use Virtual Machines as an abstraction layer to place applications. It relies on Policy Gradient Reinforcement Learning to learn and adapt to different VM request arrival patterns and dynamic resource usage. By combining PPO with an asynchronous architecture, it quickly finds the optimal placement policy that squeezes the maximum performance out of the reduced capacity of an edge data center. As a first step, the Edge computing environment is formulated so that it is suitable for an RL-based agent.

4.1. Environment modeling

The scenario is one or more edge data centers composed of $n$ physical servers. Each physical server has a given capacity for a set of $m$ resources. The agent has to learn the optimal policy $\pi$ that matches each incoming request, expressed as a VM type with specific resource requirements, with the best physical server available. The overall goal is to maximize the number of requests that can be served given the current capacity. With this goal in mind, the resource management problem on edge data centers can be formulated as a Markov Decision Process (MDP) as follows:

State space: The state $s$ at time $t$ is defined as the current resource usage in the data center, together with the request received at time $t$. The resource usage of each physical server is expressed as a normalized variable in the range $[0, 1]$ for each of the $m$ resources considered. Additionally, each physical server has an associated binary variable $s_i$, which indicates whether it is active (i.e. it has any load assigned to it) or not. Overall, the resource usage of the data center is a multi-dimensional vector of shape $[n, m+1]$. Each request $v$ corresponds to a VM type, defined as a set of $m$ resource requirements that need to be satisfied. For the current case, we consider $m = 4$ resources, namely CPU, memory, disk and network capacity.

Action space: The action space $A$ is the set of $n$ physical servers available in the data center. At time $t$, $A_t$ is defined as the subset of servers where the current request $v$ could be placed, that is, without ever exceeding the capacity of the machine:

$$A_t = \{\, a \in A \mid \textstyle\sum_{i=1}^{m} u_{a,i} + v_i \le 1 \,\} \qquad (4)$$

where $u_{a,i}$ is the current utilization value of physical server $a$ for resource $i$, and $v_i$ is the capacity requested for resource $i$.

Reward definition: The primary goal in the edge data center is to maximize the number of requests that can be served with the available capacity. The reward function $R$ is defined with this goal in mind and is composed of two objectives. The first objective, $R_1$, accounts for the amount of unused resources in the data center, normalized by the total capacity $n \cdot m$:

$$R_1 = - \frac{\sum_{i=1}^{n} s_i \cdot f_i}{n \cdot m} \qquad (5)$$

where $f_i$ is the total amount of free capacity across the $m$ resources of physical server $i$. The reward only accounts for free resources in active physical servers, i.e. those with $s_i = 1$.

The second part of the reward function directly accounts for the number of requests remaining to be placed in the current episode:

$$R_2 = - \frac{P - V}{V} \qquad (6)$$

The final reward function is simply the linear combination of $R_1$ and $R_2$ with equal weights.
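To make this formulation more tangible, the following sketch shows one possible encoding of the state, the feasible action set of Eq. (4) and the first reward term of Eq. (5). All names are illustrative, and the capacity check is applied per resource, which is how we read Eq. (4) given that utilizations are normalized per resource.

```python
import numpy as np

def feasible_actions(utilization, request):
    """Servers that can host the request without exceeding capacity (cf. Eq. 4).

    utilization -- (n, m) array with normalized usage per server and resource
    request     -- (m,) array with the normalized resource demand of the incoming VM
    """
    # A server is feasible if adding the request keeps every resource within capacity.
    fits = np.all(utilization + request <= 1.0, axis=1)
    return np.flatnonzero(fits)

def state_vector(utilization, active, request):
    """Flattened [n, m+1] data-center state concatenated with the current request."""
    per_server = np.concatenate([utilization, active[:, None]], axis=1)  # shape (n, m+1)
    return np.concatenate([per_server.ravel(), request])

def reward_r1(utilization, active):
    """First reward term (cf. Eq. 5): penalize free capacity on active servers."""
    n, m = utilization.shape
    free = (1.0 - utilization).sum(axis=1)       # f_i: free capacity of each server
    return -float((active * free).sum() / (n * m))
```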
4.2. Agent architecture

The proposed agent is based on the asynchronous architecture introduced in [17], from which it takes its name, ImpalaE. It consists of a main learner and one or more workers (see figure 1). Each worker interacts with the environment using its local copy of the network (only performing inference) and stores (state, action, reward) samples. The main learner asynchronously samples batches from each of the workers and uses them to update the central network. After that, the learner broadcasts the updated network weights to each of the workers in an asynchronous manner. This architecture enables faster, parallel collection of environment information, which in turn leads to quicker convergence toward the optimal policy.

Figure 1: Architecture for ImpalaE

The learner is based on the PPO algorithm with clipping (see Algorithm 1) for finding the optimal policy, that is, the best placement of each incoming VM request in the edge data center. The network model uses a shared architecture for the policy and the value function. It consists of a feed-forward neural network with TanH activation functions. In order to speed up convergence, the learner makes use of a replay buffer. This buffer stores all the instances composed of (state, action, reward, next_state). Periodically, the learner samples batch_size instances from the buffer to perform a gradient update of the policy network. Finally, the learner leverages V-trace [17], a correction algorithm that fixes discrepancies in the instances caused by the asynchronous architecture. Table 1 contains a summary of the configuration used in the experimental evaluation:

Table 1: Parameter configuration for ImpalaE

Type          | Parameter                  | Symbol | Value
Scenario      | Number of physical servers | n      | 500
Scenario      | Number of resources        | m      | 4
Scenario      | Number of actions          | |A|    | n
ImpalaE       | Learning rate              | α      | 0.005
ImpalaE       | Train batch size           |        | 500
ImpalaE       | Optimization algorithm     |        | Adam
ImpalaE       | Clipping parameter         |        | 0.4
ImpalaE       | Number of workers          |        | 2
Network Model | Input layer                |        | (n + 1) * (m + 1)
Network Model | Hidden layer 1             |        | 1024
Network Model | Hidden layer 2             |        | 1024
Network Model | Output layer               |        | n

5. Experimental evaluation

The following set of experiments is designed to evaluate the general performance of ImpalaE, compared against other policy-gradient methods from the state of the art, as well as the convergence and scalability of the agent architecture.

Testing environment: A simulated environment of an edge data center with a certain number of homogeneous physical servers (same capacity). Each physical server and VM request is defined in terms of its CPU, memory, network and disk requirements. The resource specification is normalized between 0 and 1 (as required by the model input). The experiments are based on real-world traces collected from a Microsoft Azure data center [19, 20] (in particular, 15 VM types assigned to the machine identified with id 0). All algorithms are implemented in Python v3.8, the models are implemented using TensorFlow v2.5.0, and training runs on a GPU. The hardware for the experiments is a machine with an Intel Core i7-10510U, 16 GB of RAM and an NVIDIA GeForce MX330.

Baseline methods: ImpalaE is compared against one heuristic method, Round Robin, and two other state-of-the-art RL algorithms: (vanilla) Policy Gradient (PG) and Proximal Policy Optimization (PPO).
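The paper does not detail the Round Robin baseline, so the sketch below is only a plausible interpretation of such a placement heuristic: it cycles through the servers and places each request on the next one with enough free capacity.

```python
import numpy as np

def round_robin_place(utilization, request, start_index=0):
    """Place one request on the next server (in cyclic order) that can host it.

    Returns the chosen server index and the next starting position,
    or (None, start_index) if no server has enough free capacity.
    """
    n = utilization.shape[0]
    for offset in range(n):
        server = (start_index + offset) % n
        if np.all(utilization[server] + request <= 1.0):
            utilization[server] += request            # commit the placement
            return server, (server + 1) % n
    return None, start_index
```

Because the pointer keeps advancing, the load is spread thinly across many servers, each of which retains a small leftover slice of capacity; this is consistent with the resource fragmentation effect discussed in the results below.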
5.1. Convergence and performance evaluation

The main goal of ImpalaE is to quickly converge to the optimal placement policy, the one that optimizes resource usage and maximizes the number of requests that can be accommodated in the edge data center. In the first scenario, the data center is composed of 500 physical servers and has enough capacity to serve an episode consisting of 1000 VM requests. Requests are randomly drawn from a set of 14 VM types extracted from the Azure data center traces (machineID 0).

Figure 2: Training results for ImpalaE, PPO and PG

For fairness of results, the same network architecture is used for ImpalaE, PPO and PG. The network contains 2 hidden layers with 1024 units each. When the agent architecture allows it, two workers are used in the training process. Figure 2 shows the convergence results for ImpalaE, PPO, PG and Round Robin. In fewer than 30 iterations, ImpalaE converges to the optimal policy. In contrast, both PG and PPO reach a sub-optimal policy (lower than the heuristic-based agent, Round Robin), with a lower mean reward per episode. PG takes a high number of iterations to converge.

The second scenario is designed to stress the agent's ability to make optimal placement decisions in cases of high occupancy. The data center again consists of 500 physical servers, but in this case 2000 VM requests have to be placed in each episode. The data center does not have enough capacity to serve all of them. Figure 3 shows the percentage of placed requests, calculated as the mean of the last 5 iterations. The heuristic-based agent (Round Robin) only manages to accommodate 25% of the requests. This is inherent to the nature of the Round Robin algorithm, which tries to spread out the load across different nodes. This naturally leads to resource fragmentation and limits the number of requests that can be placed in the data center. In contrast, the RL-based agents quickly learn a policy that tries to maximize resource utilization. Both state-of-the-art baseline methods, PPO and PG, achieve a higher rate of successful placements than the heuristic agent, 89% and 91% respectively. Thanks to its parallel architecture, the ImpalaE agent is able to explore more scenarios in a shorter amount of time and thus further refine its policy, scoring the highest placement rate: 94% of the 2000 VM requests within the same edge data center.

Figure 3: Mean percentage of placed requests per episode

5.2. Agent scalability

The single learner-multiple workers architecture makes the proposed agent highly scalable, which in turn allows for faster convergence. The next experiment explores the impact of the number of workers on the training process. The scenario uses 500 physical servers and 1000 VM requests per episode, and compares the performance of PPO and ImpalaE (see figure 4). As expected, PPO shows the slowest convergence rate, easily surpassed by ImpalaE with a single worker. At its core, ImpalaE relies on several workers interacting with the environment and gathering as much information as possible; that is, they explore different data center scenarios and placement decisions and record the outcome of each decision (did it improve request acceptance?). For this reason, increasing the number of workers naturally improves the placement policy (higher reward) and leads to earlier convergence. In this particular case, ImpalaE achieves the best results with 4 workers. However, it is interesting to note that an additional worker (up to 5) actually yields a slightly worse policy, which might be due to high variance in the sampling. We leave a deeper analysis of the algorithm's stability during training for future work.

A well-known drawback of RL-based agents is the extremely long time (hours) needed to converge to an optimal policy, which makes it unfeasible to deploy such agents in a production environment. This experiment analyses the overall training time of the agent for a data center composed of 500 physical servers. As figure 5 shows, the baseline method, PPO, requires around 37 minutes of total training time. In contrast, the parallel architecture of ImpalaE allows it to reduce the training time to only 4.4 minutes with 4 workers. This is an especially appealing feature for highly dynamic environments, where workload request patterns and resource usage change abruptly.

Figure 4: Mean reward per episode during training time

Figure 5: Convergence time

6. Conclusions and future work

Edge computing was born as an extension of the widely used Cloud computing paradigm, with the difference that computing resources are located closer to the end-user, which is imperative for latency-critical applications. Edge computing providers face the additional challenge of performing optimal resource management of their reduced-capacity data centers while trying to meet client demand. This paper introduces ImpalaE, an agent based on Deep Reinforcement Learning, specially designed to optimize resource usage at the edge.
It leverages Proximal Policy Optimization for finding the best placement policy for applications in edge data centers. It is also based on the IMPALA architecture, an asynchronous paradigm composed of one learner and multiple parallel workers that speeds up convergence, even with a reduced amount of data. The paper also models the edge computing environment as a Markov Decision Process with a bi-objective reward function specially designed to squeeze out maximum performance. The validity of ImpalaE is assessed in a simulated environment with VM requests based on real Microsoft Azure traces, considering CPU, memory, disk and network requirements.

The full potential of the IMPALA architecture is yet to be explored. It has demonstrated higher performance with less data and the ability to transfer information among tasks [17]. One natural extension would be to expand ImpalaE to multiple data centers that learn an optimal policy per data center, but also benefit from asynchronously exchanging information among the different agents. However, there is also a need for deeper experimentation on training stability with a larger number of workers. The current environment model takes into account the network bandwidth needs of each application; it could be further extended to consider the communication pattern among the different nodes or VMs within an application. The reward function could also be augmented with other objectives, such as the application latency experienced by the end-user or the energy consumption of the data center.

References

[1] M. I. Reports, Global Edge Computing Market Size, Status And Forecast 2020-2026, 2021-02.
[2] W. Z. Khan, E. Ahmed, S. Hakak, I. Yaqoob, A. Ahmed, Edge computing: A survey, Future Generation Computer Systems 97 (2019) 219–235.
[3] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., Mastering the game of Go with deep neural networks and tree search, Nature 529 (2016) 484–489.
[4] Z. Wang, C. Gwon, T. Oates, A. Iezzi, Automated cloud provisioning on AWS using deep reinforcement learning, arXiv preprint arXiv:1709.04305 (2017).
[5] B. Du, C. Wu, Z. Huang, Learning resource allocation and pricing for cloud profit maximization, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 2019, pp. 7570–7577.
[6] S. Zhang, T. Wu, M. Pan, C. Zhang, Y. Yu, A-SARSA: A Predictive Container Auto-Scaling Algorithm Based on Reinforcement Learning, in: 2020 IEEE International Conference on Web Services (ICWS), IEEE, 2020, pp. 489–497.
[7] J. Liu, Y. Mao, J. Zhang, K. B. Letaief, Delay-optimal computation task scheduling for mobile-edge computing systems, in: 2016 IEEE International Symposium on Information Theory (ISIT), IEEE, 2016, pp. 1451–1455.
[8] X. Qiu, L. Liu, W. Chen, Z. Hong, Z. Zheng, Online deep reinforcement learning for computation offloading in blockchain-empowered mobile edge computing, IEEE Transactions on Vehicular Technology 68 (2019) 8050–8062.
[9] K.-K. R. Choo, R. Lu, L. Chen, X. Yi, A foggy research future: Advances and future opportunities in fog computing research, 2018.
[10] D. Puthal, M. S. Obaidat, P. Nanda, M. Prasad, S. P. Mohanty, A. Y. Zomaya, Secure and sustainable load balancing of edge data centers in fog computing, IEEE Communications Magazine 56 (2018) 60–65.
[11] T. Zhu, T. Shi, J. Li, Z. Cai, X. Zhou, Task scheduling in deadline-aware mobile edge computing systems, IEEE Internet of Things Journal 6 (2018) 4854–4866.
[12] Z. Tao, Q. Xia, Z. Hao, C. Li, L. Ma, S. Yi, Q. Li, A survey of virtual machine management in edge computing, Proceedings of the IEEE 107 (2019) 1482–1499.
[13] Q. Fan, N. Ansari, Cost aware cloudlet placement for big data processing at the edge, in: 2017 IEEE International Conference on Communications (ICC), IEEE, 2017, pp. 1–6.
[14] S. Mondal, G. Das, E. Wong, CCOMPASSION: A hybrid cloudlet placement framework over passive optical access networks, in: IEEE INFOCOM 2018 - IEEE Conference on Computer Communications, IEEE, 2018, pp. 216–224.
[15] Y. Li, S. Wang, An energy-aware edge server placement algorithm in mobile edge computing, in: 2018 IEEE International Conference on Edge Computing (EDGE), IEEE, 2018, pp. 66–73.
[16] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov, Proximal policy optimization algorithms, arXiv preprint arXiv:1707.06347 (2017).
[17] L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, et al., IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures, in: International Conference on Machine Learning, PMLR, 2018, pp. 1407–1416.
[18] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, K. Kavukcuoglu, Asynchronous methods for deep reinforcement learning, in: International Conference on Machine Learning, PMLR, 2016, pp. 1928–1937.
[19] O. Hadary, L. Marshall, I. Menache, A. Pan, E. E. Greeff, D. Dion, S. Dorminey, S. Joshi, Y. Chen, M. Russinovich, et al., Protean: VM Allocation Service at Scale, in: 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), 2020, pp. 845–861.
[20] Trace, Azure Public Dataset, 2021-02. URL: https://github.com/Azure/AzurePublicDataset.