ImpalaE: Towards an optimal policy for efficient resource management at the edge

Tania Lorido-Botran (a), Muhammad Khurram Bhatti (b)

(a) Bilbao, Spain
(b) Information Technology University, Arfa Software Technology Park, Ferozepur Road, Lahore, Pakistan

QuaInT 2021: Workshop on the Quantum Information Technologies, April 11, 2021, Zhytomyr, Ukraine
doors 2021: Edge Computing Workshop, April 11, 2021, Zhytomyr, Ukraine
tania.lorido@deusto.es (T. Lorido-Botran); khurram.bhatti@itu.edu.pk (M.K. Bhatti)

Abstract

Edge computing is an extension of cloud computing where physical servers are deployed closer to the users in order to reduce latency. Edge data centers face the challenge of serving a continuously increasing number of applications with a reduced capacity compared to traditional data centers. This paper introduces ImpalaE, an agent based on Deep Reinforcement Learning that aims at optimizing resource usage in edge data centers. First, it proposes modeling the problem as a Markov Decision Process with two optimization objectives: reducing the number of physical servers used and maximizing the number of applications placed in the data center. Second, it introduces an agent based on Proximal Policy Optimization for finding the optimal consolidation policy, together with an asynchronous architecture with multiple workers and a shared learner that enables faster convergence, even with a reduced amount of data. We show its potential in a simulated edge data center scenario with different VM sizes based on real Microsoft Azure traces, considering CPU, memory, disk and network requirements. Experiments show that ImpalaE effectively increases the number of VMs that can be placed per episode and that it quickly converges to an optimal policy.

Keywords

Edge Computing, Policy Gradient, Reinforcement Learning, Efficient Resource Management

1. Introduction

Cloud Computing providers have become popular and have quickly replaced private data centers. Many businesses, government organizations and research centers rely on external clouds to run their workloads. However, Cloud data centers are usually located far away from the end-user and the perceived latency might not be up to standard. In recent years, the Edge Computing paradigm has augmented Cloud capabilities by placing computing facilities and services close to end users. Thus, Edge data centers are able to provide low latency and mobility to delay-sensitive applications. According to a market growth study [1], Edge Computing was valued at USD 1.93 Billion in 2018 and is projected to reach USD 10.96 Billion by 2026. This growth in revenue clearly reflects the increased interest in these services. The Edge computing platform is expected to deliver consistent performance despite the rapid increase in application demand, especially from Internet-of-Things applications such as autonomous vehicles producing data from their various cameras, radars or accelerometers. The new challenge for edge service providers is to perform efficient resource management of their edge data centers with reduced computation and storage capabilities [2].
In particular, providers will look for automated solutions that can adapt to the varying demand and diverse workloads.

Reinforcement Learning (RL) is a family of self-adaptive algorithms that has been successfully applied to multiple domains, from the popular AlphaGo [3] for playing the game of Go, to autonomous driving, drug discovery, personalized recommendations and the optimization of chemical reactions. RL has also been applied to cloud resource optimization, both for horizontal and vertical scalability [4, 5, 6]. Similarly, RL has the potential to provide an efficient and automated solution to resource management at the Edge.

2. Related work

Edge computing has received increasing attention in recent years. A common use case scenario is the off-loading of certain requests to different Edge data centers. Liu et al. [7] focus on the task scheduling problem and propose an RL-based scheduling solution that successfully offloads certain tasks to other data centers. Some authors have proposed DRL-based solutions for the offloading of VMs [8]. However, computation offloading might lead to load-imbalance issues, as some edge data centers in the region could be overloaded while others sit idle [9].

Unbalanced data centers lead to performance degradation and wasted resources. One approach would be to spread the load equally among the different edge data centers. Puthal et al. [10] take this approach and propose a solution based on Breadth-First Search to keep the application load equally distributed. However, edge data centers are characterized by scarce resources compared to traditional servers, and a load-balancing approach will not maximize the number of applications that can be served.

There are clashing objectives between the end-user and the service provider. The end-user expects guaranteed application performance, while the provider wants to maximize its revenue by increasing the number of serviced applications. In order to meet both the end-user's and the provider's expectations, it seems reasonable to define the overall objective as a consolidation problem: placing as many requests as possible using the minimal capacity, always subject to resource constraints. With this goal in mind, some authors have focused on the execution of tasks on edge data centers [7, 11]. Zhu et al. [11] introduce two approximation scheduling algorithms focused on minimizing energy consumption and reducing the overall task execution delay.

As stated by Khan et al. [2], edge data centers can benefit from the use of Virtual Machines to co-allocate multiple applications on the same physical server. Tao et al. [12] gather a list of proposed solutions that handle VM placement in edge data centers. The proposed optimization methods range from Mixed-Integer Linear and Non-Linear Programming [13, 14] to Particle Swarm Optimization [15]. However, there seems to be a lack of solutions exploring the potential of RL for optimal VM placement in edge data centers, aiming at minimizing resource wastage.

To the best of our knowledge, this is the first attempt to explore the application of policy-gradient RL methods to achieve efficient resource management in edge data centers. This paper introduces an agent (named ImpalaE) that uses a policy-gradient method to find the optimal placement policy, together with a distributed architecture that enables fast training.
The resource management problem is formulated with a bi-objective function that tries to (1) reduce the number of physical servers utilized and (2) maximize the number of applications that can be placed in the edge data center.

3. Background: Policy-Based Reinforcement Learning

The basic elements in an RL problem are the agent and the environment. The agent continuously interacts with the environment, observes the current state and decides the best action to take. After some time, the agent observes the reward obtained after applying that action. The goal is to learn an optimal policy $\pi_\theta(a|s)$ that maps each state to its optimal action.

3.1. Vanilla Policy Gradient (PG)

There are different approaches to learning the optimal policy. As the name suggests, policy-based algorithms directly learn the policy without an intermediary function. The policy $\pi_\theta(a|s)$ is approximated with a deep neural network parameterized by a vector of policy parameters $\theta$. The goal is to adjust the values of these parameters such that the policy maximizes the reward obtained from the environment.

Policy gradient methods rely on applying stochastic gradient ascent as an iterative process. At each step, the algorithm estimates the gradient of some scalar performance objective $J(\theta_k)$ and updates the policy parameters $\theta$:

$$\theta_{k+1} = \theta_k + \alpha \nabla_\theta J(\theta_k) \qquad (1)$$

The gradient of $J(\pi_\theta)$ for the Vanilla Policy Gradient can be calculated as follows:

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \, A^{\pi_\theta}(s_t, a_t) \right], \qquad (2)$$

where $\tau$ is an episode, that is, a sequence of states and actions, e.g. a pre-defined sequence of requests and their corresponding placements in the edge data center, and $\mathbb{E}$ denotes the average over a batch of samples.

The main drawback of Vanilla PG is its high gradient variance, which hinders convergence to an optimal policy. The advantage function $A^{\pi_\theta}$ included in the gradient helps reduce this variance. Without going deep into the details, the advantage function evaluates how good an action is compared to the average action for a specific state.

3.2. Proximal Policy Optimization (PPO)

PPO [16] aims to optimize the gradient update taken at each step, ensuring that it improves the objective function while keeping the difference with respect to the previous policy relatively small. Too big an update might cause a divergence from the optimal policy. PPO imposes a constraint on the policy gradient updates as follows:

$$J(\theta) = L^{CLIP}(\theta) = \mathbb{E}_t \left[ \min\left( r_t(\theta) A_t, \; \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t \right) \right] \qquad (3)$$

There are two main modifications with respect to the vanilla PG method. The first one is the ratio $r_t(\theta) = \pi_\theta(a_t|s_t) / \pi_{\theta_{old}}(a_t|s_t)$, which compares the current policy (after the update) with the old policy (just before the update). Additionally, PPO relies on a clipping function $\mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)$ that keeps the value of $r_t$ within the range defined by $1-\epsilon$ and $1+\epsilon$. PPO with clipping is used as the core agent for ImpalaE.
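To make the clipped objective of Eq. (3) concrete, the snippet below sketches it in plain NumPy. The paper's models are built with TensorFlow, so this is only an illustrative sketch; the function and variable names are our own choices, not the authors' code.

```python
import numpy as np

def ppo_clip_objective(log_probs_new, log_probs_old, advantages, epsilon=0.4):
    """Illustrative PPO-Clip surrogate objective (Eq. 3).

    log_probs_new -- log pi_theta(a_t|s_t) under the current policy
    log_probs_old -- log pi_theta_old(a_t|s_t) under the policy that collected the data
    advantages    -- advantage estimates A_t
    epsilon       -- clipping parameter (0.4 is the value reported in Table 1)
    """
    # Probability ratio r_t(theta) between the new and the old policy.
    ratio = np.exp(log_probs_new - log_probs_old)
    # Unclipped and clipped surrogate terms.
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # Element-wise minimum, averaged over the batch.
    return float(np.mean(np.minimum(unclipped, clipped)))
```

In practice the learner maximizes this objective, typically by taking several minibatch gradient steps with Adam on its negative, as summarized in Algorithm 1 below.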
The full logic is depicted in Algorithm 1:

Algorithm 1: PPO with clipping
Input: initial policy parameters $\theta_0$, clipping threshold $\epsilon$
for k = 0, 1, 2, ... do
    Collect a set of partial trajectories (episodes) $\tau$ on policy $\pi = \pi(\theta_k)$
    Estimate the advantages $A_t$ using any advantage estimation algorithm
    Update the policy by maximizing the PPO-Clip objective, $\theta_{k+1} = \arg\max_\theta L^{CLIP}(\theta)$, typically by taking $K$ steps of minibatch stochastic gradient descent with the Adam optimizer
end

3.3. Importance Weighted Actor-Learner Architectures (IMPALA)

IMPALA [17] is a state-of-the-art algorithm developed by DeepMind. It uses the vanilla Policy Gradient at its core, but introduces two significant improvements: a distributed architecture and a correction algorithm, V-trace. First, it introduces a highly scalable architecture that relies on a single (or multiple) learner and multiple workers (see figure 1). In traditional RL approaches [18], each worker updates its local model parameters before each episode and communicates gradients to the main learner. IMPALA proposes a loosely coupled architecture where each worker focuses on collecting trajectories of experience (states, actions, rewards). The learner then asynchronously samples batches of experience from the workers, computes the policy gradients and updates the current model. This architecture enables the learner to be accelerated by a GPU and the workers to be distributed across different nodes, collecting experience from different domains (e.g. independent edge data centers).

The high scalability of the IMPALA architecture comes at a cost. Each worker interacts with its environment based on a policy that is slightly older than the main learner's policy, since the learner broadcasts the updated weights in a periodic and asynchronous manner. In order to address this divergence, Espeholt et al. [17] introduce a correction algorithm called V-trace that readjusts the value function $V(s)$ for each state and accounts for the lag in each action decision.
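As a rough illustration of this worker/learner decoupling, the sketch below uses Python threads and a shared queue: workers roll out trajectories with a possibly stale snapshot of the parameters, while the learner asynchronously consumes batches and broadcasts new weights. The environment, policy and update rule are deliberately toy placeholders and V-trace is omitted, so this is only a structural sketch, not the IMPALA (or ImpalaE) implementation.

```python
import queue
import threading
import numpy as np

experience_queue = queue.Queue()        # trajectories flow from workers to the learner
shared_weights = {"w": np.zeros(4)}     # toy stand-in for the policy parameters
weights_lock = threading.Lock()
stop_event = threading.Event()

def worker(worker_id):
    """Collect short trajectories using a local (possibly stale) copy of the weights."""
    while not stop_event.is_set():
        with weights_lock:
            local_w = shared_weights["w"].copy()      # snapshot of the last broadcast
        trajectory = []
        for _ in range(8):                            # short rollout in a dummy environment
            state = np.random.rand(4)
            action = int(np.argmax(state + local_w))  # toy "policy"
            reward = -float(np.sum(state))            # toy reward
            trajectory.append((state, action, reward))
        experience_queue.put((worker_id, trajectory))

def learner(num_updates=50, batch_size=4):
    """Asynchronously consume batches of trajectories and update the central weights."""
    for _ in range(num_updates):
        batch = [experience_queue.get() for _ in range(batch_size)]
        # Placeholder "gradient step": in IMPALA this is where the V-trace-corrected
        # policy-gradient update would happen.
        mean_reward = np.mean([r for _, traj in batch for (_, _, r) in traj])
        with weights_lock:
            shared_weights["w"] += 0.01 * mean_reward  # toy update, then broadcast
    stop_event.set()

threads = [threading.Thread(target=worker, args=(i,)) for i in range(2)]
threads.append(threading.Thread(target=learner))
for t in threads:
    t.start()
for t in threads:
    t.join()
```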
4. ImpalaE: efficient resource management at the Edge

This paper introduces ImpalaE, an agent designed to address the specific resource management needs of the Edge Computing paradigm. The agent specializes in edge data centers that use Virtual Machines as an abstraction layer to place applications. It relies on Policy Gradient Reinforcement Learning to learn and adapt to different VM request arrival patterns and dynamic resource usage. By combining PPO with an asynchronous architecture, it quickly finds the optimal placement policy that squeezes the maximum performance out of the reduced capacity of an edge data center. As a first step, the Edge computing environment is formulated so that it is suitable for an RL-based agent.

4.1. Environment modeling

The scenario is one or more edge data centers composed of $n$ physical servers. Each physical server has a given capacity for a set of $m$ resources. The agent has to learn the optimal policy $\pi$ that matches each incoming request, expressed as a VM type with specific resource requirements, with the best physical server available. The overall goal is to maximize the number of requests that can be served given the current capacity. With this goal in mind, the resource management problem on edge data centers can be formulated as a Markov Decision Process (MDP) as follows:

State space: The state $s$ at time $t$ is defined as the current resource usage in the data center, together with the request received at time $t$. The resource usage of each physical server is expressed as a normalized variable in the range $[0, 1]$ for each of the $m$ resources considered. Additionally, each physical server has an associated binary variable $s_i$, which indicates whether it is active (i.e. it has any load assigned to it) or not. Overall, the resource usage of the data center is a multi-dimensional vector of shape $[n, m+1]$. Each request $v$ corresponds to a VM type, defined as a set of $m$ resource requirements that need to be satisfied. For the current case, we consider $m = 4$ resources, namely CPU, memory, disk and network capacity.

Action space: The action space $A$ is the set of $n$ physical servers available in the data center. At time $t$, $A_t$ is defined as the subset of servers where the current request $v$ could be placed, that is, without ever exceeding the capacity of the machine:

$$A_t = \{\, a \in A \mid \textstyle\sum_{i=1}^{m} u_{a,i} + v_i \le 1 \,\} \qquad (4)$$

where $u_{a,i}$ is the current utilization value of physical server $a$ for resource $i$, and $v_i$ is the capacity requested for resource $i$.

Reward definition: The primary goal in the edge data center is to maximize the number of requests that can be served with the available capacity. The reward function $R$ is defined with this goal in mind and is composed of two objectives. The first objective, $R_1$, accounts for the amount of unused resources in the data center, normalized by the total capacity $n \cdot m$:

$$R_1 = - \frac{\sum_{i=1}^{n} s_i \cdot f_i}{n \cdot m} \qquad (5)$$

where $f_i$ is the total amount of free capacity across the $m$ resources of physical server $i$. The reward only accounts for free resources in active physical servers, i.e. those with $s_i = 1$.

The second part of the reward function directly accounts for the number of requests remaining to be placed in the current episode:

$$R_2 = - \frac{P - V}{V} \qquad (6)$$

The final reward function is simply the linear combination of $R_1$ and $R_2$ with equal weights.
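To make this formulation more tangible, the following sketch shows one possible encoding of the state, the feasible action set of Eq. (4) and the first reward term of Eq. (5). All names are illustrative, and the capacity check is applied per resource, which is how we read Eq. (4) given that utilizations are normalized per resource.

```python
import numpy as np

def feasible_actions(utilization, request):
    """Servers that can host the request without exceeding capacity (cf. Eq. 4).

    utilization -- (n, m) array with normalized usage per server and resource
    request     -- (m,) array with the normalized resource demand of the incoming VM
    """
    # A server is feasible if adding the request keeps every resource within capacity.
    fits = np.all(utilization + request <= 1.0, axis=1)
    return np.flatnonzero(fits)

def state_vector(utilization, active, request):
    """Flattened [n, m+1] data-center state concatenated with the current request."""
    per_server = np.concatenate([utilization, active[:, None]], axis=1)  # shape (n, m+1)
    return np.concatenate([per_server.ravel(), request])

def reward_r1(utilization, active):
    """First reward term (cf. Eq. 5): penalize free capacity on active servers."""
    n, m = utilization.shape
    free = (1.0 - utilization).sum(axis=1)       # f_i: free capacity of each server
    return -float((active * free).sum() / (n * m))
```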
4.2. Agent architecture

The proposed agent is based on the asynchronous architecture introduced in [17], from which it takes its name, ImpalaE. It consists of a main learner and one or more workers (see figure 1). Each worker interacts with the environment using its local copy of the network (only performing inference) and stores (state, action, reward) samples. The main learner asynchronously samples batches from each of the workers and uses them to update the central network. After that, the learner broadcasts the updated network weights to each of the workers in an asynchronous manner. This architecture enables faster, parallel collection of environment information, which in turn leads to quicker convergence toward the optimal policy.

Figure 1: Architecture for ImpalaE

The learner is based on the PPO algorithm with clipping (see Algorithm 1) for finding the optimal policy, that is, the best placement of each incoming VM request in the edge data center. The network model uses a shared architecture for the policy and the value function. It consists of a feed-forward neural network with TanH activation functions. In order to speed up convergence, the learner makes use of a replay buffer. This buffer stores all the instances composed of (state, action, reward, next_state). Periodically, the learner samples batch_size instances from the buffer to perform a gradient update of the policy network. Finally, the learner leverages V-trace [17], a correction algorithm that fixes discrepancies in the instances caused by the asynchronous architecture. Table 1 contains a summary of the configuration used in the experimental evaluation:

Table 1: Parameter configuration for ImpalaE

Type          | Parameter                  | Symbol | Value
Scenario      | Number of physical servers | n      | 500
Scenario      | Number of resources        | m      | 4
Scenario      | Number of actions          | |A|    | n
ImpalaE       | Learning rate              | α      | 0.005
ImpalaE       | Train batch size           |        | 500
ImpalaE       | Optimization algorithm     |        | Adam
ImpalaE       | Clipping parameter         |        | 0.4
ImpalaE       | Number of workers          |        | 2
Network Model | Input layer                |        | (n + 1) * (m + 1)
Network Model | Hidden layer 1             |        | 1024
Network Model | Hidden layer 2             |        | 1024
Network Model | Output layer               |        | n

5. Experimental evaluation

The following set of experiments is designed to evaluate the general performance of ImpalaE, compared against other policy-gradient methods from the state of the art, as well as the convergence and scalability of the agent architecture.

Testing environment: A simulated environment of an edge data center with a certain number of homogeneous physical servers (same capacity). Each physical server and VM request is defined in terms of its CPU, memory, network and disk requirements. The resource specification is normalized between 0 and 1 (as required by the model input). The experiments are based on real-world traces collected from a Microsoft Azure data center [19, 20] (in particular, 15 VM types assigned to the machine identified with id 0). All algorithms are implemented in Python v3.8, the models are implemented using TensorFlow v2.5.0, and training runs on a GPU. The hardware for the experiments is a machine with an Intel Core i7-10510U, 16 GB of RAM and an NVIDIA GeForce MX330.

Baseline methods: ImpalaE is compared against one heuristic method, Round Robin, and two other state-of-the-art RL algorithms: (vanilla) Policy Gradient (PG) and Proximal Policy Optimization (PPO).
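The paper does not detail the Round Robin baseline, so the sketch below is only a plausible interpretation of such a placement heuristic: it cycles through the servers and places each request on the next one with enough free capacity.

```python
import numpy as np

def round_robin_place(utilization, request, start_index=0):
    """Place one request on the next server (in cyclic order) that can host it.

    Returns the chosen server index and the next starting position,
    or (None, start_index) if no server has enough free capacity.
    """
    n = utilization.shape[0]
    for offset in range(n):
        server = (start_index + offset) % n
        if np.all(utilization[server] + request <= 1.0):
            utilization[server] += request            # commit the placement
            return server, (server + 1) % n
    return None, start_index
```

Because the pointer keeps advancing, the load is spread thinly across many servers, each of which retains a small leftover slice of capacity; this is consistent with the resource fragmentation effect discussed in the results below.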
5.1. Convergence and performance evaluation

The main goal of ImpalaE is to quickly converge to the optimal placement policy, the one that optimizes resource usage and maximizes the number of requests that can be accommodated in the edge data center. In the first scenario, the data center is composed of 500 physical servers and has enough capacity to serve an episode consisting of 1000 VM requests. Requests are randomly drawn from a set of 14 VM types extracted from the Azure data center traces (machineID 0).

Figure 2: Training results for ImpalaE, PPO and PG

For fairness of results, the same network architecture is used for ImpalaE, PPO and PG. The network contains 2 hidden layers with 1024 units each. When the agent architecture allows it, two workers are used in the training process. Figure 2 shows the convergence results for ImpalaE, PPO, PG and Round Robin. In fewer than 30 iterations, ImpalaE converges to the optimal policy. In contrast, both PG and PPO reach a sub-optimal policy (lower than the heuristic-based agent, Round Robin), with a lower mean reward per episode. PG takes a high number of iterations to converge.

The second scenario is designed to stress the agent's ability to make optimal placement decisions in cases of high occupancy. The data center again consists of 500 physical servers, but in this case 2000 VM requests have to be placed in each episode. The data center does not have enough capacity to serve all of them. Figure 3 shows the percentage of placed requests, calculated as the mean of the last 5 iterations. The heuristic-based agent (Round Robin) only manages to accommodate 25% of the requests. This is inherent to the nature of the Round Robin algorithm, which tries to spread out the load across different nodes. This naturally leads to resource fragmentation and limits the number of requests that can be placed in the data center. In contrast, the RL-based agents quickly learn a policy that tries to maximize resource utilization. Both state-of-the-art baseline methods, PPO and PG, achieve a higher rate of successful placements than the heuristic agent, 89% and 91% respectively. Thanks to its parallel architecture, the ImpalaE agent is able to explore more scenarios in a shorter amount of time and thus further refine its policy, scoring the highest placement rate: 94% of the 2000 VM requests within the same edge data center.

Figure 3: Mean percentage of placed requests per episode

5.2. Agent scalability

The single learner-multiple workers architecture makes the proposed agent highly scalable, which in turn allows for faster convergence. The next experiment explores the impact of the number of workers on the training process. The scenario uses 500 physical servers and 1000 VM requests per episode, and compares the performance of PPO and ImpalaE (see figure 4). As expected, PPO shows the slowest convergence rate, easily surpassed by ImpalaE with a single worker. At its core, ImpalaE relies on several workers interacting with the environment and gathering as much information as possible; that is, they explore different data center scenarios and placement decisions and record the outcome of each decision (did it improve request acceptance?). For this reason, increasing the number of workers naturally improves the placement policy (higher reward) and leads to earlier convergence. In this particular case, ImpalaE achieves the best results with 4 workers. However, it is interesting to note that an additional worker (up to 5) actually yields a slightly worse policy, which might be due to high variance in the sampling. We leave a deeper analysis of the algorithm's stability during training for future work.

A well-known drawback of RL-based agents is the extremely long time (hours) needed to converge to an optimal policy, which makes it unfeasible to deploy such agents in a production environment. This experiment analyses the overall training time of the agent for a data center composed of 500 physical servers. As figure 5 shows, the baseline method, PPO, requires around 37 minutes of total training time. In contrast, the parallel architecture of ImpalaE allows it to reduce the training time to only 4.4 minutes with 4 workers. This is an especially appealing feature for highly dynamic environments, where workload request patterns and resource usage change abruptly.

Figure 4: Mean reward per episode during training time

Figure 5: Convergence time

6. Conclusions and future work

Edge computing was born as an extension of the widely used Cloud computing paradigm, with the difference that computing resources are located closer to the end-user, which is imperative for latency-critical applications. Edge computing providers face the additional challenge of performing optimal resource management of their reduced-capacity data centers while trying to meet client demand. This paper introduces ImpalaE, an agent based on Deep Reinforcement Learning, specially designed to optimize resource usage at the edge.
It leverages Proximal Policy Optimization for finding the best placement policy for applications in edge data centers. It is also based on the IMPALA architecture, an asynchronous paradigm composed of one learner and multiple parallel workers that speeds up convergence, even with a reduced amount of data. The paper also models the edge computing environment as a Markov Decision Process with a bi-objective reward function specially designed to squeeze out maximum performance. The validity of ImpalaE is assessed in a simulated environment with VM requests based on real Microsoft Azure traces, considering CPU, memory, disk and network requirements.

The full potential of the IMPALA architecture is yet to be explored. It has demonstrated higher performance with less data and the ability to transfer information among tasks [17]. One natural extension would be to expand ImpalaE to multiple data centers that learn an optimal policy per data center, but also benefit from asynchronously exchanging information among the different agents. However, there is also a need for deeper experimentation on training stability with a larger number of workers. The current environment model takes into account the network bandwidth needs of each application; it could be further extended to consider the communication pattern among the different nodes or VMs within an application. The reward function could also be augmented with other objectives, such as the application latency experienced by the end-user or the energy consumption of the data center.

References

[1] M. I. Reports, Global Edge Computing Market Size, Status And Forecast 2020-2026, 2021-02.
[2] W. Z. Khan, E. Ahmed, S. Hakak, I. Yaqoob, A. Ahmed, Edge computing: A survey, Future Generation Computer Systems 97 (2019) 219–235.
[3] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., Mastering the game of Go with deep neural networks and tree search, Nature 529 (2016) 484–489.
[4] Z. Wang, C. Gwon, T. Oates, A. Iezzi, Automated cloud provisioning on AWS using deep reinforcement learning, arXiv preprint arXiv:1709.04305 (2017).
[5] B. Du, C. Wu, Z. Huang, Learning resource allocation and pricing for cloud profit maximization, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 2019, pp. 7570–7577.
[6] S. Zhang, T. Wu, M. Pan, C. Zhang, Y. Yu, A-SARSA: A Predictive Container Auto-Scaling Algorithm Based on Reinforcement Learning, in: 2020 IEEE International Conference on Web Services (ICWS), IEEE, 2020, pp. 489–497.
[7] J. Liu, Y. Mao, J. Zhang, K. B. Letaief, Delay-optimal computation task scheduling for mobile-edge computing systems, in: 2016 IEEE International Symposium on Information Theory (ISIT), IEEE, 2016, pp. 1451–1455.
[8] X. Qiu, L. Liu, W. Chen, Z. Hong, Z. Zheng, Online deep reinforcement learning for computation offloading in blockchain-empowered mobile edge computing, IEEE Transactions on Vehicular Technology 68 (2019) 8050–8062.
[9] K.-K. R. Choo, R. Lu, L. Chen, X. Yi, A foggy research future: Advances and future opportunities in fog computing research, 2018.
[10] D. Puthal, M. S. Obaidat, P. Nanda, M. Prasad, S. P. Mohanty, A. Y. Zomaya, Secure and sustainable load balancing of edge data centers in fog computing, IEEE Communications Magazine 56 (2018) 60–65.
[11] T. Zhu, T. Shi, J. Li, Z. Cai, X. Zhou, Task scheduling in deadline-aware mobile edge computing systems, IEEE Internet of Things Journal 6 (2018) 4854–4866.
[12] Z. Tao, Q. Xia, Z. Hao, C. Li, L. Ma, S. Yi, Q. Li, A survey of virtual machine management in edge computing, Proceedings of the IEEE 107 (2019) 1482–1499.
[13] Q. Fan, N. Ansari, Cost aware cloudlet placement for big data processing at the edge, in: 2017 IEEE International Conference on Communications (ICC), IEEE, 2017, pp. 1–6.
[14] S. Mondal, G. Das, E. Wong, CCOMPASSION: A hybrid cloudlet placement framework over passive optical access networks, in: IEEE INFOCOM 2018 - IEEE Conference on Computer Communications, IEEE, 2018, pp. 216–224.
[15] Y. Li, S. Wang, An energy-aware edge server placement algorithm in mobile edge computing, in: 2018 IEEE International Conference on Edge Computing (EDGE), IEEE, 2018, pp. 66–73.
[16] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov, Proximal policy optimization algorithms, arXiv preprint arXiv:1707.06347 (2017).
[17] L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, et al., IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures, in: International Conference on Machine Learning, PMLR, 2018, pp. 1407–1416.
[18] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, K. Kavukcuoglu, Asynchronous methods for deep reinforcement learning, in: International Conference on Machine Learning, PMLR, 2016, pp. 1928–1937.
[19] O. Hadary, L. Marshall, I. Menache, A. Pan, E. E. Greeff, D. Dion, S. Dorminey, S. Joshi, Y. Chen, M. Russinovich, et al., Protean: VM Allocation Service at Scale, in: 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), 2020, pp. 845–861.
[20] Trace, Azure Public Dataset, 2021-02. URL: https://github.com/Azure/AzurePublicDataset.