<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>A Deep Reinforcement Learning approach for hierarchical edge device collaboration</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mohamed Amine Ghamri</string-name>
          <email>mohammedamine.ghamri@emp.mdn.dz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Badis Djamaa</string-name>
          <email>badis.djamaa@emp.mdn.dz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mohamed Akrem Benatia</string-name>
          <email>akrem.benatia@emp.mdn.dz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Issam Eddine Lakhlef</string-name>
          <email>issameddine.lakhlef@emp.mdn.dz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Ecole Militaire Polytechnique</institution>
          ,
          <addr-line>Bordj El Bahri, Algiers</addr-line>
          ,
          <country country="DZ">Algeria</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Artificial intelligence has become an integral part of modern life, with pervasive AI leveraging interconnected devices to deliver intelligent services in real-time environments; some of the most challenging aspects of this vision emerge in the domain of mobile edge computing (MEC). Deploying deep neural networks (DNNs) on such devices and orchestrating the necessary communication require dynamic coordination, an area where reinforcement learning (RL) has shown significant promise. In this paper, we propose a Double Deep Q-Network (DDQN) approach for DNN inference offloading within a mobile edge computing system, efficiently addressing the challenges of dynamic orchestration while optimizing hierarchical collaboration between edge devices. Our experimental results demonstrate the strength of this approach, achieving superior performance in optimizing inference offloading compared to existing solutions.</p>
      </abstract>
      <kwd-group>
<kwd>Reinforcement Learning</kwd>
        <kwd>Deep Neural Network</kwd>
<kwd>task offloading</kwd>
        <kwd>DDQN</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
Artificial intelligence (AI) has found applications in numerous domains, offering transformative solutions
across industries. In particular, deep neural networks (DNNs) have proven effective in tackling complex
challenges, driving innovation in areas such as healthcare [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], autonomous systems [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and smart
cities [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. To fully harness the power of these models, we aim to make AI pervasive—deploying these
sophisticated models across interconnected devices to provide intelligence anytime and anywhere [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
This vision of pervasive AI involves deploying DNNs at different levels of the Internet of Things (IoT)
stack [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], enabling distributed intelligence through the use of connected objects.
      </p>
      <p>
One key challenge in deploying DNNs in IoT environments is the tradeoff between model accuracy and
the computing requirements associated with the limited hardware capabilities of many IoT nodes [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
An effective strategy to address this challenge is the use of split computing [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], where DNNs are
partitioned across different layers of the IoT stack to distribute the computational load efficiently [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
This approach involves dynamic coordination of data exchanges between different nodes, which can
be optimally managed through reinforcement learning (RL). Recognizing the potential of RL, in this
paper, we propose a Double Deep Q-Network (DDQN) based approach to make dynamic DNN inference
offloading decisions, ensuring efficient operation across resource-constrained environments. Our
results demonstrate the effectiveness of this method, showing promising improvements over existing
approaches.
      </p>
      <p>The remainder of this paper is structured as follows: Section 2 provides an overview of related work,
Section 3 details our proposed approach, and Section 4 presents the results of our experiments. Finally,
we conclude by summarizing our contributions and offering directions for future research.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>
Reinforcement learning (RL) has proven to be an effective approach for handling sequential
decision-making problems, especially in dynamic environments. This makes it well-suited for managing
system dynamics in DNN inference offloading. In [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], the authors proposed MECI, an RL-based approach
built on Q-learning, where each device maintains its own Q-table to dynamically make offloading
decisions, such as selecting one of several possible cut layers and choosing the target server. While
this approach effectively handled the problem, its main limitation lies in the inability of Q-learning
to cope with complex state spaces, especially as the number of devices grows, which is a typical
scenario as stated in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. To address this limitation, the authors of [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] later introduced DMECI, a deep
reinforcement learning (DRL) approach based on an actor-critic framework. In this method, the Q-value
estimation is replaced by deep neural network (DNN) function approximators, leveraging both target
and policy networks. Similarly, the work presented in [
        <xref ref-type="bibr" rid="ref11">11</xref>
] addressed DNN inference offloading with a
simplified setup involving only a single cut layer. They used an improved version of Deep Q-Networks
(DQN), which employs two neural networks for value estimation and an additional replay memory to
accelerate convergence. Their goal was to carry out classification tasks until a target confidence level
was reached.
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], the authors addressed DNN inference distribution in an IoT environment with a focus
on security and hardware limitations. Their innovative approach involved distributing portions of
intermediate feature maps across different devices for a collaborative inference process, which they
optimized using the DQN algorithm, showing promising results. Additionally, in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], the authors
framed the inference distribution as a Markov decision process (MDP) in a mobile edge computing
(MEC) environment for signal classification, targeting distributed inference with high accuracy. They
adopted an actor-critic framework with the Deep Deterministic Policy Gradient (DDPG) algorithm,
which produced favorable outcomes. In the context of vehicular edge computing, the work in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]
tackled task offloading by employing a deep RL approach using an actor-critic structure with the DDPG
algorithm. Moreover, in another MEC environment, [
        <xref ref-type="bibr" rid="ref15">15</xref>
] framed the task offloading problem as a
multi-objective optimization targeting quality of service (QoS) requirements using the DQN algorithm.
      </p>
<p>As summarized in Table 1, existing studies commonly address the distributed inference offloading
problem by leveraging reinforcement learning (RL) approaches and their deep variants, which have
proven effective in handling complex state spaces. These RL methods involve formulating the problem
as a Markov Decision Process (MDP), requiring a well-designed state space, action space, and a reward
function aligned with the desired objectives.</p>
      <p>
        These studies can be categorized into policy-based and value-based approaches. Policy-based methods,
such as those employing the actor-critic framework [
        <xref ref-type="bibr" rid="ref13 ref14 ref9">14, 13, 9</xref>
        ], use two neural networks: the critic
network for estimating value functions and the actor network for estimating the policy. Value-based
approaches, on the other hand, focus on estimating the quality or value function for a given state and
action, and then follow a greedy policy to derive the optimal course of action. This approach resembles
the trial-and-error process of an agent learning to navigate a stochastic environment. One notable
algorithm in the value-based category is the Deep Q-Network (DQN), which utilizes two sets of neural
networks. Improvements to DQN, such as the addition of a replay memory in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], have significantly
enhanced its performance. A further advancement is the Double Deep Q-Network (DDQN), introduced
in [16], which has demonstrated considerable success in both edge and cloud environments [17, 18].
Given its effectiveness, DDQN forms the foundation of our work. In our study, we tackle the problem
of DNN inference offloading using the DDQN algorithm, which is detailed in the next section.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. DDQN approach</title>
      <sec id="sec-3-1">
        <title>3.1. DDQN algorithm</title>
        <p>The Double DQN algorithm [16] aims to mitigate the overestimation problem found in traditional
Q-learning. In standard Q-learning and DQN, the same parameters (the online Q network) are used
for both action selection and evaluation, which can result in inflated Q-value estimates, especially for
actions consistently given higher values. Double DQN addresses this issue by introducing a "target
network". Instead of using a single set of parameters for both action
selection and evaluation, Double DQN uses two separate networks: the online network for selecting
actions and the target network for evaluating Q-values. The target network, updated periodically
with the online network’s parameters, provides more stable and less optimistic estimates, reducing
overestimation during evaluation.</p>
        <p>During training, the parameters of the online network are updated by minimizing the loss function,
commonly using the Mean Squared Error (MSE) between the Q-value estimated by the online network
and the target value generated by the target network. This process begins by randomly sampling state
transitions from a replay memory buffer, which holds a collection of past actions and states. Unlike
traditional Q-learning that relies on matrix-based operations, these updates are processed by neural
networks guided by the Bellman equation.</p>
<p>Figure 1 illustrates the Q-Network and its connection to the replay memory buffer, which stores
tuples representing individual experiences. Each tuple contains the current state, the chosen action, the
resulting new state after action execution, the observed reward, and an indication of episode termination.
These stored experiences are sampled to compute the Bellman equation term using the target network’s
inference. The online network is then optimized through gradient descent to approach this target
Bellman value. As shown in Figure 1, both the online and target networks share identical architectures.
Each network is composed of multilayer perceptrons (MLPs) that receive a flattened representation of
the state space (as detailed in Section 3.3) and output value estimates for each possible action.</p>
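        <p>To make the interplay between the two networks concrete, the following minimal PyTorch sketch
builds the twin MLPs and computes the Double DQN loss. The layer sizes, action count, and discount
factor are illustrative assumptions, not the settings used in our experiments.</p>
        <preformat>
# Minimal Double-DQN sketch (PyTorch). Layer sizes, gamma and the batch
# format are illustrative assumptions, not the paper's exact settings.
import copy
import torch
import torch.nn as nn

def make_mlp(state_dim, n_actions):
    # Both networks share this architecture: an MLP mapping the flattened
    # state vector to one Q-value estimate per possible action.
    return nn.Sequential(
        nn.Linear(state_dim, 128), nn.ReLU(),
        nn.Linear(128, 128), nn.ReLU(),
        nn.Linear(128, n_actions),
    )

online = make_mlp(state_dim=13, n_actions=27)
target = copy.deepcopy(online)  # target starts as a copy of the online net

def ddqn_loss(batch, gamma=0.99):
    s, a, r, s2, done = batch  # tensors sampled from the replay memory
    q_sa = online(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Decoupling: the online network selects the next action,
        # the target network evaluates it.
        a2 = online(s2).argmax(dim=1, keepdim=True)
        y = r + gamma * (1.0 - done) * target(s2).gather(1, a2).squeeze(1)
    return nn.functional.mse_loss(q_sa, y)  # minimized by gradient descent
        </preformat>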
        <p>By decoupling the action selection and evaluation processes, Double DQN reduces the likelihood
of overestimating Q-values, leading to more stable and accurate learning. This approach has been
shown to improve the performance and training stability of deep reinforcement learning algorithms,
particularly in environments with large state spaces or complex dynamics.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. System model</title>
        <p>We operate in a mobile edge computing (MEC) system comprising $N$ edge devices, each equipped
with a deployed DNN model. All devices are connected via wireless communication to an edge server,
which also hosts a DNN model to handle any remaining processing tasks. The DNN models are
partitioned into $L$ segments, indexed by $l$ for each device $i$, with each segment requiring a number
of floating-point operations denoted $F_{i,l}$. The partitioning is achieved using cut layers, which in
our current work are selected as pooling layers due to their smaller output sizes, facilitating more
efficient wireless transmission. When a device $i$ performs inference up to a selected cut layer $c_i(t)$,
the computing delay is directly proportional to the number of layers processed and the device's
hardware characteristics, such as its CPU frequency $f_i$. Given that $\kappa$ denotes the device's
computing performance (number of cycles per FLOP), this delay is determined by the following formula:
$$T_i^{\mathrm{comp}}(t) = \sum_{l=0}^{c_i(t)-1} \frac{F_{i,l}\,\kappa}{f_i} \qquad (1)$$
The remaining computation is carried out on the server side, where the CPU frequency is $f_s$, as
determined by the following formula:
$$T_s^{\mathrm{comp}}(t) = \sum_{l=c_i(t)}^{L-1} \frac{F_{i,l}\,\kappa}{f_s} \qquad (2)$$</p>
        <p>For the communication delay, we use Shannon's formula, which relates the transmission delay to
the output data size $d_{c_i(t)}$ from the partial inference, the transmission power $P_i$ of the device,
and the network bandwidth $B$, resulting in the following formula:
$$T_i^{\mathrm{tx}}(t) = \frac{d_{c_i(t)}}{\frac{B}{N}\log_2\!\left(1 + \frac{P_i\,h_i(t)}{\sigma^2}\right)} \qquad (3)$$
Here, $\sigma^2$ represents the spectral noise, while $h_i(t)$ denotes the time-varying channel gain
between the device and the server, capturing the fluctuating network conditions over time. It is
important to note that the available bandwidth is equally divided among all $N$ devices, which is not
the most efficient allocation strategy, as the computing demands of individual devices usually vary.
Nevertheless, we adopt this approach since our current focus is on the reinforcement learning method
rather than optimizing resource allocation. We also note that $d_0$ is equal to the initial input size
multiplied by the compression rate $\rho_i(t)$. The transmission energy consumption is given by:
$$E_i^{\mathrm{tx}}(t) = P_i \times T_i^{\mathrm{tx}}(t) \qquad (4)$$</p>
        <p>We adopt a linear model to estimate the energy consumption during computation, as outlined in [19].
Given the energy efficiency coefficient $\alpha$, the energy consumption is defined as:
$$E_i^{\mathrm{comp}}(t) = \alpha \times f_i^3 \times T_i^{\mathrm{comp}}(t) \qquad (5)$$
An indicator function, equal to $0$ when its argument is negative and $1$ otherwise, distinguishes the
segments executed locally ($l &lt; c_i(t)$) from those offloaded to the server.</p>
        <p>Given $\lambda_i(t)$ arrived input data and a corresponding selected cut layer $c_i(t)$, the total
delay is equal to:
$$T_i(t) = \lambda_i(t)\left(\sum_{l=0}^{c_i(t)-1}\frac{F_{i,l}\,\kappa}{f_i} + \sum_{l=c_i(t)}^{L-1}\frac{F_{i,l}\,\kappa}{f_s} + \frac{d_{c_i(t)}}{\frac{B}{N}\log_2\!\left(1 + \frac{P_i\,h_i(t)}{\sigma^2}\right)}\right) \qquad (6)$$</p>
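        <p>To illustrate how Eqs. (1)-(6) fit together, the short Python sketch below evaluates the delay
and energy of one device for a candidate cut layer. All numeric parameters are made-up placeholders,
not the values of our experimental setup.</p>
        <preformat>
# Illustrative evaluation of the delay/energy model, Eqs. (1)-(6).
# Every number below is a placeholder, not a measured parameter.
import math

F = [2e9, 3e9, 4e9]         # FLOPs per DNN segment (L = 3 segments)
d = [1.5e6, 8e5, 2e5]       # output size (bits) at each candidate cut layer
kappa = 2.0                 # device computing performance (cycles per FLOP)
f_dev, f_srv = 1.5e9, 10e9  # device and server CPU frequencies (Hz)
B, N = 20e6, 3              # total bandwidth (Hz) shared by N devices
P, h, sigma2 = 0.5, 1e-6, 1e-9  # tx power, channel gain, noise power
alpha = 1e-28               # energy-efficiency coefficient of Eq. (5)

rate = (B / N) * math.log2(1 + P * h / sigma2)  # Shannon rate, Eq. (3)

def total_delay(cut):
    t_dev = sum(F[l] * kappa / f_dev for l in range(cut))          # Eq. (1)
    t_srv = sum(F[l] * kappa / f_srv for l in range(cut, len(F)))  # Eq. (2)
    t_tx = d[max(cut - 1, 0)] / rate                               # Eq. (3)
    return t_dev + t_tx + t_srv                                    # Eq. (6)

def device_energy(cut):
    t_dev = sum(F[l] * kappa / f_dev for l in range(cut))
    e_tx = P * d[max(cut - 1, 0)] / rate   # Eq. (4): power x tx delay
    e_comp = alpha * f_dev ** 3 * t_dev    # Eq. (5): linear model of [19]
    return e_tx + e_comp
        </preformat>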
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Problem formulation</title>
        <p>Having presented our system model, we arrive at the formulation of the optimization problem,
which involves minimizing the average service delay over all devices and time steps:
$$\min \; \frac{1}{T}\sum_{t=0}^{T}\sum_{i=1}^{N} T_i(t) \qquad (7)$$
This is achieved by determining the decision variables $c_i(t)$ and $\rho_i(t)$, while accounting for the
stochastic variations in network conditions $h_i(t)$ and the data arrival rate $\lambda_i(t)$. Consequently,
solving this problem is NP-hard, necessitating the use of a reinforcement learning approach that
sequentially optimizes the decision variables through a learned policy. However, prior to this, we need
to model our problem as a Markov decision process (MDP), which involves defining the state, action,
and reward structures.</p>
        <p>• State: We need to monitor the data arrival rate $\lambda_i(t)$, the network conditions through
$h_i(t)$, and the current workloads of both the device and the server. These are represented as the
available computing and transmission queue sizes $\Psi_i^{\mathrm{comp}}(t)$ and $\Psi_i^{\mathrm{tx}}(t)$,
respectively, together with the computing queue size of the server $\Psi_s(t)$. The partial observations
from each device are consolidated into a global observation. This aggregation defines our state space
as follows:
$$s(t) = \left(\lambda(t),\; h(t),\; \Psi^{\mathrm{comp}}(t),\; \Psi^{\mathrm{tx}}(t),\; \Psi_s(t)\right)$$
• Action: Each device must determine both the cut layer at which to perform data inference and the
compression rate to apply. Consequently, the action space encompasses the collective actions of all
devices, resulting in:
$$a(t) = \left(c(t),\; \rho(t)\right) \qquad (8)$$
with $c(t) = (c_1(t), \ldots, c_i(t), \ldots, c_N(t))$ and $\rho(t) = (\rho_1(t), \ldots, \rho_i(t), \ldots, \rho_N(t))$.
• Reward: Since our objective is to optimize the average service delay, we choose to maximize its dual
aspect, known as throughput. In this context, we reward the agent for each data completion within the
system, specifically the total number of completed tasks $G_i(t)$ across all devices:
$$R(t) = \sum_{i=1}^{N} G_i(t) \qquad (9)$$</p>
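        <p>A minimal sketch of how this MDP maps onto Gymnasium spaces is shown below; the dimensions
follow the three-device, three-cut-layer setup of Section 4.1, while the compression-rate count and the
bounds are illustrative assumptions.</p>
        <preformat>
# Sketch of the state and action spaces of Section 3.3 as Gymnasium spaces.
import numpy as np
from gymnasium import spaces

N_DEVICES, N_CUTS, N_RATES = 3, 3, 3  # N_RATES is an assumed value

# Global state: per-device arrival rate, channel gain, computing queue and
# transmission queue sizes, plus the server's computing queue size.
observation_space = spaces.Box(
    low=0.0, high=np.inf, shape=(4 * N_DEVICES + 1,), dtype=np.float32)

# Joint action: one (cut layer, compression rate) pair per device,
# flattened into a single discrete choice for the value-based agent.
action_space = spaces.Discrete((N_CUTS * N_RATES) ** N_DEVICES)
        </preformat>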
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Algorithm description</title>
        <p>In our work, we consider a single agent operating on the server side, which manages two Q networks
(the online and target networks) and maintains a replay memory buffer $\mathcal{B}$ for storing experiences. This
agent is responsible for observing the system state by gathering information from the devices and
the environment, formatting this data in accordance with Section 3.3, and subsequently determining
the actions for each device using an epsilon-greedy policy. This approach involves selecting random
actions with a probability of epsilon to encourage exploration, while also enabling exploitation by
utilizing the online Q network to estimate the Q values for each action combination, ultimately selecting
the action with the highest Q value [20]. Subsequently, the corresponding actions are dispatched to
each device, prompting them to execute these tasks. This includes compressing the input data at the
specified rate and carrying out DNN inference up to the designated cut layer. Afterward, the devices
send the intermediate inference results to the server, which completes the remaining computations and
returns the final results to the respective devices. Once this cycle is completed, each device receives a
reward based on the formulation outlined in Section 3.3, and the results are transmitted to the server
for aggregation. The entire environment then transitions to a new state $s'$. This experience is recorded
in the replay buffer for future use in updating the Q networks. The online Q network is updated once a
sufficient number of experiences has been gathered, specifically a full batch of stored experiences.
This update utilizes the standard Bellman equation, which establishes the relationship between the
Q value of the current state $s$ and that of the subsequent state $s'$. The key distinction in the DDQN
approach lies in applying this equation using the target network, leading to the following formulation:
$$Q_{\mathrm{online}}(s, a \,|\, \theta) \leftarrow (1-\alpha)\cdot Q_{\mathrm{online}}(s, a \,|\, \theta) + \alpha\cdot\left[R(s, a) + \gamma\cdot\max_{a'} Q_{\mathrm{target}}(s', a' \,|\, \theta^{-})\right] \qquad (10)$$
where $\alpha$ is the learning rate and $\gamma$ the discount factor.
Backpropagation is then performed to minimize the mean squared error between the calculated value
and the stored value. After a predetermined number of iterations, determined by a constant $C$, the
parameters of the target network are updated by copying those of the online Q network. To facilitate
the agent’s transition from exploration to increased exploitation, the epsilon value will be gradually
decreased in a linear manner until it reaches a predetermined minimum. This iterative process will
persist for a defined number of training episodes. The algorithm is outlined in Algorithm 1.
Algorithm 1 Double DQN (RL module at the server side)</p>
        <p>Initialize Q-networks $Q_{\mathrm{online}}$ and $Q_{\mathrm{target}}$ for the server with random weights
Initialize replay memory $\mathcal{B}$
Set target network update frequency $C$
Set discount factor $\gamma$
Set exploration parameters $\epsilon$, $\epsilon_{\min}$, $\epsilon_{\mathrm{decay}}$
for each episode $e$ do
    Observe the initial state $s$ from the state tracking module
    for each time step $t$ in $e$ do
        Select action $a$ using the $\epsilon$-greedy policy based on $Q_{\mathrm{online}}$
        Distribute the actions to the devices
        Execute inference tasks
        Receive reward $R$ and observe the next state $s'$
        Store $(s, a, R, s', \mathrm{done})$ in replay memory $\mathcal{B}$
        if length of $\mathcal{B} \geq$ replay batch size then
            Sample a random batch from replay memory $\mathcal{B}$
            for each sample do
                Calculate target Q-values using $Q_{\mathrm{target}}$ and the Bellman equation
                Update $Q_{\mathrm{online}}$ using backpropagation
            end for
        end if
        if $t$ is a multiple of $C$ then
            Update $Q_{\mathrm{target}}$ weights with $Q_{\mathrm{online}}$ weights
        end if
        $\epsilon \leftarrow$ epsilon-decay($\epsilon$, $\epsilon_{\mathrm{decay}}$)
        $s \leftarrow s'$
    end for
end for</p>
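        <p>The following Python fragment sketches the epsilon-greedy selection, the periodic target
synchronization, and the linear epsilon decay of Algorithm 1. It reuses the hypothetical online, target,
and ddqn_loss names from the sketch in Section 3.1; the epsilon schedule and update frequency are
illustrative assumptions.</p>
        <preformat>
# Epsilon-greedy control-loop pieces mirroring Algorithm 1. The names
# online, target and ddqn_loss come from the Section 3.1 sketch;
# epsilon settings and C are illustrative assumptions.
import random
import torch

eps, eps_min, eps_decay = 1.0, 0.05, 1e-4
C = 1000  # target-network update frequency (the constant C above)

def select_action(state, n_actions):
    if random.random() &lt; eps:  # explore with probability epsilon
        return random.randrange(n_actions)
    with torch.no_grad():      # exploit: greedy w.r.t. the online network
        return int(online(state.unsqueeze(0)).argmax(dim=1))

def learn_step(step, optimizer, batch):
    global eps
    loss = ddqn_loss(batch)    # Bellman target via the target network
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % C == 0:          # copy online weights into the target network
        target.load_state_dict(online.state_dict())
    eps = max(eps_min, eps - eps_decay)  # linear decay toward eps_min
        </preformat>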
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <sec id="sec-4-1">
        <title>4.1. Environment</title>
        <p>We developed a simulated environment consisting of a scenario with three edge devices, characterized
by the parameters detailed in Table 2. In this setup, the VGG16 classification model is deployed on
either the devices or the server. We designated three distinct cut layers—specifically, layers 3, 10, and
18—corresponding to distributed pooling layers, and simulated data collection by randomly sampling
from a subset of the ImageNet dataset [21], which consists of 500 images. For the implementation of the
classification model, we utilized TensorFlow [22], which also facilitated the collection of intermediate
output sizes and the number of floating-point operations (FLOPs) through the TensorFlow Profiler. The
agent was designed to comply with the Gymnasium interface [23], while the DDQN algorithm was
implemented using the Stable-Baselines3 framework [24].</p>
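        <p>For concreteness, a minimal sketch of wiring such a simulated environment into Stable-Baselines3
is given below. The environment class name and its placeholder dynamics are hypothetical;
Stable-Baselines3 ships a vanilla DQN implementation, on top of which the Double DQN target of
Eq. (10) is realized in our work, and the hyperparameters shown are illustrative.</p>
        <preformat>
# Hypothetical skeleton wiring a Gymnasium environment into Stable-Baselines3.
import numpy as np
import gymnasium as gym
from gymnasium import spaces
from stable_baselines3 import DQN

class MECInferenceEnv(gym.Env):
    """Placeholder MEC simulator; the real dynamics follow Section 3.2."""
    def __init__(self):
        super().__init__()
        # 4 per-device features x 3 devices + 1 server queue = 13 dims
        self.observation_space = spaces.Box(0.0, np.inf, (13,), np.float32)
        self.action_space = spaces.Discrete(729)  # (3 cuts x 3 rates)^3

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        return np.zeros(13, dtype=np.float32), {}

    def step(self, action):
        obs = np.zeros(13, dtype=np.float32)  # placeholder observation
        reward = 0.0  # real simulator: completed inference tasks this step
        return obs, reward, False, False, {}

model = DQN("MlpPolicy", MECInferenceEnv(), learning_rate=1e-4,
            buffer_size=50_000, exploration_final_eps=0.05)
model.learn(total_timesteps=10_000)
        </preformat>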
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Convergence</title>
        <p>Figure 2a represents the evolution of rewards over successive episodes. Despite some noise, the trend
shows a clear upward trajectory as the episodes progress, indicating that the reinforcement learning
agent is successfully improving its policy, resulting in consistently higher rewards. Figure 2b
illustrates the evolution of the loss function, showing a decreasing trend. This indicates that the neural
networks are successfully training and adapting well to the accumulated experiences. Through these
experiments, we demonstrated the convergence and learning progress of the DDQN algorithm.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Comparison</title>
        <p>
          In this section, we aim to evaluate the performance of our reinforcement learning approach compared
to static configurations, as well as to benchmark it against existing state-of-the-art RL solutions.
We use several performance metrics, including the evolution of the episodic cumulative reward, the
latency—represented by the average service delivery delay derived from the throughput (i.e., the number
of completed inference tasks per episode)—and energy consumption resulting from computation and
transmission. The visible plots represent values averaged over 10 episodes. To achieve the first objective,
we compare our approach with two benchmarks: "edge," where the inference is performed entirely
on the devices, and "central," where the inference is fully offloaded to the edge server. For the second
objective, we benchmark against the actor-critic approach (A2C), commonly used in state-of-the-art
works [
          <xref ref-type="bibr" rid="ref13 ref14 ref9">9, 14, 13</xref>
          ], as well as the widely adopted DQN [
          <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
          ]. The results are presented in Figure 3.
        </p>
        <p>One notable observation is the marginal superiority of the central approach over the edge solution in
terms of system throughput. However, this advantage is contingent upon favorable network conditions,
where server performance in computing becomes the guiding factor. As anticipated, the edge solution
exhibits stable behavior across varying network conditions. Furthermore, the reinforcement learning
solutions (DDQN, DQN and A2C) demonstrate superior adaptability after sufficient training episodes,
outperforming the former approaches (edge and central).</p>
<p>For the comparison of reinforcement learning approaches, the results demonstrate that the value-based
methods (DQN and DDQN) both outperform the A2C approach in terms of learning behavior, showing
greater stability and higher reward values. Additionally, these approaches achieve better performance
metrics, such as increased throughput, which leads to reduced latency. Moreover, the DDQN approach
exhibits more stability compared to DQN, as observed in the results. This highlights the superiority
of the DDQN algorithm over other deep reinforcement learning methods. This experiment serves as
empirical validation of the findings in [16] within the context of DNN inference offloading, showing
that DDQN improves Q-value estimation stability by reducing overestimation bias.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
<p>In this work, we tackled the challenging problem of DNN inference offloading using a reinforcement
learning approach, specifically leveraging the DDQN algorithm due to its robust capabilities. Our results
demonstrated its competitive performance against state-of-the-art solutions. Future research could
focus on refining the MDP design by incorporating additional state and action elements, enhancing
the system’s adaptability. Additionally, shaping the reward function to encompass a broader range of
objectives, such as more efficient resource management or explicit energy optimization, offers promising
directions for further improving the system’s performance.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI and AI-assisted Technologies</title>
      <p>During the preparation of this work, the authors used OpenAI's ChatGPT (GPT-4-turbo, February
2024 version) to check grammar and spelling and to refine the language. After using this service, the
authors reviewed and edited the content as needed and take full responsibility for the publication's
content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G. S.</given-names>
            <surname>Nadella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Satish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Meduri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Meduri</surname>
          </string-name>
          ,
          <article-title>A systematic literature review of advancements, challenges and future directions of ai and ml in healthcare</article-title>
          ,
          <source>International Journal of Machine Learning for Sustainable Development</source>
          <volume>5</volume>
          (
          <year>2023</year>
          )
          <fpage>115</fpage>
          -
          <lpage>130</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kondam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yella</surname>
          </string-name>
          ,
          <article-title>Artificial intelligence and the future of autonomous systems</article-title>
          ,
          <source>Innovative Computer Sciences Journal</source>
          <volume>9</volume>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Herath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mittal</surname>
          </string-name>
          ,
          <article-title>Adoption of artificial intelligence in smart cities: A comprehensive review</article-title>
          ,
          <source>International Journal of Information Management Data Insights</source>
          <volume>2</volume>
          (
          <year>2022</year>
          )
          <fpage>100076</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>E.</given-names>
            <surname>Baccour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Mhaisen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Abdellatif</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Erbad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hamdi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Guizani</surname>
          </string-name>
          ,
          <article-title>Pervasive ai for iot applications: A survey on resource-eficient distributed artificial intelligence</article-title>
          ,
          <source>IEEE Communications Surveys &amp; Tutorials</source>
          <volume>24</volume>
          (
          <year>2022</year>
          )
          <fpage>2366</fpage>
          -
          <lpage>2418</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>A survey on internet of things: Architecture, enabling technologies, security and privacy, and applications</article-title>
          ,
          <source>IEEE internet of things journal 4</source>
          (
          <year>2017</year>
          )
          <fpage>1125</fpage>
          -
          <lpage>1142</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Surianarayanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Lawrence</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. R.</given-names>
            <surname>Chelliah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Prakash</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hewage</surname>
          </string-name>
          ,
          <article-title>A survey on optimization techniques for edge artificial intelligence (ai</article-title>
          ),
          <source>Sensors</source>
          <volume>23</volume>
          (
          <year>2023</year>
          )
          <fpage>1279</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.-Y.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ko</surname>
          </string-name>
          ,
          <article-title>Distributed split computing system in cooperative internet of things (iot)</article-title>
          ,
          <source>IEEE Access</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Matsubara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Levorato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Restuccia</surname>
          </string-name>
          ,
          <article-title>Split computing and early exiting for deep learning applications: Survey and research challenges</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>55</volume>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>30</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Y. Wu,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Reinforcement learning based energyeficient collaborative inference for mobile edge computing</article-title>
          ,
          <source>IEEE Transactions on Communications</source>
          <volume>71</volume>
          (
          <year>2022</year>
          )
          <fpage>864</fpage>
          -
          <lpage>876</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Department</surname>
          </string-name>
          , Iot: Number of connected devices worldwide 2012-
          <fpage>2025</fpage>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>K.</given-names>
            <surname>Qu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          , W. Wu,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <article-title>Stochastic cumulative dnn inference with rl-aided adaptive iot device-edge collaboration</article-title>
          ,
          <source>IEEE Internet of Things Journal</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>E.</given-names>
            <surname>Baccour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Erbad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hamdi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Guizani</surname>
          </string-name>
          , Rl-distprivacy:
          <article-title>Privacy-aware distributed deep inference for low latency iot systems</article-title>
          ,
          <source>IEEE Transactions on Network Science and Engineering</source>
          <volume>9</volume>
          (
          <year>2022</year>
          )
          <fpage>2066</fpage>
          -
          <lpage>2083</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>W.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <article-title>Accuracy-guaranteed collaborative dnn inference in industrial iot via deep reinforcement learning</article-title>
          ,
          <source>IEEE Transactions on Industrial Informatics</source>
          <volume>17</volume>
          (
          <year>2020</year>
          )
          <fpage>4988</fpage>
          -
          <lpage>4998</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>W.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>Task ofloading decision-making algorithm for vehicular edge computing: A deep-reinforcement-learning-based approach</article-title>
          ,
          <source>Sensors</source>
          <volume>23</volume>
          (
          <year>2023</year>
          )
          <fpage>7595</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>I.</given-names>
            <surname>Rahmati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shah-Mansouri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Movaghar</surname>
          </string-name>
          ,
          <article-title>Qoco: A qoe-oriented computation ofloading algorithm based on deep reinforcement learning for mobile edge computing</article-title>
          ,
          <source>arXiv preprint arXiv:2311.02525</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] H. Van Hasselt, A. Guez, D. Silver, Deep reinforcement learning with double q-learning, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] Q. Zhang, M. Lin, L. T. Yang, Z. Chen, S. U. Khan, P. Li, A double deep q-learning model for energy-efficient edge scheduling, IEEE Transactions on Services Computing 12 (2019) 739-749. doi:10.1109/TSC.2018.2867482.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] A. Iqbal, M.-L. Tham, Y. C. Chang, Double deep q-network-based energy-efficient resource allocation in cloud radio access network, IEEE Access 9 (2021) 20440-20449. doi:10.1109/ACCESS.2021.3054909.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] A. Beloglazov, R. Buyya, Optimal online deterministic algorithms and adaptive heuristics for energy and performance efficient dynamic consolidation of virtual machines in cloud data centers, Concurrency and Computation: Practice and Experience 24 (2012). URL: https://api.semanticscholar.org/CorpusID:10061036.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] O. Berger-Tal, J. Nathan, E. Meron, D. Saltz, The exploration-exploitation dilemma: a multidisciplinary framework, PLoS ONE 9 (2014) e95693.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., Imagenet large scale visual recognition challenge, International Journal of Computer Vision 115 (2015) 211-252.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, X. Zheng, TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL: https://www.tensorflow.org/. Software available from tensorflow.org.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] M. Towers, A. Kwiatkowski, J. Terry, J. U. Balis, G. De Cola, T. Deleu, M. Goulão, A. Kallinteris, M. Krimmel, A. KG, et al., Gymnasium: A standard interface for reinforcement learning environments, arXiv preprint arXiv:2407.17032 (2024).</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] A. Raffin, A. Hill, A. Gleave, A. Kanervisto, M. Ernestus, N. Dormann, Stable Baselines3, https://github.com/DLR-RM/stable-baselines3, 2020.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>