<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
<article-title>Reinforcement Learning for Longitudinal Control of CAVs in Mixed Traffic Environments with Unreliable Communication</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shuncheng Cai</string-name>
          <email>cais@tcd.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Melanie Bouroche</string-name>
          <email>melanie.bouroche@tcd.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Workshop</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Computer Science and Statistics, Trinity College Dublin</institution>
          ,
          <addr-line>Dublin</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Longitudinal control is a crucial aspect of autonomous vehicle operation, which aims to ensure safe spacing between vehicles by controlling their speed. This task is particularly challenging due to its stringent timeliness requirements. Connected and Autonomous Vehicles (CAVs) share information about their position and speed to improve control decisions, but inter-vehicle communication cannot be assumed to be reliable in realistic conditions. Multi-agent Deep Reinforcement Learning (MADRL) has been proposed to address this issue and ensure the traffic flow's safety and efficiency. However, existing MADRL approaches face significant challenges in maintaining system stability and adaptability, especially under unreliable communication conditions in mixed traffic environments. This paper proposes a longitudinal control strategy for CAVs in mixed traffic environments based on MADRL, aimed at stabilizing platoon control under unreliable communication conditions. Specifically, vehicle-to-vehicle (V2V) communication with varying levels of packet loss is incorporated into the DRL training environment to replicate communication scenarios that may be encountered in the real world. A distribution-based mapping mechanism is designed to smooth the action selection space of the agents. The proposed algorithm is compared with baseline models through simulation experiments, and its control performance and adaptability are further validated under different packet loss levels.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In recent years, longitudinal control strategies for Connected and Autonomous Vehicles (CAVs)
have been explored extensively, with work focusing primarily on Model Predictive Control
(MPC)-based strategies and Deep Learning (DL) approaches, particularly Deep Reinforcement Learning
(DRL). Both methods have unique advantages under varying communication reliability conditions, and
they also encounter specific challenges.</p>
      <p>
        MPC-based strategies excel in optimizing multiple objectives within a flexible framework by predicting
future vehicle states and dynamically adjusting driving behavior [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. As demonstrated by Liu
et al., MPC-based strategies effectively automate path planning in structured driving environments,
highlighting the precision of these methods in real-world applications [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. However, MPC typically
requires the optimization problem to be well-defined and solvable within a short time frame, which
can impose significant computational demands depending on the complexity of the formulation, making
real-time implementation challenging [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Additionally, MPC’s performance relies heavily on the
system model’s accuracy, and any discrepancies can significantly degrade control effectiveness. Finally,
computational delays can adversely affect control performance in fast-changing dynamic environments,
potentially leading to instability [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ].
      </p>
      <p>
        DRL-based strategies have emerged as a robust alternative, particularly advantageous in environments
characterized by stochastic factors and partial observability. These strategies shift the computational
burden to the offline phase, allowing for rapid online implementation as turnkey solutions [
        <xref ref-type="bibr" rid="ref10 ref7 ref8 ref9">7, 8, 9, 10</xref>
        ].
      </p>
      <p>
        The flexibility of DRL enables it to adapt to dynamic environments without the need for deterministic
system dynamics, thus enhancing the controller’s responsiveness to real-time changes. Kiran et al.
survey various implementations of DRL in autonomous driving, confirming its effectiveness in learning
complex policies for dynamic and unpredictable environments [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. He et al. demonstrate a novel
application of DRL for managing multiple vehicle subsystems, showcasing the method’s capacity to
adapt seamlessly to real-time network conditions without requiring deterministic inputs [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>However, most existing works focus on controlling and studying CAV platoons and do not address
mixed environments. Additionally, in mixed traffic environments, the mechanisms designed to smooth
the action selection space of agents are frequently inadequate, leading to abrupt or suboptimal
control actions. Finally, the communication between CAVs is typically assumed to be perfect, which is
unrealistic.</p>
      <p>
        In mixed traffic, longitudinal control presents unique challenges regarding the interaction between
CAVs and different types of vehicles, such as human-driven vehicles (HDVs). Two major challenges
are the heterogeneity of vehicle platoons and the random and uncertain behavior of HDVs [
        <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
        ].
An adaptive leader-following control method for heterogeneous platoons, proposed by Li and Sun,
performs well under ideal conditions but lacks responsiveness and adaptability in complex, dynamic
environments [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. A robust adaptive cruise control method for heterogeneous platoons with uncertain
dynamics, developed by Wang and Zhang, can result in abrupt and discontinuous control actions when
handling sudden events and dynamic changes [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Although these studies have made progress in
addressing HDV uncertainties and mixed traffic heterogeneity, they still fall short in mitigating abrupt or
suboptimal control actions in decision-making. This inadequacy can cause disruptions to surrounding
HDVs, increasing the risk of traffic accidents.
      </p>
      <p>
        Packet loss poses a major obstacle in realistic V2V communication simulations. It arises when
transmitted packets fail to reach their intended destination due to various factors, including signal
degradation, network congestion, or errors in data transmission. Packet loss in a CAV environment can
result in outdated or incorrect information regarding vehicle states, which may have an impact on the
overall traffic flow and safety [
        <xref ref-type="bibr" rid="ref17 ref18">17, 18</xref>
        ]. While DRL has demonstrated significant potential in optimizing
multi-agent systems, it has scarcely been applied to packet loss scenarios. Shi et al. investigated
unreliable communication by incorporating the Signal-to-Interference-plus-Noise Ratio (SINR) into agents.
However, this study did not specifically address packet loss scenarios, focusing instead on signal quality
only [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
      </p>
      <p>Our goal is to create an integrated and robust DRL distributed control technique that focuses on
stabilizing longitudinal control in mixed traffic scenarios under unreliable communication conditions.
To achieve this purpose, our DRL framework introduces two novel features:
• Optimizing the Multi-Agent Proximal Policy Optimization (MAPPO) algorithm to enhance the
coordination and efficiency of CAVs in mixed environments and under imperfect communication
conditions. Specifically, we designed a reward function by weighted aggregation of CAV
information and factors affecting CAV driving efficiency and safety, thereby balancing the
impact of individual vehicle states in mixed environments. This method not only optimizes the
decision-making process of individual vehicles but also significantly enhances the robustness of
the overall control strategy, particularly in addressing packet loss.
• Developing a distribution-based mapping mechanism to smooth the action selection space of
the agents. This mechanism mitigates abrupt or suboptimal control actions by ensuring more
continuous and adaptive decision-making processes. Integrating this mechanism aims to
improve the coordination and responsiveness of the CAVs under dynamic traffic conditions and
communication disruptions.</p>
      <p>The remainder of the paper is structured as follows. A review of the state of the art based on
longitudinal control modules is presented in Section 2. The CAV longitudinal control framework and
the proposed MADRL algorithm are described in Section 3. The numerical experiments validating the
proposed CAV longitudinal control strategy are presented in Section 4. Finally, Section 5 gives the
conclusion of our work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. State of the Art</title>
      <p>
        Progress in reinforcement learning has resulted in the creation of Proximal Policy Optimization (PPO),
which has made substantial enhancements over previous methods by employing a policy gradient
approach to handle continuous domains better [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. While PPO has shown promise in improving the
performance of individual vehicles, its efficacy diminishes in situations involving multiple agents or
in mixed traffic environments that consist of both CAVs and HDVs [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. One of the challenges is the
tendency of PPO to converge to suboptimal policies in the presence of multiple interacting agents,
leading to non-stationarity [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>
        Expanding upon the capabilities of PPO, the introduction of MAPPO aimed to tackle the intricacies of
environments that involve multiple interacting agents by incorporating an understanding of agent
interdependencies. This understanding is crucial for effective platoon control in CAV systems [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. Despite
the progress made, MAPPO and other RL strategies still face challenges in dealing with communication
inconsistencies and non-stationarities commonly encountered in real-world deployments. Zhang et al.
investigated the use of MAPPO for coordinated longitudinal control of CAVs in platooning scenarios,
demonstrating its effectiveness in maintaining stable inter-vehicle distances and improving traffic flow
efficiency [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. Another study by Chen et al. applied MAPPO to enhance the decision-making process
of CAVs during highway merging, showing that the approach can significantly reduce merging times
and improve overall traffic safety [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. These studies highlight the advancements in applying MAPPO
to longitudinal control of CAVs, contributing to improved traffic management and safety in autonomous
driving scenarios. However, further research is necessary to enhance the robustness and adaptability of
these strategies in diverse and dynamic real-world environments.
      </p>
      <p>
        The research conducted by Shi et al. introduces a methodology that incorporates realistic data and
the signal-to-interference-plus-noise ratio (SINR) into a DRL framework under mixed traffic environments [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
However, further investigation into the effects of packet loss and the use of MADRL techniques would
be beneficial. These additions would enhance adaptability and robustness in decentralized environments
where reliable communication cannot be assumed.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>
        This paper presents a control framework that incorporates a CAV control strategy using MAPPO
and a distribution-based mapping action selection mechanism to enhance CAV driving behavior and
mitigate traffic oscillations in real-world scenarios. Specifically, the advantages of the distribution-based
mapping action selection mechanism in continuous control tasks have been substantiated in
recent RL literature [
        <xref ref-type="bibr" rid="ref23 ref24">23, 24</xref>
        ]. This framework is designed to adapt to the ever-changing conditions of
the communication environment, allowing CAVs to effectively maintain stable and efficient driving
behaviors even in the face of communication uncertainties.
      </p>
      <p>The section begins with an overview of the environment settings, before detailing the CAV
longitudinal control framework and the mapping-based action selection mechanism.</p>
      <sec id="sec-3-1">
        <title>3.1. Environment Model</title>
        <p>The simulation is centered around a car-following scenario on an infinitely long straight highway
without lateral movement. The vehicle platoon consists of CAVs controlled by MAPPO agents and
HDVs that follow the Krauss model, a widely used microscopic traffic model that simulates the
longitudinal driving behavior of human drivers, including acceleration, deceleration, and maintaining
safe distances. The model is based on the following assumptions:
• Every CAV can transmit its latest state information, including location and speed, to other CAVs
in the platoon through V2V communication. However, due to packet loss, this information may
not always be successfully received.
• The communication time is considered to have a negligible impact on the simulation process in
our study.</p>
        <p>To illustrate how the proposed control framework handles communication interruptions, consider
that each CAV in the simulation maintains a state vector representing the latest valid data received from
upstream vehicles. For instance, if the state vector for vehicle $i$ is represented as $\mathbf{v}_i = [s_{i,1}, s_{i,2}, \ldots, s_{i,k}]$,
where each $s_{i,j}$ is the state information from vehicle $i-j$, the vector updates based on the last successfully
received data before a packet loss occurs.</p>
        <p>In this scenario, the control strategy adjusts the vehicle’s actions based on the most recent and
reliable data. For example, suppose vehicle $i$ experiences packet loss, and the last known states before
the loss were $v = 30$ m/s and $p = 100$ m. The MAPPO algorithm will use these last valid
values to calculate the next best action until updated, reliable data is available. This approach simulates
real-world V2V dynamics, ensuring control decisions are based on accurate information and enhancing
CAV platoon safety and efficiency.</p>
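        <p>A minimal sketch of this last-valid-state update under random packet drops is shown below. The class and parameter names (StateCache, loss_rate) are illustrative assumptions rather than identifiers from the paper; drops are modeled as independent Bernoulli events at the loss rates used later in the experiments (10%, 30%, 50%).</p>
        <preformat>
import random

class StateCache:
    """Keeps the last successfully received state of each upstream vehicle."""

    def __init__(self, loss_rate: float):
        self.loss_rate = loss_rate   # packet loss probability, e.g. 0.1, 0.3, 0.5
        self.last_valid = {}         # vehicle id -> (velocity, position)

    def receive(self, veh_id: str, velocity: float, position: float) -> None:
        # A packet is dropped with probability loss_rate; on a drop the cache
        # silently keeps the previously received state for that vehicle.
        if random.random() >= self.loss_rate:
            self.last_valid[veh_id] = (velocity, position)

    def state_of(self, veh_id: str):
        # Most recent valid (velocity, position), e.g. (30.0, 100.0), which the
        # controller keeps using until fresher data arrives.
        return self.last_valid.get(veh_id)
        </preformat>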
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Longitudinal Control Scheme</title>
        <p>Based on the environment setting, this part describes the control scheme of the proposed strategy,
focusing on the regulation of CAVs’ longitudinal control using the MAPPO-based approach. We
begin by outlining the design of the DRL framework, which includes the definitions of the
four fundamental elements of DRL: state, action, policy, and reward. Following this, we detail the
implementation of the MAPPO algorithm, focusing on centralized training and decentralized execution.
Finally, the training process for the MAPPO model is described. The related notations are defined in
Table 1.</p>
        <p>Table 1: Notation.
$N$: Total number of CAVs and therefore of agents.
$a_i^t$: Acceleration or deceleration action of CAV $i$ at time $t$.
$v_i^t$: Velocity of CAV $i$ at time $t$.
$p_i^t$: Position of CAV $i$ at time $t$.
$d_i^t$: Distance between CAV $i$ and the preceding vehicle at time $t$.
$\mu_i^t$, $\sigma_i^t$: NN-predicted mean and standard deviation for CAV $i$ at time $t$.
$\mathcal{N}(\mu_i^t, (\sigma_i^t)^2)$: Normal distribution for action $a_i^t$.
$a_{\max}$: Scaling factor for action values.
$L(\theta)$: PPO’s clipped objective function.
$\hat{\mathbb{E}}_t$: Expectation over time.
$r_i^t(\theta)$: Probability ratio under new and old policies for CAV $i$ at time $t$.
$\epsilon$: Clipping range hyperparameter.
$w_1$, $w_2$: Weights for distance metrics.
$d_{\mathrm{desired},i}^t$, $d_{\mathrm{safe},i}^t$: Desired and safe distances for CAV $i$ at time $t$.</p>
        <p>3.2.1. DRL Design</p>
        <p>DRL can be recognized as a Markov decision process, consisting of four elements: state, action, policy,
and reward. The definition is as follows. The state representation $s^t$ in the DRL framework captures each
CAV’s velocity and position:
$$s^t = [v_1^t, p_1^t, v_2^t, p_2^t, \ldots, v_N^t, p_N^t] \tag{1}$$
where the velocity and position of the $i$-th CAV at time $t$ are represented by $v_i^t$ and $p_i^t$, respectively.
The action $a_i^t$ represents the acceleration or deceleration for CAV $i$ at each timestep. It is defined as:
$$a_i^t \in \mathbb{R} \tag{2}$$</p>
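        <p>As a concrete illustration of Eqs. (1)-(2), the state can be assembled by interleaving the per-vehicle (velocity, position) pairs; the helper name below is hypothetical, not from the paper.</p>
        <preformat>
import numpy as np

def build_state(velocities, positions):
    """Interleave (v_i, p_i) pairs into the global state s^t of Eq. (1)."""
    return np.array([x for v, p in zip(velocities, positions) for x in (v, p)],
                    dtype=np.float32)

# Example: three CAVs at 10 m/s spaced 20 m apart.
s_t = build_state([10.0, 10.0, 10.0], [40.0, 20.0, 0.0])
# s_t = [10., 40., 10., 20., 10., 0.]; each action a_i^t is a real scalar (Eq. 2).
        </preformat>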
        <p>
Specifically, we suggest implementing a method for mapping actions. In contrast to previous authors,
such as Shi et al. [<xref ref-type="bibr" rid="ref19">19</xref>], we put forward a strategy for action selection that is based on distribution. This
is because, in practical scenarios, it is challenging to significantly alter the acceleration within a brief
time frame. In contrast to the value-based selection mechanism commonly proposed in research
publications, the distribution-based mapping approach offers a more seamless acceleration and is more
adept at managing uncertainties in real-world scenarios. This results in more resilient decision-making
in the face of uncertainty. Specifically, the actions are sampled from a normal distribution, which is
parameterized by neural network outputs that predict the mean and standard deviation based on the
current state [<xref ref-type="bibr" rid="ref25 ref26">25, 26</xref>]:
$$\mu_i^t, \sigma_i^t = \mathrm{NN}(s_i^t) \tag{3}$$
$$a_i^t \sim \mathcal{N}(\mu_i^t, (\sigma_i^t)^2) \tag{4}$$
Here, $\mu_i^t$ and $\sigma_i^t$ represent the mean and standard deviation of the action distribution for CAV $i$ at time $t$,
respectively. To ensure that the action values are within a feasible range, a hyperbolic tangent function
($\tanh$) is applied, followed by scaling with a predefined mapping factor:
$$a_i^t = a_{\max} \cdot \tanh\left(\mathcal{N}(\mu_i^t, (\sigma_i^t)^2)\right) \tag{5}$$
This scaling transformation allows the action values to be specifically tailored to the dynamic range
required for optimal vehicle control, enhancing the flexibility and precision of the system.
        </p>
        <p>
The reward function $r(s^t, a^t)$ is designed to incentivize driving behaviors that improve safety and
traffic flow efficiency:
$$r(s^t, a^t) = \sum_{i=0}^{N}\left(-w_1 \cdot \left|d_i^t - d_{\mathrm{desired},i}^t\right| - w_2 \cdot \max\left(0,\ d_{\mathrm{safe},i}^t - d_i^t\right)\right) \tag{6}$$
This function integrates multiple elements to guarantee optimal vehicle spacing while simultaneously
imposing penalties for maintaining hazardous gaps. By employing this control scheme, CAVs are able
to prioritize safety while also maintaining optimal efficiency.
        </p>
        <p>
The policy $\pi(a_i^t)$ is updated within the PPO framework using the clipped surrogate objective:
$$L(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_i^t(\theta)\hat{A}_t,\ \mathrm{clip}\left(r_i^t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right] \tag{7}$$
where $r_i^t(\theta)$ is the ratio of the probabilities under the new and old policies for the taken action, and $\epsilon$ is a
hyperparameter defining the clipping range, which helps control the magnitude of policy updates and
ensures training stability.
        </p>
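        <p>The sketch below shows one way Eqs. (3)-(6) compose in code. The network layout and helper names (PolicyNet, select_action, reward) are illustrative assumptions; a_max plays the role of the mapping factor in Eq. (5), set here to 3.0 to match the acceleration range used in the experiments.</p>
        <preformat>
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Outputs the mean and standard deviation of Eq. (3) for one CAV."""

    def __init__(self, state_dim: int, hidden: int = 64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh())
        self.mu_head = nn.Linear(hidden, 1)
        # State-independent log-std is a common simplification, assumed here.
        self.log_sigma = nn.Parameter(torch.zeros(1))

    def forward(self, state: torch.Tensor):
        return self.mu_head(self.body(state)), self.log_sigma.exp()

def select_action(net: PolicyNet, state: torch.Tensor, a_max: float = 3.0):
    """Sample from N(mu, sigma^2) (Eq. 4), then squash and scale (Eq. 5)."""
    mu, sigma = net(state)
    raw = torch.distributions.Normal(mu, sigma).rsample()  # reparameterized
    return a_max * torch.tanh(raw)

def reward(d, d_desired, d_safe, w1: float = 1.0, w2: float = 10.0):
    """Weighted spacing penalty of Eq. (6), summed over all CAVs."""
    return sum(-w1 * abs(di - dd) - w2 * max(0.0, ds - di)
               for di, dd, ds in zip(d, d_desired, d_safe))
        </preformat>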
        <p>3.2.2. MAPPO Algorithm</p>
        <p>
MAPPO offers a decentralized perspective for organized oversight and effective and secure operation
of CAV platoon management, particularly in situations where CAVs possess limited local observation
capabilities. The MAPPO architecture utilizes the actor and critic components of the PPO algorithm to
implement a centralized training and decentralized execution structure [<xref ref-type="bibr" rid="ref27">27</xref>]. Therefore, every agent
undergoes two distinct stages: centralized training and decentralized execution. It is important to
understand that the training process occurs offline, and exploration is employed
to discover the most effective policy. During the execution phase, the policy is propagated forward
without introducing any randomization into the exploration process. In the following sections, we
demonstrate the process of training the MAPPO model in a centralized manner and then implementing
the trained model in a decentralized fashion. To show this, we use one of the $N$ agents as an example.
        </p>
        <p>
During the centralized training process, the critic assumes the role of the central coordinator and
calculates the centralized action-value function $Q(s^t, a^t \mid \omega)$ using the global state information, which
includes the actions and observations of all agents. The centralized Q function assesses the actor’s
activities from a comprehensive viewpoint and directs it toward selecting better actions. The critic
network then updates its parameters by minimizing the loss:
$$\mathrm{loss}(\omega) = \mathbb{E}\left[\left(Q(s^t, a^t \mid \omega) - y^t\right)^2\right] \tag{8}$$
where
$$y^t = r^t + \gamma\, Q(s', a' \mid \omega') \tag{9}$$
Here $s^t = (s_1^t, s_2^t, \ldots, s_N^t)$, $a^t = (a_1^t, a_2^t, \ldots, a_N^t)$, and $\omega$ are the parameters of the critic network; $s'$ denotes
the updated state of the target network and $\omega'$ is the updated parameter of the evaluation network.
        </p>
        <p>
On the other hand, the actor network updates the network parameters $\theta$ and outputs actions based on
the centralized $Q$ function computed by the critic network and its own observations. Specifically, the
actor network adjusts the network parameters $\theta$ directly in the direction of $\nabla_\theta J(\theta)$. The global state
makes value learning faster and easier: by conditioning a centralized value function on it, a partially
observable MDP is effectively turned into a fully observable one:
$$\nabla_\theta J(\theta) \approx \mathbb{E}\left[\nabla_a Q(s, a \mid \omega)\, \nabla_\theta \pi_\theta(s)\right]$$
        </p>
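        <p>A condensed sketch of this centralized-critic update follows, reusing the PolicyNet above. The CentralCritic class, its layer sizes, and the use of a separate target network are illustrative assumptions consistent with Eqs. (8)-(9), not details given in the paper.</p>
        <preformat>
import torch
import torch.nn as nn

class CentralCritic(nn.Module):
    """Q(s, a | omega): consumes the joint state and joint action of all agents."""

    def __init__(self, joint_state_dim: int, n_agents: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(joint_state_dim + n_agents, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, joint_state, joint_action):
        return self.net(torch.cat([joint_state, joint_action], dim=-1))

def critic_update(critic, target_critic, opt, batch, gamma: float = 0.99):
    """One minimization step of Eq. (8) with the target of Eq. (9)."""
    s, a, r, s_next, a_next = batch                      # joint tensors
    with torch.no_grad():
        y = r + gamma * target_critic(s_next, a_next)    # Eq. (9)
    loss = ((critic(s, a) - y) ** 2).mean()              # Eq. (8)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
        </preformat>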
        <p>During the decentralized execution process, the critic network is omitted, and only the trained
actor network operates online. Decentralized executors rely solely on local observations from CAVs to
make choices. The whole execution process requires a single forward propagation per step,
resulting in significant reductions in time and computational resource consumption compared to the
training phase. Within a group of actors trained in this way, each actor is able
to take an action that is very close to the best possible one, even without having knowledge of
the actions taken by other actors.</p>
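        <p>At execution time the exploration noise is switched off and each agent runs one deterministic forward pass on its local observation; a minimal sketch, reusing the PolicyNet assumed earlier, is:</p>
        <preformat>
import torch

@torch.no_grad()
def execute(net, local_state, a_max: float = 3.0):
    """Decentralized execution: a single forward pass, mean action only."""
    mu, _ = net(local_state)
    return a_max * torch.tanh(mu)
        </preformat>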
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Distributed Action Mapping Mechanism</title>
        <p>
Our implementation enhances the robustness and adaptability of the action selection process by utilizing
a distribution-based approach rather than directly obtaining single action values. The policy network,
parameterized by $\theta$, outputs the parameters of a Gaussian distribution: the mean $\mu(s_i^t \mid \theta)$ and standard
deviation $\sigma(s_i^t \mid \theta)$, where $s_i^t$ represents the state input of CAV $i$ at time $t$. The action $a_i^t$ is sampled
from this distribution:
$$a_i^t \sim \mathcal{N}\left(\mu(s_i^t \mid \theta),\ \sigma(s_i^t \mid \theta)\right)$$
This approach allows the agent to explore a range of possible actions instead of being restricted to
deterministic policy outputs. The sampled action is then transformed using a hyperbolic tangent function
to map values to the desired range:
$$a_i^{t\,\prime} = \tanh(a_i^t)$$
This bounds the action values within $[-1, 1]$, ensuring feasible control actions. Mathematically, this
can be represented as:
$$a_i^{t\,\prime} = \tanh\left(\mu(s_i^t \mid \theta) + \sigma(s_i^t \mid \theta) \cdot \xi\right)$$
where $\xi \sim \mathcal{N}(0, 1)$ is a noise term to facilitate exploration. This distribution-based approach
provides several advantages: balanced exploration and exploitation, robustness against uncertainties and
environmental variations, and effective end-to-end training via the reparameterization trick.
        </p>
        <p>
The stochastic policy $\pi(a_i^t \mid s_i^t, \theta)$ is expressed as:
$$\pi(a_i^t \mid s_i^t, \theta) = \frac{1}{\sigma(s_i^t \mid \theta)\sqrt{2\pi}} \exp\left(-\frac{\left(a_i^t - \mu(s_i^t \mid \theta)\right)^2}{2\,\sigma(s_i^t \mid \theta)^2}\right)$$
The transformed action $a_i^{t\,\prime}$ ensures a smooth, bounded action space, enhancing stability and control
in real-world scenarios.
        </p>
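        <p>A sketch of this reparameterized, tanh-squashed sampling is given below. The log-probability correction for the tanh change of variables is the standard one for squashed Gaussian policies and is included as an assumption, since the paper does not spell it out.</p>
        <preformat>
import torch

def sample_squashed(mu: torch.Tensor, sigma: torch.Tensor):
    """a' = tanh(mu + sigma * xi), with xi ~ N(0, 1), sampled differentiably."""
    xi = torch.randn_like(mu)   # exploration noise xi ~ N(0, 1)
    raw = mu + sigma * xi       # Gaussian sample, differentiable in mu, sigma
    a = torch.tanh(raw)         # bound the action into [-1, 1]
    normal = torch.distributions.Normal(mu, sigma)
    # Change of variables: log pi(a') = log N(raw) - log(1 - tanh(raw)^2)
    log_prob = normal.log_prob(raw) - torch.log(1.0 - a.pow(2) + 1e-6)
    return a, log_prob
        </preformat>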
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments and Evaluation</title>
      <p>In this section, we conduct a series of numerical experiments to evaluate the effectiveness of the
proposed distribution-based MAPPO control strategy for CAV platoon management. The experiments
are designed to compare the performance of the MAPPO strategy with distribution-based and
value-based PPO strategies under varying conditions of communication reliability. We utilize SUMO as the
traffic simulation platform to create controlled scenarios and analyze key metrics such as velocity,
acceleration, vehicle spacing, and reward values. Additionally, we investigate the impact of different
packet loss rates on the relative congestion index (RCI) to assess the robustness and generalizability of
the control algorithms. A significant number of studies have examined mixed traffic efficiency utilizing
the travel rate and congestion index as primary performance metrics. The travel rate indicates the duration
a vehicle needs to traverse a certain segment of the road network, whereas the RCI provides a more
precise assessment of traffic flow conditions. An RCI score exceeding 2 signifies extremely heavy traffic
congestion.</p>
      <sec id="sec-4-1">
        <title>4.1. Evaluation Scenario</title>
        <p>At the onset of the experiment, we establish a less complex scenario as a preliminary stage. The
experiment utilizes SUMO (Simulation of Urban MObility) as the traffic simulation platform to create
a scenario consisting of a single-lane, intersection-free, infinitely long straight road. This scenario
is specifically created to accurately replicate the behavior of vehicles traveling in formation. The
vehicle formation consists of six vehicles, categorized into two types: HDVs and CAVs, as shown in
Figure 1. The HDV serves as the lead vehicle and is formally represented by SUMO’s Krauss model. It
possesses a driving behavior index (sigma) of 0.7, indicating a high level of randomness in its behavior.
To differentiate it from the other vehicles, the HDV is highlighted in red. All five following vehicles
are CAVs and are visibly designated with the color yellow; listed in order from left to right, they are
veh0, veh1, veh2, veh3, and veh4, with the HDV at the head of the platoon. All vehicles
follow an identical path and are arranged in a straight line along the road. All vehicles initially have
a velocity of 10 m/s and are positioned sequentially with 20-meter gaps between them. The allowed
range of accelerations is between -3 m/s² and +3 m/s², and the maximum velocity is capped at 30 m/s. The
simulation begins at time 0 seconds and continues until 60 seconds. Each simulation step is set to
0.1 seconds to ensure accurate data collection and real-time performance, allowing the algorithm to
respond effectively to dynamic changes in the simulation environment. Fluctuating alterations occur as
the scenario progresses.</p>
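        <p>For reference, a minimal TraCI sketch of this scenario follows. The configuration file name, route ID, and vehicle/type identifiers are assumptions for illustration; only the scenario parameters (Krauss leader with sigma 0.7, six vehicles, 20 m gaps, 10 m/s initial speed, 0.1 s steps over 60 s) come from the text above.</p>
        <preformat>
import traci  # SUMO's TraCI Python client

# Illustrative setup for the Section 4.1 scenario. The config file, route ID,
# and vehicle/type IDs are assumptions, not taken from the paper.
traci.start(["sumo", "-c", "platoon.sumocfg", "--step-length", "0.1"])

# "hdv_type" is assumed to be declared with carFollowModel="Krauss" in the
# route file; an imperfection (sigma) of 0.7 gives the leader high randomness.
traci.vehicletype.setImperfection("hdv_type", 0.7)

# One HDV leader followed by five CAVs, 20 m apart, all starting at 10 m/s.
ids = [("hdv", "hdv_type")] + [(f"veh{i}", "cav_type") for i in range(5)]
for k, (vid, vtype) in enumerate(ids):
    traci.vehicle.add(vid, routeID="straight", typeID=vtype,
                      departPos=str(120 - 20 * k), departSpeed="10")

for _ in range(600):        # 60 s horizon at a 0.1 s step length
    traci.simulationStep()  # per-step CAV accelerations would be applied here

traci.close()
        </preformat>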
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Performance Evaluation</title>
        <p>During the creation of the experimental situations, we systematically compare the suggested
distribution-based MAPPO control strategy, the distribution-based PPO control strategy, and the value-based
PPO control strategy. The two PPO variants differ in that one implements the distribution-based mapping
mechanism presented in this paper while the other directly computes action values. Figures 2-5 display
the simulated outcomes with a 10% packet loss rate obtained from applying the three distinct strategies
to the CAV platoon.</p>
        <p>Our primary concern is the velocity variation of the platoon throughout the entire procedure. Figure 2
demonstrates that the distribution-based MAPPO method achieves smoother and more stable speed
transitions, aligning well with the expected performance of a distribution-based strategy. In contrast,
while the distribution-based PPO employs a similar strategy, its overall speed fluctuations are more
pronounced compared to MAPPO, indicating greater difficulty in achieving stability under the same
conditions. Finally, the value-based PPO method, although it eventually reached a consistent speed,
exhibited notable disparities and intermittent fluctuations in speed as compared to the leading
HDV. This indicates a possible delay in response or adjustment.</p>
        <p>Subsequently, we examine variations in the acceleration of the three control strategies. As shown
in Figure 3, the acceleration of the distribution-based MAPPO strategy indicates that the vehicle
experiences rapid acceleration and soon stabilizes within the first 100 time steps. This demonstrates
a high level of smoothness and stability, efficiently managing the instability caused by unreliable
communication and the HDV. Furthermore, despite the distribution-based PPO strategy exhibiting faster
acceleration during the initial stages, it is evident that it still cannot quickly adjust to the instability of
the HDV and is marginally less adaptable. Overall, the platoons managed by the distribution-based PPO
method and the MAPPO strategy demonstrate a superior capacity to adjust to the unpredictability
and volatility of the HDV. This is evident from the fact that the CAVs align more closely with the
frequency of acceleration changes in the HDV. The value-based PPO strategy exhibits the greatest
variation in acceleration during the entire experiment, particularly in the initial phase. The estimation
of acceleration by the CAVs is consistently lower than the actual value of the HDV, indicating a clear deficiency
in dynamic adaptability.</p>
        <p>Figure 4 demonstrates the strong consistency and stability of the distribution-based MAPPO and
distribution-based PPO techniques in regulating the distance between vehicles. These two strategies
efficiently achieve a rapid stabilization of vehicle spacing, minimizing fluctuations between vehicles
and ensuring the overall formation and safe distances of the platoon. Nevertheless, the value-based
PPO strategy demonstrated a substantial disparity from the leading vehicle, particularly throughout
the middle and late phases of the experiment. The reason for this could be that value-based PPO may
lack the flexibility of distribution-based methods when it comes to handling rapidly changing network
conditions. As a result, it becomes challenging to maintain constant distances between cars.</p>
        <p>Finally, we documented the reward curves as shown in Figure 5. From the reward curves, it is
evident that both the distribution-based MAPPO and PPO strategies exhibit a significant increase in
reward values during the initial phase of the experiment. However, after around 200 time steps, the
reward values reach a plateau, with a final value close to 20,000. The swift convergence indicates
that these two distribution-based strategies are highly effective in addressing
unpredictable and dynamic settings. Furthermore, they demonstrate a remarkable ability to learn and
optimize their performance. The value-based PPO strategy, on the other hand, while it
also shows an initial increase in rewards during the experiment, ultimately reaches a stable reward
value of around 14,000, which is considerably lower than the other two strategies. The results indicate
that distribution-based methods may offer better performance in a reinforcement learning system,
particularly in communication-limited contexts that demand adaptability and robustness.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Packet Loss Rate Effects</title>
        <p>
          In addition to performance comparisons and tests, we also conducted experiments on the impact of
the packet loss rate on platoon control in mixed traffic scenarios. These experiments served to assess the
algorithms’ robustness and generalizability. We have selected the relative congestion index (RCI) as our
evaluation criterion since it precisely represents the current condition of traffic flow operation. RCI
values exceeding 2 indicate a significant level of traffic congestion. Analysis has demonstrated that trip
rates are not a reliable indicator of traffic flow conditions in crowded traffic situations [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ]. In situations
of heavy traffic congestion, vehicles either move at slow rates with few or no changes in speed, or at
somewhat higher speeds with frequent speed changes.
        </p>
        <p>The RCI of the three algorithms was measured at packet loss rates of 10%, 30%, and 50%.</p>
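        <p>Since the paper does not state the RCI formula, the sketch below uses one common formulation, the extra travel time relative to free flow, purely as an illustrative assumption; it is consistent with the convention that RCI values above 2 indicate extremely heavy congestion.</p>
        <preformat>
def relative_congestion_index(travel_time: float, free_flow_time: float) -> float:
    """Assumed RCI: (actual - free-flow travel time) / free-flow travel time."""
    return (travel_time - free_flow_time) / free_flow_time

# Example: a 1 km segment with a 30 m/s free-flow speed takes 33.3 s unimpeded;
# an observed traversal of 75 s gives RCI = (75 - 33.3) / 33.3, roughly 1.25.
print(relative_congestion_index(75.0, 33.3))
        </preformat>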
        <p>Table 2 displays the RCI values of the three algorithms at various packet loss rates for comparison. The
data clearly demonstrates that the distribution-based MAPPO method consistently achieves outstanding
performance across all test conditions. The RCI value only exhibits a modest increase from 1.247 to 1.449,
indicating exceptional control even in the presence of a high packet loss rate of 50%. This demonstrates
the exceptional flexibility and resilience of the strategy in ever-changing and demanding conditions.</p>
        <p>The distribution-based PPO strategy shows commendable performance, as seen by its RCI value
remaining within the lower range of up to 1.597. This suggests that it possesses effective traffic flow control
capabilities, though these are slightly inferior to those of MAPPO.</p>
        <p>At 50% packet loss, the RCI value of the value-based PPO method exhibits substantial oscillations,
with a peak value of 1.684, which is notably greater than the RCI values of the other two techniques.
This tendency indicates that the value-based method may have more difficulty in sustaining platoon
stability and smoothness in environments with substantial packet loss, particularly in situations that
need quick adaptation to changes in the mixed environment.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>This paper introduces a MADRL strategy for CAVs in contexts with inconsistent communication. Our
methodology enhances classic DRL by incorporating uncertainty in the platoon’s HDVs and accounting
for the effects of varying packet loss rates on the control algorithm. The suggested technique aims to
alleviate the effects of unreliable communication links on control signals and improve the vehicles’
reactions by using a novel distribution-based action mapping approach and a weighted aggregate reward
function. During our research, we perform simulations with a group of six vehicles consisting
of one HDV at the front and five CAVs following behind. The purpose is to assess the effectiveness of
our proposed algorithm under different levels of communication reliability. The experiments involve
performance analysis of the various algorithms and evaluation of their impact under varied packet
loss rates. The results demonstrate that the suggested distribution-based control method surpasses the
two analyzed DRL algorithms in terms of platoon control performance. The results of our study show
that using MADRL is both feasible and beneficial for improving the control of CAVs in
mixed traffic environments. This strategy significantly outperforms current approaches in dealing
with uncertainty in communication. To expand on this study, future research could investigate the
integration of lateral control into the CAV framework to more effectively handle the intricacies of
merge, diverge, or reroute operations. Furthermore, the development of more precise modeling and
prediction techniques for HDVs could lead to the implementation of more resilient and effective control
mechanisms.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was supported by the Science Foundation Ireland Centre for Research Training in Artificial
Intelligence (CRT AI) under Grant No. 18/CRT/6223, and the Science Foundation Ireland CONNECT
Research Centre Phase 2, Grant 13/RC/2077_P2. For the purpose of Open Access, the author has
applied a CC BY public copyright license to any Author Accepted Manuscript version arising from this
submission.</p>
      <sec id="sec-6-1">
        <title>The sources for the ceur-art style are available via</title>
        <p>• GitHub,
• Overleaf template.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Falcone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Borrelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Asgari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Tseng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hrovat</surname>
          </string-name>
          ,
          <article-title>Predictive active steering control for autonomous vehicle systems</article-title>
          ,
          <source>IEEE Transactions on Control Systems Technology</source>
          <volume>15</volume>
          (
          <year>2007</year>
          )
          <fpage>566</fpage>
          -
          <lpage>580</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] E. F. Camacho, C. Bordons, Model Predictive Control, Springer Science &amp; Business Media, 2004.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Varnhagen</surname>
          </string-name>
          , et al.,
          <article-title>Path planning for autonomous vehicles using model predictive control</article-title>
          ,
          <source>in: IEEE Intelligent Vehicles Symposium</source>
          , IEEE,
          <year>2017</year>
          , pp.
          <fpage>797</fpage>
          -
          <lpage>802</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J. B.</given-names>
            <surname>Rawlings</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. Q.</given-names>
            <surname>Mayne</surname>
          </string-name>
          ,
          <source>Model Predictive Control: Theory and Design</source>
          , Nob Hill Publishing,
          <year>2009</year>
          . URL: https://nobhillpublishing.com/MPC/.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] S. J. Qin, T. A. Badgwell, A survey of industrial model predictive control technology, Control Engineering Practice 11 (2003) 733-764. URL: https://doi.org/10.1016/S0967-0661(02)00186-7.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] D. Q. Mayne, J. B. Rawlings, C. Rao, P. Scokaert, Constrained model predictive control: Stability and optimality, Automatica 36 (2000) 789-814. URL: https://doi.org/10.1016/S0005-1098(99)00214-9.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Campisi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Severino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Al-Rashid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Pau</surname>
          </string-name>
          ,
          <article-title>The development of the smart cities in the connected and autonomous vehicles (cavs) era: From mobility patterns to scaling in cities</article-title>
          ,
          <source>Infrastructures</source>
          <volume>6</volume>
          (
          <year>2021</year>
          )
          <fpage>100</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] L. Dormehl, The Formula: How Algorithms Solve All Our Problems... and Create More, Penguin, 2014.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>E.</given-names>
            <surname>Guizzo</surname>
          </string-name>
          ,
          <article-title>How google's self-driving car works</article-title>
          ,
          <source>IEEE Spectrum 18</source>
          (
          <year>2011</year>
          )
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>C.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ozbay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ban</surname>
          </string-name>
          ,
          <article-title>Developments in connected and automated vehicles</article-title>
          ,
          <source>Journal of Intelligent Transportation Systems</source>
          <volume>31</volume>
          (
          <year>2017</year>
          )
          <fpage>154</fpage>
          -
          <lpage>164</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>B.</given-names>
            <surname>Kiran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sobh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Talpaert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mannion</surname>
          </string-name>
          , et al.,
          <article-title>Deep reinforcement learning for autonomous driving: A survey</article-title>
          ,
          <source>IEEE Transactions on Intelligent Transportation Systems</source>
          <volume>22</volume>
          (
          <year>2021</year>
          )
          <fpage>712</fpage>
          -
          <lpage>733</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <article-title>Integrated networking, caching, and computing for connected vehicles: A deep reinforcement learning approach</article-title>
          ,
          <source>IEEE Transactions on Vehicular Technology</source>
          <volume>66</volume>
          (
          <year>2017</year>
          )
          <fpage>10660</fpage>
          -
          <lpage>10675</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] J. M. Anderson, N. Kalra, K. D. Stanley, P. Sorensen, C. Samaras, O. A. Oluwatola, Autonomous Vehicle Technology: A Guide for Policymakers, Rand Corporation, 2014.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>R.</given-names>
            <surname>Garcia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tawadrous</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <article-title>A survey of deep learning techniques for autonomous driving</article-title>
          ,
          <source>Journal of Robotics and Automation</source>
          <year>2015</year>
          (
          <year>2015</year>
          )
          <fpage>111</fpage>
          -
          <lpage>122</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] H. J. Li, D. Sun, Cooperative control of heterogeneous connected vehicle platoons: An adaptive leader-following approach, IEEE Transactions on Intelligent Transportation Systems 20 (2019) 761-772.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] J. R. Wang, H. W. Zhang, Robust cooperative adaptive cruise control of heterogeneous vehicle platoons with uncertain dynamics, IEEE Transactions on Intelligent Transportation Systems 21 (2020) 1177-1187.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>T. L.</given-names>
            <surname>Willke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Tientrakool</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. F.</given-names>
            <surname>Maxemchuk</surname>
          </string-name>
          ,
          <article-title>A survey of inter-vehicle communication protocols and their applications</article-title>
          ,
          <source>IEEE Communications Surveys &amp; Tutorials</source>
          <volume>11</volume>
          (
          <year>2009</year>
          )
          <fpage>3</fpage>
          -
          <lpage>20</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>H.</given-names>
            <surname>Hartenstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. P.</given-names>
            <surname>Laberteaux</surname>
          </string-name>
          ,
          <article-title>A tutorial survey on vehicular ad hoc networks</article-title>
          ,
          <source>IEEE Communications Magazine</source>
          <volume>46</volume>
          (
          <year>2008</year>
          )
          <fpage>164</fpage>
          -
          <lpage>171</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>H.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ran</surname>
          </string-name>
          ,
          <article-title>A deep reinforcement learning-based distributed connected automated vehicle control under communication failure</article-title>
          ,
          <source>Computer-Aided Civil and Infrastructure Engineering</source>
          <volume>37</volume>
          (
          <year>2022</year>
          )
          <fpage>2033</fpage>
          -
          <lpage>2051</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>H.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Nie</surname>
          </string-name>
          ,
          <article-title>Connected automated vehicle cooperative control with a deep reinforcement learning approach in a mixed traffic environment</article-title>
          ,
          <source>Transportation Research Part C: Emerging Technologies</source>
          <volume>133</volume>
          (
          <year>2021</year>
          )
          <fpage>103421</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Deep multi-agent reinforcement learning for highway on-ramp merging in mixed traffic</article-title>
          ,
          <source>IEEE Transactions on Intelligent Transportation Systems</source>
          <volume>23</volume>
          (
          <year>2022</year>
          )
          <fpage>113</fpage>
          -
          <lpage>125</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] Y. Zhang, J. Zhao, G. Cao, Data dissemination in vehicular ad hoc networks, IEEE Signal Processing Magazine 28 (2011) 84-94.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] S. Ju, P. van Vliet, O. Arenz, J. Peters, Digital twin of a driver-in-the-loop race car simulation with contextual reinforcement learning, IEEE Robotics and Automation Letters (2023). URL: https://www.ias.informatik.tu-darmstadt.de/uploads/Site/EditPublication/RAL_Siwei_Ju.pdf.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] A. Pinosky, I. Abraham, A. Broad, B. Argall, T. Murphey, Hybrid control for combining model-based and model-free reinforcement learning, The International Journal of Robotics Research 42 (2023) 337-355. doi:10.1177/02783649221083331.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] Zhou et al., RL-based car-following model for CAVs, Journal of Advanced Transportation 45 (2020) 123-134.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] S. Qu, L. Song, Z. Zhang, J. Ren, A physics-informed generative car-following model for connected autonomous vehicles, Entropy 25 (2023) 1050. doi:10.3390/e25071050.</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>[27] Y. Savid, R. Mahmoudi, R. Maskeliunas, R. Damaševičius, Simulated autonomous driving using reinforcement learning: A comparative study on Unity's ML-Agents framework, Information 14 (2023) 290. doi:10.3390/info14050290.</mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>K.</given-names>
            <surname>Hamad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kikuchi</surname>
          </string-name>
          ,
          <article-title>Developing a measure of traffic congestion: Fuzzy inference approach</article-title>
          ,
          <source>Transportation Research Record</source>
          <year>1802</year>
          (
          <year>2002</year>
          )
          <fpage>77</fpage>
          -
          <lpage>85</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>