<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>X (V. Levytskyi);</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Traffic optimization in a simple network using Deep Reinforcement Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Volodymyr Levytskyi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oleksii Lopuha</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pavlo Kruk</string-name>
          <email>kruk97@ukr.net</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Kyiv National University of Construction and Architecture</institution>
          ,
          <addr-line>31, Air Force Avenue, Kyiv, 03037</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0001</lpage>
      <abstract>
        <p>The optimization of traffic flow in urban environments remains a major challenge in contemporary research, despite the substantial body of scientific literature dedicated to this issue. While significant progress has been made, a universally effective solution applicable to real-world conditions has yet to be developed. One of the primary challenges lies in processing the vast amounts of incoming traffic data continuously collected from sensors distributed across urban road networks. Traditionally, due to the scale of this task, researchers have concentrated on designing systems with localized agents. These agents typically manage traffic at individual intersections, with coordination occurring within multi-agent systems. Recent advancements, however, address the complexity and volume of input data through the integration of deep learning techniques. In particular, the deep deterministic policy gradient (DDPG) algorithm has been proposed as a means to process extensive traffic data efficiently. An experimental study was conducted using a simplified intersection model to assess the effectiveness of this approach. The results indicate that the DDPG algorithm outperformed Q-learning in this scenario, achieving a reward in the range of 4 to 4.3 points, compared to Q-learning's reward range of 2 to 4 points. Performance evaluation, based on the average reward per episode, revealed that while both DDPG and Q-learning attained comparable reward levels, DDPG demonstrated more stable convergence (0.04-0.21 points), whereas Q-learning exhibited greater instability (0.04-0.43 points). Furthermore, an analysis of intra-episode performance indicated that DDPG achieved performance gains primarily toward the latter stages of each episode. Overall, the algorithm proved to be effective within this experimental framework, and the findings provide a foundation for further enhancements and applications in more complex traffic management scenarios.</p>
      </abstract>
      <kwd-group>
        <kwd>DDPG</kwd>
        <kwd>Aimsun</kwd>
        <kwd>Q-learning</kwd>
        <kwd>road traffic</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Conventional fixed-time traffic lights operate with predetermined
durations for red, yellow, and green phases, regardless of real-time traffic conditions. In contrast,
actuated traffic lights dynamically adjust their signal phases based on data collected from local
traffic sensors installed near intersections. While this adaptive mechanism improves traffic flow
at individual intersections, it can hinder synchronization with adjacent intersections, reducing its
effectiveness in densely populated urban settings [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ].
      </p>
      <p>Furthermore, existing systems do not fully integrate city-wide traffic flow data. Although
networked vehicle technologies have the potential to anticipate congestion, their practical
implementation is often restricted to routine interventions, such as manual traffic direction by
law enforcement. As a result, advanced strategies that could leverage comprehensive traffic data
to optimize traffic light performance remain largely underutilized.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Literature review</title>
      <p>
        Machine learning presents a significant opportunity to improve traffic control systems by
optimizing signal timing based on real-time traffic conditions. Numerous studies have explored
the potential of machine learning in this domain. One notable example is a reinforcement
learning system developed using the Green Light District simulator, which serves as a
foundation for further research in this field [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Additional methodologies include fuzzy logic
and multi-agent systems, where agents managing individual intersections either exchange
information or respond to collective data from connected vehicles [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]. However, these
approaches are often limited to isolated intersections or small networks, as they rely on partial
traffic data and fail to fully exploit the breadth of available information.
      </p>
      <p>
        A promising strategy to overcome these limitations is deep reinforcement learning (DRL), a
class of algorithms that leverage deep neural networks as function approximators for value
functions. The rise of DRL is largely attributed to the success of Deep Q-Networks (DQN), which
demonstrated exceptional performance in playing Atari games by processing raw pixel inputs
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        Within the DQN framework, a neural network receives the state of the environment as input
and generates Q-values for all possible actions as output. This process is guided by a loss
function that determines the gradient direction. The initial success of reinforcement learning
using neural networks for function approximation can be traced back to TD-Gammon [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
However, despite early enthusiasm, this approach proved ineffective when applied to a wider
range of problems, ultimately leading to its decline [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The key challenges contributing to its
limitations include:
      </p>
      <p>
        The neural network was trained on values generated in real-time, resulting in sequential
data that exhibited strong correlations with previous values, thereby violating the
assumption of independent and identically distributed (i.i.d.) data [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>Policy fluctuations arose due to minor variations in Q-values, which in turn affected the
distribution of training data.</p>
      <p>Large optimization steps occurred when substantial rewards were received, leading to
instability in learning.</p>
      <p>To address these stability issues, the authors in [10] proposed several improvements,
including:</p>
      <p>Experience replay: This technique involves storing past actions and rewards in memory,
allowing the neural network to be trained on randomly sampled experiences rather than
sequential real-time data. This mitigates the issue of temporal autocorrelation and takes
advantage of the off-policy nature of Q-learning.</p>
      <p>Reward scaling: By constraining reward values within the range of [−1, +1], this
approach prevents excessive weight magnification during error backpropagation.</p>
      <p>Target network: Two DQNs are utilized, where one computes target values while the
other accumulates weight updates, which are periodically transferred to the first
network. This mechanism helps stabilize policy updates by reducing oscillations in
Q-values.</p>
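      <p>As a concrete illustration of the experience-replay and reward-scaling mechanisms described above, the following minimal Python sketch shows a replay buffer that clips stored rewards; it is a didactic reconstruction consistent with the description, not code from the cited papers.</p>
      <preformat>
```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (state, action, reward, next_state) transitions and returns
    uniformly sampled minibatches, breaking the temporal correlation of
    sequentially generated experience."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted

    def add(self, state, action, reward, next_state):
        # Reward scaling: clip the reward into [-1, +1] before storing.
        reward = max(-1.0, min(1.0, reward))
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform random sampling restores an approximately i.i.d. batch.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```
      </preformat>
      <p>Training then draws decorrelated minibatches from this buffer instead of consuming transitions in the order they were generated.</p>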
      <p>Despite these advancements, DQNs are primarily suited for discrete action spaces, making
them less effective for continuous action environments such as traffic light control, where
scalability remains a major challenge. To address this limitation, the Deep Deterministic Policy
Gradient (DDPG) algorithm has been introduced [11]. As its name suggests, DDPG integrates
the classical actor-critic framework of reinforcement learning with a deterministic policy
gradient [12].</p>
      <p>The foundational policy gradient algorithm was originally formulated in [10], where the
policy gradient theorem for stochastic policies was established. This theoretical groundwork
has since facilitated the development of more advanced methods for optimizing traffic control
systems in complex urban environments.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Purpose of the work</title>
      <p>This study aims to empirically validate the effectiveness of the DDPG algorithm in optimizing
traffic flow management in urban environments. The primary focus is on evaluating the
performance of the proposed DDPG approach relative to Q-learning using a simulated
intersection model. Furthermore, the study examines the progression of both the DDPG and
Q-learning algorithms within this simulation over time, providing a comparative analysis of their
evolution and effectiveness.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Traffic simulation concepts</title>
      <p>To assess the effectiveness of this approach, a traffic simulation tool is employed. Specifically,
Aimsun [13], a third-party microscopic traffic simulation software, has been selected to model
various traffic scenarios, including road networks, intersections, and traffic signals.</p>
      <p>A fundamental component of any traffic simulation is the network, which represents
roadways and intersections where vehicles navigate. Certain roads are linked to centroids that
function as sources and sinks for vehicles. The number of vehicles generated or absorbed by
these centroids is determined by an origin-destination demand matrix, in which each cell
represents the flow of vehicles between a specific origin and destination centroid. Different
demand matrices can be applied at various time intervals during the simulation to reproduce
real-world traffic dynamics accurately.</p>
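      <p>An origin-destination demand matrix is simple to represent in code; the sketch below uses invented centroid names and vehicle counts purely for illustration.</p>
      <preformat>
```python
# Hypothetical origin-destination (OD) demand matrix for three centroids;
# names and counts are invented for illustration only.
centroids = ["north", "south", "east"]
# demand[i][j]: vehicles travelling from origin centroid i to
# destination centroid j during one demand interval.
demand = [
    [0, 150, 80],   # from north
    [150, 0, 60],   # from south
    [90, 70, 0],    # from east
]

def flow(origin, destination):
    """Vehicles released at `origin` bound for `destination`."""
    return demand[centroids.index(origin)][centroids.index(destination)]

def total_generated(origin):
    """Total vehicles a source centroid releases over the interval."""
    return sum(demand[centroids.index(origin)])
```
      </preformat>
      <p>Swapping in a different matrix per time interval models time-varying demand, as described above.</p>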
      <p>Road networks in the simulation often incorporate motion detectors that emulate induction
loops embedded in the roadway surface. These detectors collect essential traffic data, including
vehicle counts, average speeds, and occupancy rates.</p>
      <p>Traffic signals regulate vehicle movement at intersections, ensuring an orderly flow of traffic
and preventing congestion. Signals at a given intersection are coordinated so that one direction
receives a green light while the opposing direction sees a red light. This synchronization
facilitates efficient traffic management. The collective state of all signals at an intersection over
a defined period is referred to as a phase, which is determined by both the signal states and their
durations. The sequence of these phases forms a control plan, which is executed cyclically over
time. To further optimize traffic flow and minimize delays, control plans at adjacent
intersections are often synchronized.</p>
      <p>Once the simulation begins, centroids release vehicles according to the demand matrix,
directing them toward their respective destination centroids. To ensure realistic conditions, a
warm-up period is typically implemented, during which centroids generate vehicles to populate
the network before the official commencement of the simulation.</p>
    </sec>
    <sec id="sec-5">
      <title>5. DDPG and Q-learning</title>
      <p>Most reinforcement learning (RL) algorithms are grounded in the Bellman equation:
V^π(s) = R(s, π(s)) + γ Σ_{s'} P(s' | s, π(s)) V^π(s'), (1)
which defines the expected long-term reward — denoted as the value function V — for taking an
action determined by a given policy π, based on the immediate reward R(s, π(s)) received in
state s. The discount factor γ (where 0 ≤ γ &lt; 1) regulates the relative importance of immediate
versus future rewards [14]. The term Σ_{s'} P(s' | s, π(s)) V^π(s') represents the weighted sum
of the values of possible future states s', where each weight corresponds to the probability of
transitioning from state s to s' after executing action π(s). Here, V^π(s') reflects the expected
cumulative reward achievable from future state s' under policy π.</p>
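      <p>Equation (1) can be solved iteratively for a fixed policy. The sketch below applies the Bellman backup to a toy two-state MDP; the state names, rewards, and transition probabilities are invented for illustration and are not data from this study.</p>
      <preformat>
```python
def policy_evaluation(states, reward, transition, gamma=0.9, tol=1e-8):
    """Repeatedly applies the Bellman backup of Equation (1),
    V(s) = R(s, pi(s)) + gamma * sum over s' of P(s'|s, pi(s)) * V(s'),
    for a fixed policy until no state value changes by more than tol."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = reward[s] + gamma * sum(p * V[s2] for s2, p in transition[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if tol > delta:
            return V

# Toy two-state "traffic" MDP: the policy is implicit in the per-state
# rewards and transition probabilities.
states = ["free_flow", "congested"]
reward = {"free_flow": 1.0, "congested": 0.0}
transition = {
    "free_flow": [("free_flow", 0.8), ("congested", 0.2)],
    "congested": [("free_flow", 0.5), ("congested", 0.5)],
}
values = policy_evaluation(states, reward, transition)
```
      </preformat>
      <p>At convergence the returned values satisfy Equation (1) as a fixed point, which is exactly the property deep RL methods approximate with neural networks.</p>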
      <p>Deep Deterministic Policy Gradient (DDPG) is a model-free, off-policy algorithm that
integrates key elements of Q-learning — specifically, action-value function approximation — with the
actor-critic architecture of reinforcement learning.</p>
      <p>Theorem 1 (Policy Gradient). For any Markov decision process (MDP), if the policy
parameters θ are updated in proportion to the gradient of the performance measure ρ, then θ is
guaranteed to converge to a locally optimal policy with respect to ρ. The policy gradient is
computed as follows:
Δθ ≈ α ∂ρ/∂θ = α Σ_s d^π(s) Σ_a [∂π(s, a)/∂θ] Q^π(s, a), (2)
where α is a positive step size, and ∂ρ/∂θ represents the gradient of the policy performance with
respect to the parameters θ. The term Σ_s d^π(s) denotes the weighted influence of states, where
d^π is defined as the discounted visitation distribution over states encountered when starting
from an initial state s0 and following policy π: d^π(s) = Σ_{t=0}^∞ γ^t P(s_t = s | s0, π), where γ is
the discount factor that determines the present value of future rewards. The term Σ_a ∂π(s, a)/∂θ
refers to the gradient of the policy π(s, a), indicating how changes in the parameters θ influence
the probability of selecting action a in state s. Finally, Q^π(s, a) is the action-value function, which
represents the expected cumulative reward when taking action a in state s and subsequently
following policy π [10].</p>
      <p>This theorem was further extended in the same paper to accommodate cases where an
approximation function f is used in place of the true action-value function Q^π, as formalized in
Theorem 2:</p>
      <p>Theorem 2 (Policy Gradient with Function Approximation). The policy gradient
theorem still holds when Q^π is replaced by an approximation function f(s, a; w), provided that
the weights w converge to a local optimum of the approximation error, that is:
Σ_s d^π(s) Σ_a π(s, a) [Q^π(s, a) − f(s, a; w)] ∂f(s, a; w)/∂w = 0. (3)
Here, [Q^π(s, a) − f(s, a; w)] represents the temporal-difference error, which quantifies the
discrepancy between the true action-value function Q^π(s, a) and its approximation f(s, a; w).
This error guides the optimization process by indicating how far the current estimate f deviates
from the actual value under policy π.</p>
      <p>The term ∂f(s, a; w)/∂w denotes the gradient of the approximating function with respect to
its parameters w. This gradient reflects how changes in the parameters affect the estimated value
for a given state-action pair and is used during the update step to minimize the approximation
error. Together, these elements are central to the learning process in actor-critic and
function-approximation-based reinforcement learning algorithms, where the goal is to iteratively
adjust w to reduce the gap between the estimated and actual value functions.</p>
      <p>If the function f is compatible with the policy parameterization — meaning it satisfies the
condition ∂f(s, a; w)/∂w = [∂π(s, a)/∂θ] · 1/π(s, a), where 1/π(s, a) serves as a scaling factor
accounting for the inverse of the probability of selecting action a in state s — then the policy
gradient can be simplified as ∂ρ/∂θ = Σ_s d^π(s) Σ_a [∂π(s, a)/∂θ] f(s, a; w) [10]. This
result is foundational in compatible function approximation, as it allows the use of function
approximators (e.g., neural networks) in policy gradient methods while preserving convergence
guarantees under certain conditions.</p>
      <p>To enhance the stability of the Deep Deterministic Policy Gradient (DDPG) algorithm, the
same stabilization techniques used in Deep Q-Networks (DQNs) can be employed — namely,
reward normalization, experience replay, and the incorporation of a separate target network.
For the latter, DDPG integrates two additional target networks — one for the actor and one for
the critic — which are utilized exclusively for computing the target Q-values. These target
networks remain distinct from the primary actor and critic networks. While the primary
networks are updated at every training step, their weights are used to softly update the
corresponding target networks over time.</p>
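      <p>The soft target-network update described above amounts to Polyak averaging. In the sketch below, plain Python lists stand in for the actor and critic weight tensors, and the interpolation coefficient tau is an assumed hyperparameter rather than a value taken from this study.</p>
      <preformat>
```python
def soft_update(target_weights, source_weights, tau=0.001):
    """Polyak averaging: target = tau * source + (1 - tau) * target.
    A small tau makes the target network trail the primary network
    slowly, which stabilizes the critic's Q-value targets."""
    return [tau * s + (1.0 - tau) * t
            for t, s in zip(target_weights, source_weights)]

# Example: the target vector gradually tracks the primary weights.
primary = [1.0, -2.0, 0.5]
target = [0.0, 0.0, 0.0]
for _ in range(5):
    target = soft_update(target, primary, tau=0.5)
```
      </preformat>
      <p>With a realistically small tau (e.g., 0.001), the target networks change only slightly per training step, as the text describes.</p>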
      <p>Although the theoretical foundations and implementation strategies for DDPG and
Q-learning are well established in the literature, their application to real-world traffic optimization
scenarios remains underexplored. The novelty of the present study lies in the integration of
these algorithms with a high-fidelity microscopic simulation environment—specifically, the
Aimsun platform. Moreover, the study introduces new performance evaluation metrics, such as
the speed_score, which more accurately captures the influence of traffic control on overall traffic
flow dynamics.</p>
      <p>The research also highlights the advantages of DDPG over Q-learning in environments
characterized by continuous action spaces — an aspect that has not been thoroughly examined
within the domain of traffic signal control. As such, this work offers both theoretical
contributions regarding the stability and adaptability of reinforcement learning algorithms, and
practical insights for their deployment in real-world urban traffic networks.</p>
      <p>In conclusion, the findings of this study extend the applicability of reinforcement learning
techniques, offering a novel perspective on their integration into complex and dynamic traffic
management systems.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Input data</title>
      <p>Utilizing a simulation environment to assess a deep learning-based traffic management system
enables comprehensive monitoring of traffic conditions. However, for such a system to be
applicable in real-world scenarios, it is imperative that its input data originate from sources that
are commonly available within existing urban transportation infrastructures.</p>
      <p>Among the most accessible and widely deployed sources of traffic data are traffic detectors—
sensors strategically positioned throughout road networks to monitor vehicular activity. Of
these, inductive loop detectors embedded beneath the road surface are the most prevalent.
These detectors generate real-time data as vehicles traverse them, typically reporting the
following key metrics:
      </p>
      <p>Vehicle count: The number of vehicles detected during a specified sampling interval.</p>
      <p>Average speed: The mean velocity of vehicles over the sampling period.</p>
      <p>Occupancy: The proportion of time during which the detection area is occupied by a
vehicle, which serves as a valuable indicator of congestion levels.</p>
      <p>To ensure the feasibility of implementing the model in practical settings, its input parameters
are constrained to these standard outputs provided by traffic detectors: vehicle count, average
speed, and occupancy. Furthermore, the model incorporates a detailed representation of the
transport network, including road geometry, intersections, and their interconnections.</p>
      <p>This constraint guarantees the model's compatibility with real-world systems, thereby
facilitating a smooth transition from simulated environments to actual deployment in urban
traffic infrastructures. In accordance with this design principle—limiting data inputs to those
available under real-world conditions—a traffic data summary is constructed using vehicle
count, average speed, and occupancy metrics.</p>
      <p>Consequently, a performance metric referred to as the speed score (speed_score) is
introduced. For a given detector i, it is calculated as follows:
speed_score_i = min(avg_speed_i / max_speed_i, 1.0). (4)</p>
      <p>Here, avg_speed_i represents the average speed recorded by motion detector i, while max_speed_i
denotes the designated speed limit on the road segment where detector i is situated. Therefore,
the resulting speed score is bounded within the interval [0, 1]. This normalized metric is the
foundation for constructing both the environmental state representation and the reward
function within the reinforcement learning framework.</p>
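      <p>Equation (4) reduces to a few lines of code. The helper below is a hypothetical implementation consistent with the definition above, not the authors' code.</p>
      <preformat>
```python
def speed_score(avg_speed, max_speed):
    """Normalized speed for one detector, per Equation (4): the ratio of
    the measured average speed to the segment speed limit, capped at 1.0
    so the score always lies in [0, 1]. Assumes a positive speed limit."""
    return min(avg_speed / max_speed, 1.0)
```
      </preformat>
      <p>For example, a detector reporting 25 km/h on a 50 km/h segment yields a score of 0.5, while any average speed at or above the limit saturates at 1.0.</p>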
      <p>The experimental setup employs a microscopic traffic simulator that advances the simulation
in discrete time steps. At each of these steps, the simulator updates the state variables—such as
position, velocity, and acceleration—for all vehicles, based on underlying system dynamics. By
default, each simulation step corresponds to a time increment of 0.75 seconds, a setting that
remains unchanged throughout this study.</p>
      <p>However, this interval is insufficient to capture meaningful fluctuations in traffic flow, as
detected by roadside sensors. Thus, a longer temporal window is introduced to aggregate traffic
data. This aggregation period is referred to as an episode step, or simply as a “step” when
unambiguous.</p>
      <p>The process unfolds as follows:
1. Traffic data is collected at each simulation step but is aggregated over each episode step.
2. The aggregated data is subsequently used as input for the DDPG algorithm.
3. To compute a representative score over time, the speed scores from individual detectors
are combined using a weighted average, where the weights are proportional to the
vehicle counts observed at each detector.
4. The traffic signal timings produced by the DDPG model are applied to the traffic
network during the next episode step.</p>
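      <p>Step 3 of the procedure above, the count-weighted averaging of detector speed scores, might be implemented as follows; the detector readings are illustrative values, not measurements from the experiment.</p>
      <preformat>
```python
def aggregate_speed_score(scores, counts):
    """Count-weighted average of per-detector speed scores, so detectors
    observing more vehicles contribute proportionally more."""
    total = sum(counts)
    if total == 0:
        return 0.0  # no traffic observed during this episode step
    return sum(s * c for s, c in zip(scores, counts)) / total

# Illustrative episode-step aggregation over three detectors.
scores = [0.9, 0.4, 1.0]
counts = [10, 30, 5]
network_score = aggregate_speed_score(scores, counts)
```
      </preformat>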
      <p>The optimal duration for an episode step was identified via grid search, with the
best-performing interval determined to be 120 seconds.</p>
      <p>The speed-based metric described in Equation (4) is utilized to encode the environment's
state vector. This choice is justified not only by its ability to capture the degree of congestion
across the network but also by its incorporation of each road’s maximum allowable speed. As a
result, the environmental state is represented as a vector in which each component corresponds
to a specific detector and is computed as specified in Equation (5).</p>
      <p>state_i = speed_score_i, (5)
where state_i is the component of the state vector corresponding to detector i and speed_score_i is
its speed score as defined in Equation (4).</p>
      <p>The rationale for employing a speed-based indicator lies in its capacity to reflect overall
traffic flow efficiency: a higher value of this indicator implies that vehicles are traveling at
speeds closer to the maximum allowable limit on the given road segment, thereby indicating
smoother traffic flow conditions.</p>
      <p>In real-world traffic systems, regulatory mechanisms such as traffic lights and temporary
signage are typically employed to manage vehicle movements. To maintain a focused and
controlled simulation environment, this study exclusively considers traffic light control, thereby
avoiding the complexities and potential instabilities associated with the direct manipulation of
individual signal states. Instead, signal coordination is achieved by alternating predefined
phases at intersections — for example, granting a green signal to one direction while the
perpendicular direction remains red.</p>
      <p>Rather than adjusting the total signal cycle time, the control strategy modifies only the
relative durations of individual phases, thereby preserving temporal synchronization between
adjacent intersections. This approach facilitates more coherent traffic patterns across the
network. Phase durations are normalized using the softmax function — also known as the
normalized exponential function — which ensures that the sum of all phase durations remains
constant. Moreover, a scaling factor is applied to constrain the adjustments to 80% of the total
cycle, preserving a minimum phase duration for operational stability.</p>
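      <p>The softmax normalization with an 80% adjustable share might look as follows. The function name and the even split of the fixed 20% share across phases are assumptions consistent with the description above, not the authors' exact implementation.</p>
      <preformat>
```python
import math

def phase_durations(raw_actions, cycle_time, adjustable=0.8):
    """Maps unbounded agent outputs to phase durations that always sum
    to cycle_time. A fixed share (1 - adjustable) is split evenly across
    phases to guarantee a minimum duration; the remaining share is
    distributed via a numerically stable softmax over the raw actions."""
    shift = max(raw_actions)
    exps = [math.exp(a - shift) for a in raw_actions]
    total = sum(exps)
    weights = [e / total for e in exps]
    fixed = (1.0 - adjustable) * cycle_time / len(raw_actions)
    return [fixed + adjustable * cycle_time * w for w in weights]
```
      </preformat>
      <p>Because the softmax weights sum to one, the total cycle time is preserved regardless of the raw action values, which keeps adjacent intersections synchronized.</p>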
      <p>The reinforcement learning algorithm receives feedback via a reward signal, which
quantifies the effectiveness of previous actions. While traditional approaches frequently employ
travel-time-related metrics such as queue lengths or delays, the present method capitalizes on
data that is readily obtainable from real-world traffic detectors — namely, vehicle count and
average speed. The chosen reward metric is the speed estimate; however, to isolate the agent’s
contribution, a baseline is established using the speed estimate derived from a simulation
conducted in the absence of the reinforcement learning agent.</p>
      <p>The reward is then computed as the difference between the observed speed estimate and this
baseline, weighted by the number of vehicles at each detector to emphasize high-traffic zones.
To ensure numerical stability and tractability during training, the resulting value is further
scaled by a constant factor α, as defined in Equation (6).</p>
      <p>reward_i = α · count_i · (speed_score_i − baseline_i) = α · count_i · [min(avg_speed_i / max_speed_i, 1.0) − baseline_i], (6)
where reward_i is the reward at detector i, baseline_i is the baseline speed score, and count_i is the
number of vehicles observed.</p>
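      <p>Equation (6), together with the scaling factor α = 1/50 adopted later in this section, can be sketched per detector as follows; this is a hypothetical helper, not the authors' implementation.</p>
      <preformat>
```python
ALPHA = 1.0 / 50.0  # empirically chosen scaling factor from the text

def detector_reward(count, avg_speed, max_speed, baseline, alpha=ALPHA):
    """Per-detector reward of Equation (6): the count-weighted gain of
    the current speed score over the no-agent baseline, scaled by alpha
    to keep gradient updates numerically stable."""
    score = min(avg_speed / max_speed, 1.0)
    return alpha * count * (score - baseline)
```
      </preformat>
      <p>Weighting by the vehicle count emphasizes high-traffic detectors, as described above.</p>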
      <p>In order to normalize the reward values within the range [−1, 1], one potential strategy
considered was dividing the reward by the total number of vehicles detected across all sensors.
However, this approach was ultimately dismissed, as it introduced inconsistency in reward
magnitudes across different simulation stages—specifically, lower vehicle counts at a given time
step would artificially inflate reward values, thereby distorting performance evaluations over
time.</p>
      <p>Instead, a fixed scaling factor α = 1/50 was empirically selected. This factor effectively
maintains the scaled reward values within a manageable range, typically close to 1.0, thereby
supporting stable gradient updates during the training process. Crucially, this method preserves
the relative magnitudes of the original rewards, avoiding the loss of informative variability that
would result from more aggressive normalization techniques.</p>
      <p>Each traffic detector computes its reward independently at every simulation step, without
aggregating values across detectors. These individual rewards are then directly input into the
DDPG optimization algorithm. Given the stochastic characteristics of the microscopic traffic
simulator, the simulation results are sensitive to the choice of random seed. To mitigate this
variability and ensure consistent baseline comparisons, simulations used for benchmarking are
conducted with fixed, predefined seed values.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Experiment setup</title>
      <p>To assess the performance of the DDPG-based control policy, a series of simulations was
conducted using traffic networks of progressively increasing complexity. The Deep
Deterministic Policy Gradient (DDPG) algorithm governs all traffic light phases across the
network by leveraging input data from all detectors. For comparative purposes, two alternative
agents were employed: a Q-learning agent and a baseline random agent. The Q-learning agent
independently controls each traffic light phase using input from local detectors and operates in
discrete state and action spaces. In contrast, the random agent assigns arbitrary phase durations
drawn from a uniform distribution.</p>
      <p>The Q-learning implementation utilizes tile coding to discretize continuous state
variables into four distinct intervals. It adjusts phase durations based on a fixed set of predefined
ratios (e.g., 0.2, 0.5, 1.0), ensuring that the total cycle duration remains constant through
proportional phase adjustments. This architecture adopts a multi-agent configuration, in which
agents share some common input data but operate independently when selecting actions.</p>
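      <p>The four-interval state discretization and the ratio-based action set described above might be sketched as follows; the bin scheme and function names are invented for illustration and are not the authors' implementation.</p>
      <preformat>
```python
def discretize(speed_score, n_bins=4):
    """Maps a continuous speed score in [0, 1] onto one of n_bins
    equal-width intervals, as required by a tabular Q-learning agent."""
    idx = int(speed_score * n_bins)
    return min(idx, n_bins - 1)  # a score of exactly 1.0 joins the top bin

PHASE_RATIOS = (0.2, 0.5, 1.0)  # the predefined ratios named in the text

def discrete_action(action_index, base_duration):
    """Scales a phase's base duration by one of the predefined ratios."""
    return PHASE_RATIOS[action_index] * base_duration
```
      </preformat>
      <p>Each local agent then indexes its Q-table with such discrete states and actions, in contrast to DDPG's continuous action output.</p>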
      <p>The random agent assigns phase durations randomly within the interval [0, 1], subsequently
scaling the values to maintain consistent cycle lengths. Given the inherent stochasticity of the
microscopic traffic simulator, all experiments were conducted using randomized seed values to
prevent overfitting and to ensure the robustness of the results. The outcomes were averaged
over multiple simulation runs and reported using statistical aggregates (e.g., mean, maximum,
minimum) to enable consistent comparative analysis.</p>
      <p>The baseline scenario, depicted in Figure 1, features a simplified traffic network consisting of
a single intersection formed by two bidirectional two-lane roads. Vehicles may travel straight or
turn right at the intersection, while left turns are prohibited, thereby reducing the complexity of
both traffic dynamics and signal control logic. The intersection is equipped with eight detectors
—one positioned before and one after the intersection on each road.</p>
      <p>The traffic signal at the intersection operates on a two-phase cycle. Phase 1 enables
movement along the horizontal road, while Phase 2 permits vertical traffic flow. Phase 1 has a
duration of 15 seconds, and Phase 2 lasts for 70 seconds, with a 5-second interphase period
between them. These deliberately unbalanced phase durations are designed to induce vehicle
accumulation along the horizontal road, thereby creating an opportunity for the learning
algorithm to demonstrate improvement by adjusting phase lengths. Each simulation run spans a
duration of one hour, with a constant vehicle demand of 150 vehicles per centroid pair.</p>
    </sec>
    <sec id="sec-8">
      <title>8. Results</title>
      <p>To evaluate the performance of the DDPG algorithm in comparison with conventional
Q-learning and a random timing baseline on a specified test network, the primary evaluation
metric is the average reward per episode. As defined in Equation (6), the reward is represented
as a vector, with each element corresponding to a distinct traffic detector within the network.
Consequently, the average reward is computed as the mean value across all detector-specific
rewards within an episode. For consistency and comparability, the evaluation is based on the
best-performing experimental trial, which is defined as the trial that achieves the highest
average reward per episode throughout the simulation.</p>
      <p>Figure 2 presents a comparative analysis of performance across different algorithms. Both
the DDPG method and classical Q-learning attain comparable levels of average reward.
However, notable differences emerge in their convergence behaviors. While Q-learning exhibits
considerable instability and fluctuation in reward values over time, the DDPG approach
demonstrates significantly greater stability, maintaining consistent performance once peak
reward levels are achieved.</p>
      <p>Further insights into the algorithm's behavior can be gained by examining the performance
within individual episodes. Figure 3 illustrates the results of the first and the most successful
episodes of the DDPG algorithm. Performance is represented as the average reward per step,
accompanied by the corresponding minimum and maximum ranges. These statistics (mean,
minimum, and maximum) are computed across all trials conducted within a single experimental
run. The results indicate that, while improvements over the baseline are not uniformly
maintained throughout the episode, they remain evident up to its conclusion.</p>
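<p>The per-step statistics plotted in Figure 3 can be aggregated as below. The trial count, step count, and reward values are synthetic placeholders; only the aggregation (mean with min/max band across all trials of one run) mirrors the description:</p>

```python
import numpy as np

# step_rewards[trial, step]: average reward at each step of one episode,
# collected across every trial of a single experimental run (synthetic).
rng = np.random.default_rng(1)
step_rewards = rng.uniform(2.0, 4.0, size=(5, 60))   # (trials, steps)

mean_curve = step_rewards.mean(axis=0)   # central line in the figure
min_curve = step_rewards.min(axis=0)     # lower edge of the band
max_curve = step_rewards.max(axis=0)     # upper edge of the band
```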
      <p>A similar behavioral pattern is observed in the performance of the Q-learning algorithm, as
depicted in Figure 4.</p>
      <p>It is also essential to acknowledge several limitations of the present study:
</p>
      <p>Model Simplicity: The simulation is based on a simplified traffic network consisting of
a single intersection, which does not adequately capture the complexity of real-world
urban transportation systems. In practice, urban environments consist of numerous
interconnected intersections, multidirectional traffic flows, and more dynamic traffic
behavior.</p>
      <p>Parameter Sensitivity: The effectiveness of the algorithm is highly dependent on the
choice of hyperparameters, such as the discount factor and neural network
configurations. Suboptimal parameter selection may negatively impact both the
algorithm’s performance and its stability.</p>
      <p>Data Limitations: The simulation assumes ideal conditions, including precise sensor
measurements and the absence of unpredictable variables such as weather fluctuations,
traffic incidents, or driver behavior. These real-world factors can substantially influence
the effectiveness of the proposed algorithms.</p>
    </sec>
    <sec id="sec-9">
      <title>9. Conclusion</title>
      <p>The primary objective of this study was to experimentally assess the effectiveness of the DDPG
algorithm in optimizing traffic flow management within urban settings. Based on the conducted
simulations, the following key findings were obtained:

</p>
      <p>The DDPG algorithm exhibited high efficiency and robustness within a simplified
transportation network, achieving consistent convergence in contrast to Q-learning,
which demonstrated comparatively lower stability.</p>
      <p>Experimental results indicated that DDPG consistently increased the average reward
per episode toward the latter stages, suggesting its capacity to adapt to evolving traffic
conditions.
</p>
      <p>The introduced evaluation metric — the speed score — provided a more nuanced
assessment of the algorithm’s influence on traffic dynamics and proved to be a valuable
tool for training and performance evaluation.</p>
      <p>Despite the successful attainment of the study's objective, several limitations must be
acknowledged. These include the simplified nature of the model and the absence of validation
using real-world data, both of which warrant further investigation.</p>
      <p>Future research should focus on extending the model to encompass more complex
transportation networks and on validating the approach using data from actual urban
environments. Such advancements would enable a more comprehensive evaluation of the
DDPG algorithm’s applicability to real-world traffic management systems.</p>
    </sec>
    <sec id="sec-10">
      <title>Declaration on Generative AI</title>
<p>The author(s) have not employed any Generative AI tools.</p>
      <p>[10] R. Sutton, D. McAllester, S. Singh, Y. Mansour. Policy gradient methods for
reinforcement learning with function approximation. NIPS'99: Proceedings of the 13th
International Conference on Neural Information Processing Systems, 1999, pp. 1057-1063.
URL: https://dl.acm.org/doi/10.5555/3009657.3009806.</p>
      <p>[11] D. Chernyshev, S. Dolhopolov, T. Honcharenko, V. Sapaiev, M. Delembovskyi. Digital
Object Detection of Construction Site Based on Building Information Modeling and
Artificial Intelligence Systems. CEUR Workshop Proceedings, 2022, vol. 3039, pp. 267-279.</p>
      <p>[12] T. Rupprecht, Y. Wang. A survey for deep reinforcement learning in Markovian
cyber-physical systems: Common problems and solutions. Neural Networks, vol. 153, 2022,
pp. 13-36. URL: https://doi.org/10.1016/j.neunet.2022.05.013.</p>
      <p>[13] Aimsun Next user manual, version 24.0.1. URL: https://docs.aimsun.com/next/24.0.1/.</p>
      <p>[14] V. Levytskyi, P. Kruk, O. Lopuha, D. Sereda, V. Sapaiev, O. Matsiievskyi. Use of Deep
Learning Methodologies in Combination with Reinforcement Techniques within Autonomous
Mobile Cyber-physical Systems, 2024 IEEE. URL:
http://dx.doi.org/10.1109/SIST61555.2024.10629589.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E. van der</given-names>
            <surname>Pol</surname>
          </string-name>
          .
          <article-title>Deep reinforcement learning for coordination in traffic light control</article-title>
          ,
          <year>August 2016</year>
          . URL: https://www.researchgate.net/publication/315810688_Deep_Reinforcement_Learning_for_Coordination_in_Traffic_Light_Control_MSc_thesis.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Qasim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mahmood</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Shafait</surname>
          </string-name>
          , “
          <article-title>Rethinking Table Recognition using Graph Neural Networks</article-title>
          ,” in ICDAR,
          <year>2019</year>
          , URL: https://doi.org/10.48550/arXiv.1905.13391.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>LA</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Bhatnagar</surname>
          </string-name>
          ,
          <article-title>Reinforcement Learning With Function Approximation for Traffic Signal Control</article-title>
          .
          <source>IEEE Transactions on Intelligent Transportation Systems</source>
          ,
          <year>2012</year>
          , vol.
          <volume>12</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>412</fpage>
          -
          <lpage>421</lpage>
          . URL: https://doi.org/10.1109/TITS.2010.2091408.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Acharya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. K.</given-names>
            <surname>Dash</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Chaini</surname>
          </string-name>
          .
          <article-title>Fuzzy Logic: An Advanced Approach to Traffic Control. Learning and Analytics in Intelligent Systems</article-title>
          .
          <source>International Journal of Innovation in the Digital Economy</source>
          ,
          <year>2020</year>
          , vol.
          <volume>5</volume>
          , no.
          <issue>1</issue>
          , p.
          <fpage>10</fpage>
          . URL: https://dx.doi.org/10.4018/ijide.2014010103
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Qu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Abouheaf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Gueaieb</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Spinello</surname>
          </string-name>
          ,
          <article-title>"An Adaptive Fuzzy Reinforcement Learning Cooperative Approach for the Autonomous Control of Flock Systems"</article-title>
          .
          <source>ICRA</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>8927</fpage>
          -
          <lpage>8933</lpage>
          . URL: https://ieeexplore.ieee.org/document/9561204.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>V.</given-names>
            <surname>Mnih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kavukcuoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Silver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Graves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Antonoglou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wierstra</surname>
          </string-name>
          , &amp;
          <string-name>
            <given-names>M.</given-names>
            <surname>Riedmiller</surname>
          </string-name>
          .
          <article-title>Playing Atari with Deep Reinforcement Learning</article-title>
          ,
          <year>2013</year>
          . URL: https://doi.org/10.48550/arXiv.1312.5602.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Saha</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          , “
          <article-title>Performance analysis of deep Q-learning based agent for playing Atari 2600 games</article-title>
          ,” in
          <source>2019 IEEE 10th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON)</source>
          , Vancouver, Canada, Oct.
          <year>2019</year>
          , pp.
          <fpage>0429</fpage>
          -
          <lpage>0435</lpage>
          . URL: https://ieeexplore.ieee.org/document/8936153.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Babuschkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. M.</given-names>
            <surname>Czarnecki</surname>
          </string-name>
          et al.
          <article-title>Grandmaster level in StarCraft II using multi-agent reinforcement learning</article-title>
          .
          <source>Nature</source>
          <volume>575</volume>
          ,
          <fpage>350</fpage>
          -
          <lpage>354</lpage>
          (
          <year>2019</year>
          ). URL: https://www.nature.com/articles/s41586-019-1724-z.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kingma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ba</surname>
          </string-name>
          .
          <article-title>Adam: A method for stochastic optimization</article-title>
          .
          <source>3rd International Conference for Learning Representations</source>
          , San Diego,
          <year>2015</year>
          . URL: https://arxiv.org/abs/1412.6980.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>