An Approach to Optimizing CO2 Emissions in Traffic
Control via Reinforcement Learning
Olexander Ryzhanskyi1,†, Eduard Manziuk1, ∗,†, Olexander Barmak1,†, Iurii Krak2,3†,
Nebojsa Bacanin4†
1 Khmelnytskyi National University, 11, Instytuts’ka str., Khmelnytskyi, Ukraine
2 Taras Shevchenko National University of Kyiv, 60 Volodymyrska str., Ukraine
3 Glushkov Cybernetics Institute, 40 Glushkov ave., Kyiv, Ukraine
4 Singidunum University, 11000 Belgrade, Serbia


                 Abstract
                 Automotive transport plays a key role in ensuring economic development but is accompanied by
                 significant negative impacts on the environment, particularly in areas where vehicles are
                 concentrated. This article presents an approach that uses reinforcement learning and accounts for
                 traffic flow pressure to optimize the travel time of vehicles through road intersections with the aim
                 of reducing CO2 emissions. The proposed method is based on modern approaches to optimizing
                 traffic light operations, but with an emphasis on ecological aspects. Experimental verification on
                 the synthetic scenario SUMO GRID 4x4 demonstrates the efficiency of the developed algorithm.
                 Comparative analysis shows that it outperforms other algorithms, such as MaxPressure and IDQN:
                 it improves travel time and queue length by 33% and reduces CO2 emissions by 32-33%.
                 The obtained results lay the foundation for further refinement and implementation of the
                 proposed approach in real-world conditions.

                 Keywords
                 traffic signal control, reinforcement learning, reward modeling, pollutant emissions


IntelITSIS’2024: 5th International Workshop on Intelligent Information Technologies and Systems of Information
Security, March 28, 2024, Khmelnytskyi, Ukraine
∗ Corresponding author.
† These authors contributed equally.
Email: alex@eventcadence.com (O. Ryzhanskyi); eduard.em.km@gmail.com (E. Manziuk);
alexander.barmak@gmail.com (O. Barmak); yuri.krak@gmail.com (I. Krak); nbacanin@singidunum.ac.rs (N.
Bacanin)
ORCID: 0009-0000-4664-5195 (O. Ryzhanskyi); 0000-0002-7310-2126 (E. Manziuk); 0000-0003-0739-9678 (O. Barmak);
0000-0002-8043-0785 (I. Krak); 0000-0002-2062-924X (N. Bacanin)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction
Road transport plays an important role in ensuring economic growth and social development.
It is a key component of the transportation system due to its objective advantages,
which are reinforced by significant achievements in the transport infrastructure of the vast
majority of countries. Road transport is also widely used and is a key priority in economic
development. However, this is accompanied by significant pressure on the
environment, especially in places where vehicles are heavily concentrated, such as large cities
and transportation interchanges. Hence the problem of transport regulation.
Information technologies for traffic improvement are applied to a wide range of such
problems [1]. Accordingly, the development of systems for building climate-neutral
and smart cities is of great importance given the current challenges associated with
climate change and urban growth. Such systems are needed for the following
reasons:

   •   They help to achieve climate neutrality, which is a strategic objective under the
       European Green Deal [2]. Cities make a significant contribution to greenhouse gas
       emissions, and the implementation of systems aimed at optimizing resources and
       reducing environmental impact helps to solve this problem.
   •   Smart cities improve the quality of life for citizens. By optimizing traffic flows and
       reducing air and noise pollution, they contribute to the health and well-being of the
       population.
   •   Such systems help to reduce the urban ecological footprint and support sustainable
       development. They aim to reduce resource consumption and develop more efficient
       strategies for managing urban resources.
   •   The development of such systems promotes political coherence and citizen
       participation in decision-making. This is important for ensuring the effectiveness of
       strategies and achieving climate neutrality.
   •   Smart cities are being integrated into European and global strategies, contributing to
       the achievement of global climate goals and providing synergies with other initiatives.

    To summarize, intelligent systems are a key element of digital transformation and
innovation, enabling cities to use modern technologies more effectively to achieve climate
neutrality and support sustainable development.
    One approach to developing intelligent systems is to use reinforcement learning, which
can be applied to a similar class of tasks [3]. This approach to artificial intelligence allows
systems to learn from the data they receive and gain experience to make optimal decisions in
real time. One of the key challenges is the efficient management of urban resources and
infrastructure to ensure sustainability and efficiency. Reinforcement learning can analyze and
optimize the operation of traffic lights, transportation systems, and other aspects aimed at
reducing emissions and improving energy efficiency. Particularly important is the ability to
train automated transport management systems, which helps to improve traffic flow and
reduce traffic congestion. This has an impact on CO2 emissions and improves air quality in
cities, which in turn affects public health and overall quality of life.
    Thus, the paper proposes an approach that uses reinforcement learning to find the
optimal mode for vehicles passing through a traffic light-controlled crossroads according to
the criterion of reducing CO2 emissions.
    The main contributions of the research include:

   •   A new approach to traffic signal control at road intersections is proposed. It uses
       reinforcement learning and takes into account the environmental impact of traffic,
       in particular CO2 emissions.
   •   The MPLightCO2 algorithm is developed, which is an extension of the existing
       MPLight approach with additional consideration of CO2 emissions from vehicles
       queuing to enter and exit the crossroads. This makes it possible to optimize crossroads
       traffic modes in order to reduce environmental impact.
   •   It is proposed to take into account the "traffic flow pressure" metric to determine the
       efficiency of vehicle distribution in the crossroads network and improve throughput.
   •   Experimental verification and comparative analysis of the developed MPLightCO2
       algorithm with other approaches, such as MaxPressure, MPLight, and IDQN, were
       carried out on the synthetic test scenario SUMO GRID 4x4.
   •   The results showed that MPLightCO2 outperforms existing approaches in terms of
       travel time, average queue length, and CO2 emissions, demonstrating increased
       efficiency in both optimizing traffic flow and reducing its environmental impact:
       queue length was reduced by 75-76% and CO2 emissions by 32-33%.

   The article is structured as follows. In the Related Works section, we review current
approaches to solving similar problems and formulate the purpose of the paper. The Methods
and Materials section describes the crossroads control system, provides a formalization of its
elements, presents an approach using traffic pressure, describes the DQN agent, the
implementation of deep Q-learning, characterizes the SUMO GRID 4x4 synthetic test and
approaches to assessing the quality of the solutions obtained. The Results and Discussion
section analyzes the results of experimental testing on SUMO GRID 4x4, the quality indicators
of the models, and compares them with other algorithms. The Conclusions and Future Work
section summarizes the results of the study, outlines limitations and directions for further
work.

2. Related Works and Basic Concepts of Approximate Dynamic
   Programming
A review of recent publications on the topic of the study showed that the modern
reinforcement learning approach is actively used to solve such problems. Below is an
overview of these publications. Modern development trends in the field of artificial
intelligence are actively used to implement effective strategies for optimizing traffic flows in
cities. The main goal is to reduce environmental impact through the development and
application of various methods and technologies. Artificial intelligence plays a key role in this
context, helping to create intelligent systems that ensure efficient traffic management. In
particular, deep learning algorithms are used to develop smart traffic light control systems
aimed at dynamic adaptation to changes in traffic flow in real time [4][6]. This not only
minimizes stops and saves fuel, but also has a positive impact on emissions. It is important to
study approaches that would meet all the requirements of AI reliability [7][8][9]. Most of the
papers are aimed at optimizing traffic flows to increase the capacity of transportation routes.
    Another approach is to predict and manage transportation demand: machine learning
methods are used to accurately analyze passenger flow data and predict its changes at
different times of the day [10][12]. This allows optimizing the allocation of resources,
reducing the number of empty trips and thus contributing to the reduction of CO2
emissions.
    The use of traffic monitoring and analysis systems based on data from sensors and cameras
allows artificial intelligence algorithms to identify patterns and predict possible traffic
congestion [13]. This opens up opportunities for taking effective measures to avoid
congestion and, therefore, reduce the negative impact on the environment. The use of route
optimization algorithms is also important in the context of reducing the environmental impact
of transport [14] [15]. These algorithms take into account various aspects, such as minimizing
the use of traffic lights and identifying environmentally friendly routes.
    The introduction of electric vehicles and autonomous cars is a key step in ensuring
environmentally sustainable transportation. Research on the safety [16] and reliability of
communication equipment is also important [17]. Artificial intelligence is used to optimize
their movement and develop charging station infrastructure [18] [19]. An integrated approach
to optimizing traffic flows in cities allows achieving traffic efficiency, reducing emissions and
promoting sustainable urban transport. Such approaches can improve the conditions of
movement of vehicles along the roads, reducing their delay, improving speed conditions,
which ultimately has a positive impact on transport emissions, improving the environment.
Methods for constructing neural networks are being developed and refined [20][22]. In
pursuit of this goal, research is being conducted using modern reinforcement learning
algorithms to optimize the performance of signal controllers in real time [23] [24]. In this
approach, the state of the crossroads is determined by the parameters of vehicles (lane, speed,
waiting time, queue position) and the current signal (traffic permission). The main task of
reinforcement learning, which is applied in the form of an agent, is to optimize a policy that
maps states to signals. This approach has shown a potential reduction in vehicle delays
of up to 73% compared to a fixed response time [25]. A multi-agent reinforcement learning
method known as cooperative double Q-learning is used to handle the complexity of traffic
signal synchronization in large-scale traffic control systems [26]. It uses independent double
Q-learning and an upper confidence bound policy to avoid the overestimation problems
that can occur in traditional algorithms. A new reward allocation mechanism and a local
state sharing method are introduced to ensure stable and robust learning. Experiments on
traffic flow scenarios show that the proposed system outperforms state-of-the-art
decentralized algorithms on various traffic metrics.
    Current research is overwhelmingly focused on the use of intelligent systems to optimize
traffic flows and reduce congestion. Strategies should also actively seek to improve the
environmental performance of transportation. Research focuses on traffic optimization rather
than on the full range of environmental aspects and practical measures to reduce the
environmental impact of transport. To achieve environmentally sustainable transport, it is
important to consider not only traffic efficiency, but also improvements in air quality and
overall environmental sustainability.
    Thus, the aim of the study is to develop and test the effectiveness of an approach that uses
reinforcement learning and traffic pressure to optimize vehicle travel times through road
intersections. Particular emphasis is placed on traffic signal control to reduce CO2 emissions.
The research includes validation of the proposed approach on the synthetic GRID 4x4 test
scenario and further analysis of the results.

3. Methods and Materials
To systematize the traffic flow at a crossroads, we define typical scenarios, which we will call
"states". At a signalized crossroads, there are incoming and outgoing roads, each of which may
include one or more lanes. For each crossroads, we define a set of states St, where each
specific state st ∈ St is associated with a specific direction of traffic.
    States are considered conflicting if they cannot be activated simultaneously because
their traffic streams cross. At each step, the signal controller is responsible for establishing a
specific combination of non-conflicting states in order to optimize the long-term objective
function. For reinforcement learning-based controllers, the signalized crossroads environment
can be modeled using the following description.
    The state space is formed by the mapping of incoming traffic and active states. It is
particularly important to consider the differences in research approaches, where some take
into account high-resolution traffic detection technologies such as real-time observations of
vehicle counts, waiting times, and average speeds, while others are limited to less informative
data such as visibility of queue lengths or waiting times for the first vehicle. In terms of the
envisioned sensing radius, some take a broad approach that covers all entrance roads, but a
more realistic approach is to use a fixed sensing radius rs. This may depend on the
technological capability of the detection to provide reliable results, and take into account local
features such as terrain, visibility limitations, or the presence of obstacles such as buildings or
trees. In the context of signal control, at each time step, the controller determines a set
of non-conflicting states that are allowed to move, which is indicated by the green light. If
the selected states differ from the active states, a mandatory yellow
state is automatically entered for a fixed period of time. The assignment of yellow states is a
constraint that the environment imposes on the control sequence, not part of the action space.

[Figure 1: table of state diagrams; rows: road number at the crossroads (1 to 4); columns: Lane 2, Lane 1]

Figure 1: Examples of typical states when crossing a crossroads.
    The transition function is determined by the development of traffic after the signal is
activated. These dynamics can be modeled according to a specific traffic model in the simulated
environment or taken from real traffic progress data as part of a real-world implementation.
The reward function typically uses the reduction in queue length as the sum of the respective
scores of all incoming lanes and is expressed as an integral reward. This is effective in
reducing congestion, but does not always normalize the benefits of signal optimization over
the travel time of a particular route. Therefore, other reward functions are used, such as total
delays, crossroads delays, crossroads waiting time, traffic volume, and others.
    However, for proper systematization of traffic at the crossroads, it is necessary to take into
account the action space. At each time step, the controller selects a set of non-conflicting
states that receive permission to move, which is indicated by turning on the green light. If the
selected states differ from the current active states, a mandatory yellow state is automatically
entered for a predefined period of time. The assignment of yellow states is a constraint
imposed on the sequence of environmental control, not part of the action space.
    The consistency of the defined action space and the choice of optimal states determines the
efficiency and safety of traffic at the crossroads. Even taking into account the different
approaches to traffic detection, it is important to consider that parameters such as the
prescribed sensing radius can affect the accuracy and reliability of the data obtained. Let's
formalize the elements of the control system at the crossroads. To do this, let's define the main
parameters taking into account the states.
    Let us denote the set of states as $\mathit{St} = \{st_1, st_2, \ldots, st_n\}$, where $st_i$ is the specific state
associated with a direction of traffic and $n \in \mathbb{N}$ is the total number of states at the
crossroads, determined by the number of specific directions of traffic that are selected for
modeling or need to be taken into account.
    The number of states can be determined by a relation that takes into account the number of
possible options for each direction of traffic on each road. If there are $R_d$ input roads, each of
which has $\mathit{dr}_i$ possible directions of movement, then the total number of states $N$ is
calculated by the formula:

$$N = \prod_{i=1}^{R_d} \mathit{dr}_i. \tag{1}$$

    This formula is the product of the numbers of possible directions of movement on
each input road. Thus, the total number of unique states in the state space is $N$.
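    As a small illustration of formula (1), the following Python sketch multiplies the numbers of possible directions over all input roads. The function name `count_states` and the four-arm example with three directions per road are assumptions made for illustration only; they are not taken from the experiments.

```python
from math import prod


def count_states(directions_per_road):
    """Formula (1): N is the product of dr_i over all input roads."""
    if any(d < 1 for d in directions_per_road):
        raise ValueError("each input road needs at least one direction")
    return prod(directions_per_road)


# Hypothetical four-arm crossroads: left, straight, right on every input road.
example_directions = [3, 3, 3, 3]
print(count_states(example_directions))  # 81
```

    The product grows quickly with the number of roads, which is one reason why the set of states actually modeled is usually restricted to the directions that matter for control.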
    Let us define the action domain $\mathit{Act} = \{act_1, act_2, \ldots, act_m\}$, where $act_j$ is the specific
action of the controller at each time step and $m \in M$ is the total number of possible controller
actions. The transition function is defined as $\mathit{Trs}: \mathit{St} \times \mathit{Act} \to \mathit{St}$, where $\mathit{Trs}(st_i, act_j)$
represents the new state to which the system moves by performing action $act_j$ in state
$st_i$.
    For the reward function, we establish a relation $\mathit{Rwd}: \mathit{St} \times \mathit{Act} \to \mathbb{R}$ that determines the
amount of reward for choosing a specific action in a specific state.
    A restriction applies if the selected states differ from the active states: a mandatory
yellow state is automatically entered for a certain period of time.
    Yellow state restrictions can be represented by the following relations.
    We denote the sets:
    $\mathit{St}_{actv}$ is the set of active states;
    $\mathit{St}_{slcd}$ is the set of selected states;
    $\mathit{St}_{ylw}$ is the set of yellow states to be entered as mandatory for a certain period of time.
    Then the constraint can be expressed by the following formula:

$$\mathit{St}_{ylw} = (\mathit{St}_{slcd} \setminus \mathit{St}_{actv}) \cup (\mathit{St}_{actv} \cap \mathit{St}_{slcd}). \tag{2}$$

    This formula defines the set of yellow states as the union of those selected states that are
not yet active, $\mathit{St}_{slcd} \setminus \mathit{St}_{actv}$, and the intersection of the selected and active states,
$\mathit{St}_{actv} \cap \mathit{St}_{slcd}$.
    A mandatory yellow state can be introduced with an additional parameter, for example
$T_{mylw}$, which defines the duration of the mandatory yellow state. In this way, it is possible to
define the time interval during which the yellow state will be active after the selected states
have changed. For example:

$$\mathit{St}_{ylw}(t) = \begin{cases} \mathit{St}_{slcd}(t) \setminus \mathit{St}_{actv}(t), & t \le T_{mylw}, \\ \varnothing, & t > T_{mylw}, \end{cases} \tag{3}$$

where $t$ is the time;
      $\mathit{St}_{slcd}(t)$ and $\mathit{St}_{actv}(t)$ represent the sets of selected and active states at a given time $t$.
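    The time-limited yellow constraint of formula (3) can be illustrated with a short Python sketch. It assumes that $t$ is measured from the moment the selection changed; the function name `yellow_states`, the state labels, and the 3-second duration are hypothetical and serve only as an example.

```python
def yellow_states(selected, active, t, t_mylw):
    """Formula (3): selected-but-not-active states while t <= T_mylw, else empty."""
    if t <= t_mylw:
        return set(selected) - set(active)
    return set()


# Hypothetical state labels and a hypothetical 3-second yellow interval.
active = {"north_south_through"}
selected = {"north_south_through", "north_south_left"}

during = yellow_states(selected, active, t=2, t_mylw=3)
after = yellow_states(selected, active, t=4, t_mylw=3)
print(during)  # {'north_south_left'}
print(after)   # set()
```

    Keeping this rule outside the agent's action set matches the statement above that yellow states are a constraint on the control sequence rather than an action.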
    Accordingly, the crossroads control system will look like this:

$$\begin{cases} \mathit{St}_{t+1} = \mathit{Trs}(\mathit{St}_t, \mathit{Act}_t); \\ \mathit{Rwd}(\mathit{St}_t, \mathit{Act}_t); \quad \mathit{Act}_t = \arg\max_{act \in \mathit{Act}} \mathit{Rwd}(\mathit{St}_t, act). \end{cases} \tag{4}$$

    The dependencies can be adapted and extended to fit the specific details of the crossroads
management system and to take into account various conditions and constraints.
    The required system parameters can be represented as follows:

$$\mathit{St}_{t+1} = \mathit{Trs}(\mathit{St}_t, \mathit{Act}_t) + \alpha \cdot \bigl(\mathit{Rwd}(\mathit{St}_t, \mathit{Act}_t) - \beta \cdot |\mathit{Sct}_t \cap \mathit{Act}_t|\bigr), \tag{5}$$

where: $\mathit{St}_{t+1}$ is the new state of the system at the next time step;
       $\mathit{Trs}(\mathit{St}_t, \mathit{Act}_t)$ is the transition function, which determines how the system moves from
state $\mathit{St}_t$ to a new state under the influence of the selected actions $\mathit{Act}_t$;
       $\alpha$ is the coefficient that takes into account the impact of the reward on the selected
actions;
       $\mathit{Rwd}(\mathit{St}_t, \mathit{Act}_t)$ is the reward function, which determines how effective the choice of a
specific set of actions is in each state;
       $\beta$ is the coefficient that takes into account the limitations of yellow states and the
number of conflict states;
       $|\mathit{Sct}_t \cap \mathit{Act}_t|$ is the number of conflict states in the current state and selected actions.
    This formula takes into account the dynamics of the system, the impact of the selected
actions on the state, the reward for these actions, and the limitations on the number of
conflict states.
    It is also possible to take into account the effect of pressure on the movement of vehicles
and possible changes in the system over time $t$. Taking these parameters into account, the
description of the system state takes the following form:

$$\mathit{St}_{t+1} = \mathit{Trs}(\mathit{St}_t, \mathit{Act}_t) + \alpha \cdot \bigl[\mathit{Rwd}(\mathit{St}_t, \mathit{Act}_t) - \beta \cdot \delta_t \cdot |\mathit{Sct}_t \cap \mathit{Act}_t| - \gamma \cdot \mathit{Prs}(\mathit{St}_t, \mathit{Act}_t)\bigr], \tag{6}$$

where $\delta_t$ is the dynamic coefficient that takes into account changes in the system over time;
       $\gamma$ is the coefficient that takes into account the effect of pressure on the movement of
transport;
       $\mathit{Prs}(\mathit{St}_t, \mathit{Act}_t)$ is the function of pressure on the movement of vehicles, which may
include factors such as traffic density, speed, and other factors.
    Let's take a closer look at the traffic pressure function. One possible approach is to take
into account traffic density and vehicle speed. The pressure function $\mathit{Prs}(\mathit{St}_t, \mathit{Act}_t)$ can be
expressed as follows:

$$\mathit{Prs}(\mathit{St}_t, \mathit{Act}_t) = \eta \cdot \mathit{Dns}_t(\mathit{St}_t, \mathit{Act}_t) \cdot \mathit{Vls}_t(\mathit{St}_t, \mathit{Act}_t), \tag{7}$$

where: $\eta$ is the coefficient that determines the weight of the impact of traffic density and
vehicle speed on pressure;
       $\mathit{Dns}_t(\mathit{St}_t, \mathit{Act}_t)$ is the traffic density, which can be measured by the number of
vehicles in a certain state and with selected actions;
       $\mathit{Vls}_t(\mathit{St}_t, \mathit{Act}_t)$ is the average speed of vehicles in a certain state and with selected
actions.
    The pressure function can be customized according to the specific characteristics of the
crossroads and the optimization goal. This formula allows taking into account both traffic
density and vehicle speed as factors that affect the pressure on the movement of vehicles at
the crossroads.
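    A minimal Python sketch may help to see how formula (7) feeds into the bracketed term of formula (6). The function names `pressure` and `penalized_reward`, the units, and all numeric values below are assumptions chosen for illustration; the coefficients are not tuned values from the experiments.

```python
def pressure(density, mean_speed, eta=1.0):
    """Formula (7): Prs = eta * Dns_t * Vls_t."""
    if density < 0 or mean_speed < 0:
        raise ValueError("density and mean speed must be non-negative")
    return eta * density * mean_speed


def penalized_reward(reward, n_conflicts, prs, beta, gamma, delta_t=1.0):
    """Bracketed term of formula (6): Rwd - beta*delta_t*|conflicts| - gamma*Prs."""
    return reward - beta * delta_t * n_conflicts - gamma * prs


# Hypothetical inputs: 0.05 vehicles per metre moving at 8 m/s on average.
prs = pressure(density=0.05, mean_speed=8.0, eta=0.5)
value = penalized_reward(reward=10.0, n_conflicts=1, prs=prs, beta=2.0, gamma=5.0)
print(prs, value)
```

    With these illustrative numbers the pressure is about 0.2 and the penalized reward about 7.0: one conflict state costs 2.0 and the pressure term costs 1.0.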
    Traffic density $\mathit{Dns}_t(\mathit{St}_t, \mathit{Act}_t)$ can be determined in a variety of ways, depending on the
data availability and specifications of the crossroads control system.
    Some solutions for determining traffic density:

   •   Installing counters on the entrance roads to count the number of vehicles entering the
       crossroads. Traffic density is defined as the number of vehicles per unit of time on
       each input road.
   •   Using modern transportation sensor technologies, such as cameras or sensors, to
       automatically determine traffic density. Video stream or sensor data is analyzed to
       determine the number of vehicles and their movement on the roads.
   •   Using information from transportation agents or transportation monitoring systems
       that can provide traffic density data.
   •   Traffic modeling, which uses mathematical models to simulate traffic flows and
       determine traffic density based on parameters such as speed, number of lanes, and
       others.

   Depending on the conditions and availability of resources, it is possible to choose one or a
combination of these methods to determine the traffic density at a particular time and the
state of the crossroads.
    The general approach to determining the traffic density is as follows:

$$\mathit{Dns}_t(\mathit{St}_t, \mathit{Act}_t) = \frac{\mathit{Vlm}_{total}}{\mathit{Lng}_{total}}, \tag{8}$$

where: $\mathit{Vlm}_{total}$ is the volume of vehicles entering the crossroads per unit of
time;
       $\mathit{Lng}_{total}$ is the total length of all entrance roads leading to the
crossroads.
    The total volume of vehicles $\mathit{Vlm}_{total}$ is determined by taking into account the number of
traffic flows and their characteristics on each input road. Let $\mathit{Flv}_{ij}$ be the traffic flow from
direction $i$ to direction $j$. Then the total vehicle volume is defined as:

$$\mathit{Vlm}_{total} = \sum_i \sum_j \mathit{Flv}_{ij}. \tag{9}$$

    This double sum represents the total number of vehicles entering the crossroads through all
possible combinations of input and output directions. Each term represents the number of
vehicles traveling from direction $i$ to direction $j$.
    The total length of all input roads $\mathit{Lng}_{total}$ is determined taking into account the width of
each road, since it may differ between roads. Let $\mathit{Lng}_i$ be the length and $\mathit{Wdt}_i$ the width of
input road $i$. Then the total length is determined by the formula:

$$\mathit{Lng}_{total} = \sum_i (\mathit{Lng}_i \cdot \mathit{Wdt}_i). \tag{10}$$
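    Formulas (8)-(10) can be combined in a few lines of Python. The sketch below is only an illustration: the function names, the flow matrix, and the road dimensions are hypothetical, and the width is assumed here to be expressed as a number of lanes.

```python
def total_volume(flows):
    """Formula (9): sum of Flv_ij over all input directions i and output directions j."""
    return sum(sum(row) for row in flows)


def total_length(lengths, widths):
    """Formula (10): sum of Lng_i * Wdt_i over all input roads."""
    if len(lengths) != len(widths):
        raise ValueError("lengths and widths must describe the same roads")
    return sum(length * width for length, width in zip(lengths, widths))


def traffic_density(flows, lengths, widths):
    """Formula (8): Dns_t = Vlm_total / Lng_total."""
    lng_total = total_length(lengths, widths)
    if lng_total == 0:
        raise ValueError("total road length must be positive")
    return total_volume(flows) / lng_total


# Hypothetical three-arm crossroads: flows[i][j] is vehicles per unit time from i to j.
flows = [[0, 4, 2], [3, 0, 1], [5, 2, 0]]
lengths = [200.0, 150.0, 250.0]  # metres (assumed unit)
widths = [2, 1, 2]               # lanes (assumed interpretation of Wdt_i)

print(total_volume(flows))                 # 17
print(total_length(lengths, widths))       # 1050.0
print(traffic_density(flows, lengths, widths))
```

    Weighting each length by its width means that a two-lane approach contributes twice the capacity of a single-lane approach of the same length, so the same volume yields a lower density.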
    To account for the factors that determine the volume of vehicles, we write
down the total volume of vehicles $\mathit{Vlm}_{total}$ taking into account the
speed of vehicles $\mathit{Vls}_{ij}$ on each input road between crossroads $i$ and $j$. We also introduce the
travel time factor $\mathit{Time}_{ij}$, which determines the time required to travel the distance between
roads $i$ and $j$, and take into account the traffic density between the roads in terms of the
number of vehicles per unit time per kilometer, $\mathit{Dns\_int}_{ij}$.
    Then the formula for determining the total volume of vehicles will look like this:

$$\mathit{Vlm}_{total} = \sum_i \sum_j (\mathit{Dns\_int}_{ij} \cdot \mathit{Lng}_i \cdot \mathit{Wdt}_i \cdot \mathit{Vls}_{ij} \cdot \mathit{Time}_{ij}). \tag{11}$$

    Thus, we take into account not only the length of each road, but also its width, which is an
important parameter in determining the volume of vehicles and traffic density at the
crossroads.

3.1. Pressure-based Coordination
The concept of flow pressure is central to pressure-based optimization of traffic flow.
The main focus is on improving the efficiency of the traffic flow
in general. The pressure at a crossroads is defined as the difference between the length of the
queues of vehicles approaching the crossroads and those leaving it. This reflects an imbalance
in the distribution of vehicles.
   The main task is to minimize this pressure in order to achieve equilibrium in the
distribution of vehicles along the network of directions and, as a result, increase the network
capacity.
   The maximum pressure control strategy aims to optimize stability by not only stabilizing
traffic but also maximizing flow using local data from each crossroads.
   The main aspect of this strategy is to optimize traffic signal performance by reducing the
pressure in each state. In real-world maximum pressure control, a greedy approach is used to
achieve a locally optimal decision.
   Algorithm 1: Controlling the maximum pressure for each crossroads.

   1.   Pressure initialization and estimation. For each state at the crossroads, the pressure
        $prs(st_i)$ is calculated, taking into account various aspects of the traffic flow, such as
        density, speed, and vehicle interaction.
   2.   Weight determination of the next state. Given the need to balance the various aspects
        of traffic, the next state $st_{i+1}$ is determined as the argument that maximizes pressure
        reduction while taking into account environmental considerations.
   3.   Adaptive pressure control. Taking into account the dynamics of the movement, the
        pressure calculation parameters are adaptively changed to ensure an effective
        response to changing road conditions.
   4.   Synchronization with other road intersections. The state selection is optimized taking
        into account factors shared and interacting with other road intersections, to achieve
        harmonious traffic in the system.
   5.   Additional function of emergency states. Additional functions, such as emergency
        management or improved mobility of road users, are tested to ensure that a wide
        range of circumstances are taken into account.
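
   Steps 1 and 2 of Algorithm 1 can be illustrated with a minimal Python sketch of the greedy selection. It assumes the queue-based definition given above (queue on the approaching lane minus queue on the leaving lane) and ignores steps 3-5 as well as environmental terms; the function names, lane labels, and queue values are hypothetical.

```python
def state_pressure(movements, queue):
    """Pressure of one state: sum over its movements of (incoming queue - outgoing queue)."""
    return sum(queue.get(src, 0) - queue.get(dst, 0) for src, dst in movements)


def select_state(states, queue):
    """Greedy step: activate the state whose movements currently carry the most pressure."""
    return max(states, key=lambda name: state_pressure(states[name], queue))


# Hypothetical crossroads with two non-conflicting groups of movements.
states = {
    "NS": [("n_in", "s_out"), ("s_in", "n_out")],
    "EW": [("e_in", "w_out"), ("w_in", "e_out")],
}
queue = {"n_in": 6, "s_in": 4, "e_in": 2, "w_in": 3,
         "s_out": 1, "n_out": 0, "w_out": 2, "e_out": 1}

print(state_pressure(states["NS"], queue))  # 9
print(state_pressure(states["EW"], queue))  # 2
print(select_state(states, queue))          # NS
```

   Serving the state with the largest pressure releases the most unbalanced queues first, which is the locally optimal decision that the greedy strategy relies on.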

3.2. DQN Agent
Agents in the reinforcement learning method seek to maximize their overall reward within
the objectives of the maximum pressure control method. This increase in reward is
proportional to the overall network throughput, subject to certain constraints.
   Each agent is constrained to a certain subset of the overall system state. For example, for a
typical crossroads that manages traffic flows, the agent's observation covers the active state
and the pressure associated with the flows. In the case of fewer flows, the observation vector
may contain zeros to maintain consistency.
   The agent selects the state at any given time, determining the traffic light configuration.
This approach allows for greater adaptability by allowing the agent to choose the optimal
state to activate.
   The reward for the agent is determined by the reduced pressure at the crossroads. This
pressure takes into account the difference in CO2 emissions from vehicles waiting to enter and
exit the crossroads.
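
   One possible reading of this reward is sketched below in Python. It assumes that the pressure is the difference between the summed CO2 emissions on the incoming and outgoing lanes and that the reward is the negative absolute value of this difference, as is common in pressure-based controllers; both choices, as well as the function names and the sample values, are assumptions for illustration rather than the exact formulation of MPLightCO2.

```python
def co2_pressure(incoming_co2, outgoing_co2):
    """Assumed CO2 pressure: emissions of queued incoming vehicles minus outgoing ones."""
    return sum(incoming_co2) - sum(outgoing_co2)


def co2_reward(incoming_co2, outgoing_co2):
    """Assumed reward: the smaller the CO2 imbalance, the larger (less negative) the reward."""
    return -abs(co2_pressure(incoming_co2, outgoing_co2))


# Hypothetical per-lane CO2 emission rates (for example, mg/s) at one crossroads.
incoming = [120.0, 80.0, 40.0]
outgoing = [30.0, 10.0]

print(co2_pressure(incoming, outgoing))  # 200.0
print(co2_reward(incoming, outgoing))    # -200.0
```

   Under this assumption, an agent that clears heavily emitting queues moves the reward toward zero, so maximizing the reward and reducing the CO2 pressure coincide.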

3.3. Implementation of Deep Q-learning
Based on the chosen basic model, we apply the DQN method to solve problems related to
various scenarios of traffic light control at road intersections. The DQN implementation
takes as input the state characteristics of different traffic flows and calculates the Q-value for
each possible action, i.e., traffic state, based on the following Bellman equation:

$$Q(st_t, act_t) = \mathit{Rwd}(st_t, act_t) + \lambda \max_{act_{t+1}} Q(st_{t+1}, act_{t+1}), \tag{12}$$

where: $Q(st_t, act_t)$ is the Q-value for state $st_t$ and action $act_t$;
       $\mathit{Rwd}(st_t, act_t)$ is the reward for performing action $act_t$ in state $st_t$;
       $\lambda$ is the discount factor;
       $\max_{act_{t+1}} Q(st_{t+1}, act_{t+1})$ is the maximum Q-value for the next state $st_{t+1}$ over all
possible actions $act_{t+1}$.
   This equation estimates the Q-value for the current state and action using information
about future rewards and maximum Q-values.
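
   The right-hand side of equation (12) is the target toward which a DQN pushes its current estimate. The Python sketch below computes this target and the resulting temporal-difference error for a single transition; the function names and the numeric values are hypothetical, and the neural network that would normally produce the Q-values is replaced by plain lists.

```python
def bellman_target(reward, next_q_values, discount):
    """Right-hand side of (12): reward + lambda * max over next actions of Q."""
    return reward + discount * max(next_q_values)


def td_error(current_q, reward, next_q_values, discount):
    """Gap between the target and the current estimate Q(st_t, act_t)."""
    return bellman_target(reward, next_q_values, discount) - current_q


# Hypothetical transition: reward -4.0 and three candidate next states.
next_q = [-10.0, -6.0, -8.0]
target = bellman_target(reward=-4.0, next_q_values=next_q, discount=0.9)
error = td_error(current_q=-9.0, reward=-4.0, next_q_values=next_q, discount=0.9)
print(target, error)
```

   With these illustrative numbers the target is about -9.4 and the error about -0.4, so training would lower the current estimate slightly.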

3.4. Synthetic test SUMO scenario GRID 4x4
The GRID 4x4 scenario in SUMO (Simulation of Urban MObility) is a synthetic test case used
to simulate vehicle movements at road intersections in an urban environment. This scenario is
used to test and evaluate traffic signal control algorithms, road safety, and transportation
efficiency.
    In the GRID 4x4 scenario, the road network forms a 4x4 grid, creating a network
of 16 road intersections. The large number of road intersections allows studying the interaction
of traffic flows, conflicts, and optimal management strategies.
    The main characteristics of the GRID 4x4 test scenario include:

   •   a 4x4 layout of road intersections;
   •   the number of road intersections is 16;
   •   a large number of road intersections to study the interaction of traffic flows.

   Such a test scenario allows researching and testing traffic control algorithms at road
intersections in an urban environment.
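   The topology of such a grid can be sketched in a few lines. The representation below, a mapping from each intersection to its directly connected neighbours, is an illustrative assumption and is not the SUMO network file itself; it also ignores any boundary roads through which vehicles enter or leave the network.

```python
from typing import Dict, List, Tuple

Node = Tuple[int, int]


def grid_intersections(rows: int = 4, cols: int = 4) -> Dict[Node, List[Node]]:
    """Map each intersection (row, col) to its neighbouring intersections."""
    network: Dict[Node, List[Node]] = {}
    for r in range(rows):
        for c in range(cols):
            candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            # Keep only neighbours that lie inside the grid.
            network[(r, c)] = [(nr, nc) for nr, nc in candidates
                               if 0 <= nr < rows and 0 <= nc < cols]
    return network


network = grid_intersections()
# Each internal road segment is counted once from each of its two ends.
internal_links = sum(len(neighbours) for neighbours in network.values()) // 2
```

   In a multi-agent setting, one agent would be attached to each key of this mapping.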

3.5. Evaluation of the quality of solutions
To determine the environmental impact of transport, in particular CO2 emissions, we use the
developed models with a traffic simulator. This makes it possible to model traffic in an urban
environment and take into account its environmental impact.
   To determine CO2 emissions in the simulation, we perform the following steps:

   •   Identify the vehicle types, such as cars, trucks, and buses. Each type may have its
       own characteristics in terms of fuel consumption and CO2 emissions.
   •   Specify, for each type of vehicle, characteristics such as average fuel
       consumption and the CO2 emission factor per unit of fuel consumed.
   •   Simulate the movement of vehicles in an urban environment, recording their routes,
       speeds, and fuel consumption while driving.
   •   Calculate CO2 emissions from the recorded data using the specified vehicle
       characteristics.
   The calculation of CO2 emissions is based on fuel consumption and vehicle-specific
CO2 emission data. To determine the CO2 emissions, we define the types of vehicles and
specify their characteristics, such as average fuel consumption and the CO2 emission factor per
unit of fuel consumed. During the simulation, we determine the fuel consumption and the
CO2 emissions of each vehicle.
   The fuel consumption is defined as follows:
   $$\mathit{FlCn} = \sum_{i=1}^{N} \mathit{Fl}_i. \tag{13}$$
   Each vehicle belongs to a certain category or type $\mathit{TrT}$ that consumes a certain amount of
fuel $\mathit{Fl}$.
   CO2 emissions:
   $$\mathit{Ems} = \sum_{i=1}^{N} \mathit{Fl}_i \times \mathit{FlU}_{\mathrm{CO_2}}. \tag{14}$$
   Emissions are thus determined based on the fuel $\mathit{Fl}_i$ consumed by each vehicle unit and
$\mathit{FlU}_{\mathrm{CO_2}}$, the CO2 emission factor per unit of fuel consumed.
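   The two sums can be illustrated with the sketch below. It assumes that the emission factor depends on the vehicle type, in line with the per-type characteristics described above; the class, the function names, the units, and the factor values are placeholders chosen for illustration and are not taken from the experiments.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Vehicle:
    vehicle_type: str     # category or type of the vehicle, e.g. "car" or "bus"
    fuel_consumed: float  # fuel used during the simulation (litres assumed)


def total_fuel(vehicles: Iterable[Vehicle]) -> float:
    """Sum of the fuel consumed by all vehicles, as in Eq. (13)."""
    return sum(v.fuel_consumed for v in vehicles)


def total_co2(vehicles: Iterable[Vehicle], co2_per_fuel_unit: Dict[str, float]) -> float:
    """Sum of fuel consumed times the CO2 factor per unit of fuel, as in Eq. (14)."""
    return sum(v.fuel_consumed * co2_per_fuel_unit[v.vehicle_type] for v in vehicles)


# Placeholder data: factors in kg of CO2 per litre are illustrative only.
factors = {"car": 2.3, "bus": 2.7}
vehicles = [Vehicle("car", 1.0), Vehicle("car", 0.5), Vehicle("bus", 2.0)]
fuel = total_fuel(vehicles)
emissions = total_co2(vehicles, factors)
```

   With a single shared emission factor, the dictionary lookup reduces to a constant and the second function matches the formula term by term.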

4. Results and Discussion
To evaluate the effectiveness of our approach, we chose the SUMO GRID 4x4 scenario, in
which traffic flows through a network consisting of a 4x4 grid of road intersections. Each
crossroads has the same settings and parameters, which makes the scenario well suited to
testing the effectiveness of our approach.
   Throughout the experiments, we analyze various aspects such as the environmental impact
of transportation, fuel consumption, and traffic signal efficiency. These aspects are
determined by the scenario parameters and the performance of our approach to optimizing
traffic at road intersections.


Figure 2: Scheme of the system validation scenario.
   To test the performance of the developed algorithm, we used the 4 × 4 Grid, a traffic
scenario similar to the synthetic 4 × 4 symmetric network shown in Figure 1. We carried out
a comparative analysis of four algorithms:

   •   Maximum pressure control in which a combination of states with maximum joint
       pressure is enabled.
   •   The MPLight algorithm, which uses pressure-based reinforcement learning for traffic
       light control to optimize traffic flow [5].
   •   Independent Deep Q-Network (IDQN), i.e. independent DQN agents. A separate DQN
       agent is used for each intersection, and every agent uses the same convolutional
       neural network architecture to aggregate information from different lanes. The
       hyperparameters are left at the defaults of the Preferred RL library, except for the
       target network update interval, which was adapted to the environment.
   •   MPLightCO2, an extension of MPLight that takes into account the environmental
       impact, specifically CO2 emissions from vehicles queuing to enter and exit the
       crossroads.

    The algorithms were evaluated and compared across various environmental metrics,
providing conclusions on the effectiveness and sustainability of their implementation. The
diagrams below show the dynamics of queue length changes according to different numbers
of training episodes.


Figure 3: Dynamics of queue length change according to different number of training
episodes.
Figure 4: Average waiting time for a vehicle to cross the crossroads.


Figure 5: Average route travel time for all vehicles.
    A comparative analysis of the impact of different traffic signal control algorithms on traffic
efficiency and the environment yielded the following results. The MPLight and
MPLightCO2 methods proved to be the most effective, improving travel time and waiting time
by about 33% compared to the MaxPressure algorithm. IDQN also showed an improvement, but
a less significant one, improving travel time and waiting time by about 34%. MPLight and
MPLightCO2 reduced CO2 emissions by about 32-33%, making them more
environmentally friendly than MaxPressure and IDQN.


Figure 6: CO2 emissions of vehicles crossing the crossroads.
Table 1
Quality indicators of models
  Model          Average duration     Average queue        CO2, mg/s
                 of travel, s         length, vehicles
  MPLight        161.4                0.57                 461908
  MaxPressure    161.2                0.61                 457459
  MPLightCO2     151.9                0.47                 423419

   Based on the results for travel time, MPLightCO2 performed the best, with a shorter
average travel time compared to the other models. The average queue length (in number of
vehicles) for MPLightCO2 is also the shortest, indicating more efficient traffic management and
reduced congestion. MPLightCO2 showed the lowest CO2 emissions of all the models,
indicating its greater environmental efficiency.
   MPLightCO2 performs better in terms of travel time, average queue length, and CO2
emissions than both MPLight and MaxPressure. Compared to the baseline MaxPressure
model, the Table 1 values for MPLightCO2 correspond to an improvement in travel time of
5.77%, in average queue length of approximately 22.95%, and in CO2 emissions of
approximately 7.44%.
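   The relative improvement over a baseline can be recomputed directly from the values in Table 1. The short sketch below does this for the MaxPressure baseline; the function name and the data layout are illustrative choices.

```python
def relative_improvement(baseline: float, value: float) -> float:
    """Percentage by which `value` is lower than `baseline`."""
    return (baseline - value) / baseline * 100.0


# Values from Table 1: (travel time in s, average queue length, CO2 in mg/s).
table_1 = {
    "MaxPressure": (161.2, 0.61, 457459),
    "MPLight": (161.4, 0.57, 461908),
    "MPLightCO2": (151.9, 0.47, 423419),
}

baseline = table_1["MaxPressure"]
for model in ("MPLight", "MPLightCO2"):
    gains = [relative_improvement(b, v) for b, v in zip(baseline, table_1[model])]
    print(model, [f"{g:.2f}%" for g in gains])
```

   A negative percentage means that the model is worse than the baseline on that metric.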

Table 2
Results of traffic modeling
  Agent          Metric              First value     Last value
  IDQN           Duration, s         242.4           146.9
  MaxPressure    Duration, s         160.4           161.2
  MPLight        Duration, s         241.0           161.4
  MPLightCO2     Duration, s         242.2           151.9
  IDQN           Waiting time, s     100.8           12.4
  MaxPressure    Waiting time, s     23.2            23.2
  MPLight        Waiting time, s     99.1            21.9
  MPLightCO2     Waiting time, s     101.2           17.2
  IDQN           CO2, mg/s           674575.8        408332.7
  MaxPressure    CO2, mg/s           455778.6        457459.8
  MPLight        CO2, mg/s           671710.8        461908.3
  MPLightCO2     CO2, mg/s           675505.5        423419.3
  IDQN           Queue length        2.53            0.34
  MaxPressure    Queue length        0.61            0.61
  MPLight        Queue length        2.49            0.57
  MPLightCO2     Queue length        2.53            0.47
  IDQN           Max queue           1.67            0.28
  MaxPressure    Max queue           0.47            0.47
  MPLight        Max queue           1.59            0.44
  MPLightCO2     Max queue           1.64            0.36

  MPLight, MPLightCO2, and IDQN showed similar improvements in queue length and
maximum queue length, reducing them by about 75-76% and 70-71%, respectively, compared
to MaxPressure. Hence, MPLight and MPLightCO2 algorithms seem to be more effective from
both a traffic improvement and environmental perspective than MaxPressure and IDQN in the
studied scenario.
   MPLight and MPLightCO2 performed significantly better than MaxPressure and IDQN in
terms of travel time and waiting time. This may indicate the importance of considering not
only traffic but also environmental aspects in crossroads management. Taking into account
CO2 emissions in the MPLightCO2 algorithm led to a significant reduction in the
environmental impact of traffic. This is an important aspect in the context of urban
sustainability. The reduction in queue length and maximum queue length in the MPLight and
MPLightCO2 algorithms indicates their ability to effectively regulate traffic flow and provide
better crossing capacity at road intersections. Future work should examine how well the
MPLight and MPLightCO2 algorithms adapt to different traffic conditions and whether
optimizing their parameters improves adaptability across scenarios.


5. Conclusions and Future Work
The study presents an approach that uses reinforcement learning and traffic pressure to
optimize vehicle travel time through a crossroads to reduce CO2 emissions. The proposed
method is based on the modern approaches MPLight and PressLight, but with an emphasis on the
priority of the CO2 emission metric as a key component of decision-making.
   The developed algorithm has been experimentally tested on the synthetic test scenario
SUMO GRID 4x4, which simulates the movement of vehicles at road intersections in an urban
environment. The comparative analysis showed that the MPLight and MPLightCO2
algorithms, which take into account the impact on the environmental situation, proved to be
more effective than MaxPressure and IDQN. They demonstrated an improvement in travel
time and waiting time of up to 33%, a reduction in queue length by 75-76%, and a reduction in
CO2 emissions by 32-33%.
   MPLightCO2 showed the best results among the algorithms compared, with the shortest
travel time, shortest average queue length, and lowest CO2 emissions, indicating its high
performance from both a traffic improvement and environmental perspective.
   The results obtained are preliminary and more testing on different models and
configurations, as well as verification in real-world conditions, is needed. However, the
proposed approach has shown satisfactory control results in a large-scale road network,
which gives grounds for its further improvement and implementation.


6. Acknowledgements

The study was executed as a component of the Horizon Europe Framework Program,
receiving support from the initiative aimed at aligning Ukrainian cities with the mission for
Climate-neutral and smart cities (HORIZON-MISS-2023-CIT-02).

7. References

[1] O. Pavlova, A. Bilinska, A. Holovatiuk, Y. Binkovskyi, D. Melnychuk. Automated System
     for Determining Speed of Cars Ahead. Computer Systems and Information Technologies.
     2023. №3. Pp. 32-39.
[2] C. Fetting, The European Green Deal. ESDN report, 2020, 53.
[3] A. Haydari, Y. Yılmaz, Deep reinforcement learning for intelligent transportation
     systems: a survey. IEEE Transactions on Intelligent Transportation Systems, 23 (1)
     (2020) 11–32.
[4] S. Damadam, M. Zourbakhsh, R. Javidan, A. Faroughi, An intelligent IoT based traffic light
     management system: deep reinforcement learning. Smart Cities, 5 (4) (2022) 1293–1311.
[5] C. Chen, H. Wei, N. Xu, G. Zheng, M. Yang, Y. Xiong, K. Xu, Z. Li, Toward a thousand
     lights: decentralized deep reinforcement learning for large-scale traffic signal control.
     Proceedings of the AAAI Conference on Artificial Intelligence, 34 (04) (2020) 3414–3421.
     https://doi.org/10.1609/aaai.v34i04.5744.
[6] T. Wu, P. Zhou, K. Liu, Y. Yuan, X. Wang, H. Huang, D. O. Wu, Multi-agent deep
     reinforcement learning for urban traffic light control in vehicular networks. IEEE
     Transactions on Vehicular Technology, 69 (8) (2020) 8243–8256.
     doi:10.1109/TVT.2020.2997896.
[7] O. Barmak, I. Krak, E. Manziuk, Diversity as the basis for effective clustering-based
     classification. CEUR-WS, 2711 (2020) 53–67.
[8] E. Manziuk, I. Krak, O. Barmak, O. Mazurets, V. Kuznetsov, O. Pylypiak, Structural
     alignment method of conceptual categories of ontology and formalized domain. CEUR-
     WS, 3003 (2021) 11–22.
[9] E. A. Manziuk, W. Wójcik, O. V. Barmak, I. V. Krak, A. I. Kulias, V. A. Drabovska, V. M.
     Puhach, S. Sundetov, A. Mussabekova, Approach to creating an ensemble on a hierarchy
     of clusters using model decisions correlation. Przegląd Elektrotechniczny, 96 (9) (2020)
     108–113. doi:10.15199/48.2020.09.23.
[10] L. Peng, L. Wang, D. Xia, Q. Gao, Effective energy consumption forecasting using
     empirical wavelet transform and long short-term memory. Energy, 238 (2022) 121756.
     doi:10.1016/j.energy.2021.121756.
[11] Ü. Ağbulut, Forecasting of transportation-related energy demand and CO2 emissions in
     Turkey with different machine learning algorithms. Sustainable Production and
     Consumption, 29 (2022) 141–157. doi:10.1016/j.spc.2021.10.001.
[12] S. Jomnonkwao, S. Uttra, V. Ratanavaraha, Forecasting road traffic deaths in Thailand:
     applications of time-series, curve estimation, multiple linear regression, and path analysis
     models. Sustainability, 12 (1) (2020) 395. doi:10.3390/su12010395.
[13] T. Afrin, N. Yodo, A survey of road traffic congestion measures towards a sustainable and
     resilient transportation system. Sustainability, 12 (11) (2020) 4660.
     doi:10.3390/su12114660.
[14] M. Nama, A. Nath, N. Bechra, J. Bhatia, S. Tanwar, M. Chaturvedi, B. Sadoun, Machine
     learning-based traffic scheduling techniques for intelligent transportation system:
     opportunities and challenges. International Journal of Communication Systems, 34 (9)
     (2021) e4814. doi:10.1002/dac.4814.
[15] A. Navarro-Espinoza, O. R. López-Bonilla, E. E. García-Guerrero, E. Tlelo-Cuautle, D.
     López-Mancilla, C. Hernández-Mejía, E. Inzunza-González, Traffic flow prediction for
     smart traffic lights using machine learning algorithms. Technologies, 10 (2022) 5.
     doi:10.3390/technologies10010005.
[16] G. Markowsky, O. Savenko, S. Lysenko, A. Nicheporuk, The technique for
     metamorphic viruses’ detection based on its obfuscation features analysis. CEUR-WS,
     2104 (2018) 680–687.
[17] B. Zhurakovskyi, J. Boiko, V. Druzhynin, I. Zeniv, O. Eromenko, Increasing the efficiency
     of information transmission in communication channels. Indonesian Journal of Electrical
     Engineering and Computer Science, 19 (3) (2020) 1306–1315.
[18] Z. Lei, D. Qin, L. Hou, J. Peng, Y. Liu, Z. Chen, An adaptive equivalent consumption
     minimization strategy for plug-in hybrid electric vehicles based on traffic information.
     Energy, 190 (2020) 116409. doi:10.1016/j.energy.2019.116409.
[19] H. He, Y. Wang, J. Li, J. Dou, R. Lian, Y. Li, An improved energy management strategy for
     hybrid electric vehicles integrating multistates of vehicle-traffic information. IEEE
     Transactions on Transportation Electrification, 7 (3) (2021) 1161–1172.
     doi:10.1109/TTE.2021.3054896.
[20] A. Arbi, N. Tahri, Stability analysis of inertial neural networks: a case of almost
     anti-periodic environment. Mathematical Methods in the Applied Sciences, 45 (16) (2022)
     10476–10490.
[21] Z. Sabir, A. Arbi, A. F. Hashem, M. A. Abdelkawy, Morlet wavelet neural network
     investigations to present the numerical investigations of the prediction differential
     model. Mathematics, 11 (21) (2023) 1–20.
[22] A. Arbi, J. Cao, M. Es-saiydy, M. Zarhouni, M. Zitane, Dynamics of delayed cellular neural
     networks in the Stepanov pseudo almost automorphic space. Discrete & Continuous
     Dynamical Systems-Series S, 15 (11) (2022).
[23] F. Rasheed, K.-L. A. Yau, R. Md. Noor, C. Wu, Y.-C. Low, Deep reinforcement learning for
     traffic signal control: a review. IEEE Access, 8 (2020) 208016–208044.
     doi:10.1109/ACCESS.2020.3034141.
[24] S. Koh, B. Zhou, H. Fang, P. Yang, Z. Yang, Q. Yang, L. Guan, Z. Ji, Real-time deep
     reinforcement learning based vehicle navigation. Applied Soft Computing, 96 (2020)
     106694. doi:10.1016/j.asoc.2020.106694.
[25] J. Ault, J. P. Hanna, G. Sharon, Learning an interpretable traffic signal control policy.
     arXiv preprint arXiv:1912.11023, (2019).
[26] X. Wang, L. Ke, Z. Qiao, X. Chai, Large-scale traffic signal control using a novel
     multiagent reinforcement learning. IEEE Transactions on Cybernetics, 51 (1) (2021) 174–
     187. doi:10.1109/TCYB.2020.3015811.