Workshop "From Objects to Agents" (WOA 2019) A Deep Reinforcement Learning Approach to Adaptive Traffic Lights Management Andrea Vidali, Luca Crociani, Giuseppe Vizzari, Stefania Bandini CSAI - Complex Systems & Artificial Intelligence Research Center, University of Milano-Bicocca, Milano, Italy name.surname@unimib.it Abstract—Traffic monitoring and control, as well as traffic with advances in machine learning, represents an opportunity simulation, are still significant and open challenges despite the for a scientific investigation of the possibility to employ significant researches that have been carried out, especially on these virtual environments as tools to explore the outcome of artificial intelligence approaches to tackle these problems. This paper presents a Reinforcement Learning approach to traffic potential regulation actions within specific situations, within a lights control, coupled with a microscopic agent-based simulator Reinforcement Learning [1] framework. (Simulation of Urban MObility - SUMO) providing a synthetic This paper represents a contribution within this line of but realistic environment in which the exploration of the outcome research and, in particular, we focus on a simple yet still of potential regulation actions can be carried out. The paper studied situation: a single four way intersection regulated by presents the approach, within the current research landscape, then the specific experimental setting and achieved results are traffic lights, that we want to manage through an autonomous described. agent perceiving the current traffic conditions, and exploiting the experience carried out in simulated situations, possibly Index Terms—reinforcement learning, traffic lights control, traffic management, agent-based simulation representing plausibile traffic conditions. The simulations are actually also agent-based, and in particular, for this study, they have been carried out in a tool for Simulation of Urban I. I NTRODUCTION MObility (SUMO) [2] providing a synthetic but realistic envi- Traffic monitoring and control and, in general, approaches ronment in which the exploration of the outcome of potential supporting the reduction of congestion still represent hot topics regulation actions can be carried out. An important aspect is for research of different disciplines, despite the substantial the fact that SUMO provides an Application Programming researches that have been devoted to these topics. The global Interface for interfacing with external programs, therefore we phenomenon of urbanization (half of the world’s population were able to define a plausible set of observable aspects of was living in cities at the end of 2008 and it is predicted the environment, control the traffic lights according to the that by 2050 about 64% of the developing world and 86% of decisions of the learning agent, as well as also to exploit some the developed world will be urbanized1 ) is in fact constantly stastics gathered by SUMO to describe the overall traffic flow changing the situation and making it actually harder to manage and therefore to define the reward to the actions carried out such a concentration of population and transportation de- by the traffic lights control agent. mand. 
Technological developments, among which autonomous driving represents just the most futuristic one (at least from a popular culture perspective), represent at the same time attempts to tackle these issues and further challenges, in terms of potential developments whose introduction requires further study and analysis of the potential impact and implications. Artificial Intelligence plays an important role within this framework; even without considering the obvious relevance to the autonomous driving initiative, we focus here on two aspects: (i) the regulation of traffic patterns, especially based on (ii) the analysis of situations by means of agent-based simulations, in which the behaviour of drivers and other relevant entities is modeled and computed within a synthetic environment. The latter, in particular, have reached a level of sufficient complexity and flexibility, and they have proven their capability to support decision makers in the exploration of alternative ways to manage traffic within urban settings. On the side of the regulation of traffic patterns, the availability of these simulators, coupled with advances in machine learning, represents an opportunity for a scientific investigation of the possibility to employ these virtual environments as tools to explore the outcome of potential regulation actions within specific situations, within a Reinforcement Learning [1] framework.

This paper represents a contribution within this line of research and, in particular, we focus on a simple yet still studied situation: a single four-way intersection regulated by traffic lights, which we want to manage through an autonomous agent perceiving the current traffic conditions and exploiting the experience gathered in simulated situations, possibly representing plausible traffic conditions. The simulations are also agent-based and, for this study, they have been carried out in a tool for Simulation of Urban MObility (SUMO) [2], providing a synthetic but realistic environment in which the exploration of the outcome of potential regulation actions can be carried out. An important aspect is the fact that SUMO provides an Application Programming Interface for interfacing with external programs; therefore we were able to define a plausible set of observable aspects of the environment, control the traffic lights according to the decisions of the learning agent, and also exploit some statistics gathered by SUMO to describe the overall traffic flow and therefore to define the reward for the actions carried out by the traffic lights control agent.

The paper is organized as follows: we first provide a compact description of the relevant portion of the state of the art in traffic lights management with RL approaches, then we introduce the experimental setting we adopted for this study. The RL approach we defined and adopted is given in Section IV, then the achieved results are described. Conclusions and future developments end the paper.

II. RELATED WORKS

A. Reinforcement Learning

One interpretation of the goals of AI is to develop machines that resemble the intelligent behavior of a human being. In order to achieve this goal, an AI system should be able to interact with the environment and learn how to correctly act inside it. An established area of AI that has proven capable of experience-driven autonomous learning is reinforcement learning [1]. Several complex tasks have been successfully completed using reinforcement learning in multiple fields, such as games [3], robotics [4], and traffic signal control.

In a Reinforcement Learning (RL) problem, an autonomous agent observes the environment and perceives a state s_t, which is the state of the environment at time t. Then the agent chooses an action a_t, which leads to a transition of the environment to the state s_{t+1}. After the environment transition, the agent obtains a reward r_{t+1}, which tells the agent how good a_t was with respect to a performance measure. The goal of the agent is to learn the policy π* that maximizes the cumulative expected reward obtained as a result of actions taken while following π*. The standard cycle of reinforcement learning is shown in Figure 1.

Fig. 1. The reinforcement learning cycle.
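The interaction cycle of Figure 1 can be summarized by a short sketch; this is a generic illustration rather than the code used in this paper, and the environment and agent interfaces (reset, step, choose_action, observe) are hypothetical placeholders.

```python
# Minimal sketch of the RL cycle of Figure 1 (illustrative only; env.reset/env.step
# and agent.choose_action/observe are hypothetical placeholder interfaces).
def run_episode(env, agent, max_steps=5400):
    state = env.reset()                        # initial perception s_t
    total_reward = 0.0
    for t in range(max_steps):
        action = agent.choose_action(state)    # a_t selected by the current policy
        next_state, reward = env.step(action)  # transition to s_{t+1} and reward r_{t+1}
        agent.observe(state, action, reward, next_state)  # learning update
        state = next_state
        total_reward += reward
    return total_reward
```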
B. Learning in Traffic Signal Control

Traffic signal control is a well-suited application context for RL techniques: in this framework, one or more autonomous agents have the goal of maximizing the efficiency of the traffic flow that drives through one or more intersections controlled by traffic lights. The use of RL for traffic signal control is motivated by several reasons [5]: (i) if trained properly, RL agents can adapt to different situations (e.g. road accidents, bad weather conditions); (ii) RL agents can self-learn without supervision or prior knowledge of the environment; (iii) the agent only needs a simplified model of the environment (essentially related to the state representation), since the agent learns using the system performance metric (i.e. the reward). RL techniques applied to traffic signal control address the following challenges [5]:

• Inappropriate traffic light sequence. Traffic lights usually choose the phases according to a static, predefined policy. This method could cause the activation of an inappropriate traffic light phase in a situation that could cause an increase in travel times.
• Inappropriate traffic light durations. Every traffic light phase has a predefined duration which does not depend on the current traffic conditions. This behavior could cause unnecessary waiting for the green phase.

Although the above are potential advantages of the RL approach to traffic signal control, not all of them have already been achieved, and (as we will show in the remainder of the paper) the present approach only represents an initial step in this overall line of work.

In order to apply a RL algorithm, it is necessary to define the state representation, the available actions and the reward function; in the following, we describe the most widely adopted approaches for the design of these elements within the context of Traffic Signal Control.

1) State representation: The state is the agent's perception of the environment in an arbitrary step. In the literature, state space representations differ particularly in information density. In low information density representations, the intersection's lanes are usually discretized in cells along the length of the lane. Lane cells are then mapped to cells of a vector, which marks 1 if a vehicle is inside the lane cell, 0 otherwise [6]. Some approaches include additional information, adopting such a vector of car presence with the addition of a vector encoding the relative velocity of vehicles [7]. The current traffic light phase could also be added as a third vector [8]. Regarding state representations with high information density, the agent usually receives an image of the current situation of the whole intersection, i.e. a snapshot of the simulator being used; multiple successive snapshots can be stacked together to give the agent a sense of the vehicle motion [9].

2) Actions representation: In the context of traffic signal control, the agent's actions are implemented with different degrees of flexibility, described below. In the category of action sets with low flexibility, the agent can choose among a defined set of light combinations; when an action is selected, a fixed amount of time elapses before the agent can select a new configuration [7]. Some works gave the agent more flexibility by defining phase durations of variable length [10]. An agent with higher flexibility chooses an action at every step of the simulation from a fixed set of light combinations; however, the selected action is not activated if the minimum amount of time required to release at least one vehicle has not passed [8], [9]. A slightly different approach is to have a defined cycle of light combinations activated in the intersection: the agent's action is the choice of when it is time to switch to the next light combination, and the decision is made at every step [11].

3) Reward representation: The reward is used by the agent to understand the effects of the latest action taken in the latest state; it is usually defined as a function of some performance indicator of the intersection efficiency, such as vehicles' delays, queue lengths, waiting times or overall throughput. Most of the works compute the change in cumulative vehicle delay between actions, where the vehicle delay is defined as the number of seconds a vehicle is steady [8], [9]. Similarly, the cumulative vehicle staying time can be used, which is the number of seconds the vehicle has been steady since its entrance in the environment [7]. Moreover, some works combine multiple indicators in a weighted sum [11].
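As an illustration of the low information density encodings and delay-change rewards discussed above, the following sketch shows one possible realization (presence vector, relative velocity vector, one-hot phase, and a delay-difference reward); the helper names and cell granularity are hypothetical and not taken verbatim from the cited works.

```python
import numpy as np

# Illustrative sketch of a low-density state encoding and a delay-change reward.
def encode_state(cell_occupancy, cell_speeds, current_phase, num_phases):
    presence = np.array([1.0 if occupied else 0.0 for occupied in cell_occupancy])
    velocity = np.array(cell_speeds) / max(max(cell_speeds), 1e-6)  # relative speeds
    phase = np.zeros(num_phases)
    phase[current_phase] = 1.0                  # one-hot encoding of the active phase
    return np.concatenate([presence, velocity, phase])

def delay_change_reward(prev_cumulative_delay, cumulative_delay):
    # positive when the last action reduced the cumulative vehicle delay
    return prev_cumulative_delay - cumulative_delay
```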
C. Adopted models and learning algorithms

Recent reinforcement learning research has proposed multiple possible solutions to the traffic signal control problem, from which it emerges that different algorithms and neural network structures can be used, although some common techniques are necessary but not sufficient to ensure good performance.

The most widely used algorithm to address the problem is Q-learning. The optimal behavior of the agent is achieved with the use of neural networks to approximate the Q-values given a state. Often, this approach includes a Convolutional Neural Network (CNN) to process the environment state and learn features from an image [9] or from a spatial representation [8], [7].

Genders and Razavi [8] and Gao et al. [7] make use of a Convolutional Neural Network to learn features from their spatial representation of the environment. The output of this network, together with the current phase, is passed to two fully connected layers that connect to the outputs represented by Q-values. This method showed good results in [7] against different traffic lights policies, such as longest-queue-first and fixed times, while in [8] it is compared to a shallow neural network; although it shows good performance, an evaluation against real-world traffic lights would lead to more significant results.

Mousavi et al. [9] analyzed two approaches to the traffic signal control problem: the first is value-based, while the second is policy-based. In the first approach, action values are predicted by minimizing the mean-squared error of the Q-values with stochastic gradient descent. In the alternative approach, the policy is learned by updating the policy parameters in such a way that the probability of taking good actions increases. A CNN is used as a function approximator to extract features from the image of the intersection, where in the value-based approach the output is the value of the actions, and in the policy-based approach it is a probability distribution over actions. Results show that both approaches achieve good performance against a defined baseline and do not suffer from instability issues.

In [10], a deep stacked autoencoder (SAE) neural network is used to learn the Q-values. This approach uses autoencoders to minimize the error between the Q-value predicted by the encoder neural network and the target Q-value, using a specific loss function. It is shown that it achieves better performance than traditional RL methods.
III. EXPERIMENTAL SETTING

The traffic microsimulator used for this research is Simulation of Urban MObility (SUMO) [12]. SUMO provides a software package which includes an infrastructure editor, a simulator interface and an application programming interface (API). These elements enable the user to design and implement custom configurations and functionalities of a road infrastructure and to exchange data during the traffic simulation.

In this research, the possibility of improving the traffic flow that drives through an intersection controlled by traffic lights is investigated using artificial intelligence techniques. The agent is represented by the traffic light system that interacts with the environment in order to maximize a certain measure of traffic efficiency. Given this general premise, the problem tackled in this paper is defined as follows: given the state of the intersection, what is the traffic light phase that the agent should choose, selected from a fixed set of predefined actions, in order to maximize the reward and consequently optimize the traffic efficiency of the intersection.

The typical workflow of the agent is shown in Figure 2. It should be underlined that in this application with SUMO the passage of time is represented in simulation steps, but the agent only operates at certain steps, after the environment has evolved enough. Therefore, in this paper every step dedicated to the agent's workflow is called an agentstep, while the steps dedicated to the simulation are simply called "steps". Hence, after a certain number of simulation steps, the agent starts its sequence of operations by gathering the current state of the environment. The agent also calculates the reward of the previously selected action, using some measure of the current traffic situation. The sample of data containing the information about the latest simulation steps is saved to a memory and later extracted for a training session. The agent is then ready to select a new action based on the current state of the environment, which resumes the simulation until the next agent interaction.

Fig. 2. The agent's workflow.
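The agentstep workflow of Figure 2 can be sketched as follows; this is an illustration of the described loop under stated assumptions, not the authors' code. The helpers get_state, compute_reward and apply_action, the agent interface, and the configuration file name "intersection.sumocfg" are hypothetical placeholders, and the fixed number of steps advanced per action is a simplification.

```python
import traci  # SUMO's Python API (TraCI)

# Sketch of one training episode following the agentstep workflow of Figure 2.
def run_training_episode(agent, green_duration=10):
    traci.start(["sumo", "-c", "intersection.sumocfg"])  # assumed config name
    old_state, old_action = None, None
    step = 0
    while step < 5400:                      # 1 simulation step = 1 simulated second
        state = get_state()                 # current perception of the intersection
        if old_state is not None:
            reward = compute_reward()       # outcome of the previously selected action
            agent.memory.add((old_state, old_action, reward, state))
            agent.replay()                  # hypothetical: training instance on a batch
        action = agent.choose_action(state)
        apply_action(action, old_action)    # sets the green (and, if needed, yellow) phase
        for _ in range(green_duration):     # let the simulation evolve until the next agentstep
            traci.simulationStep()
            step += 1
        old_state, old_action = state, action
    traci.close()
```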
The environment where the agent acts is represented in Figure 3. It is a 4-way intersection where 4 lanes per arm approach the intersection from the compass directions, leading to 4 lanes per arm leaving the intersection. Each arm is 750 meters long. On every arm, each lane defines the possible directions that a vehicle can follow: the right-most lane enables vehicles to turn right or go straight, the two central lanes bind the driver to go straight, while on the left-most lane the left turn is the only direction allowed. In the center of the intersection, a traffic light system, controlled by the agent, manages the approaching traffic. In particular, on every arm the left-most lane has a dedicated traffic light, while the other three lanes share a traffic light. Every traffic light in the environment operates according to the common European regulations, with the only exception being the absence of time between the end of a yellow phase and the start of the next green phase. In this environment pedestrians, sidewalks and pedestrian crossings are not included.

Fig. 3. The environment.

A. Training setup and traffic generation

The entire training is divided into multiple episodes; the total number of episodes is 300. By default, SUMO uses a time resolution of 1 second per step, and the duration of each episode is set at 1 hour and 30 minutes, therefore the total number of steps per episode is equal to 5400. 300 episodes of 1.5 hours each are equivalent to almost 19 days of continuous traffic, and the entire training takes about 6 hours on a high-end laptop.

In a simulated intersection, traffic generation is a crucial part that can have a big impact on the agent's performance. In order to maintain a high degree of realism, in each episode the traffic is generated according to a Weibull distribution with a shape parameter equal to 2. An example is shown in Figure 4. The distribution is presented in the form of a histogram, where the steps of one simulation episode are on the x-axis and the number of vehicles generated in that step window is on the y-axis. The Weibull distribution approximates specific traffic situations, where during the early stage the number of cars rises, representing a peak hour; then the number of incoming cars slowly decreases, describing the gradual mitigation of the traffic congestion. Also, every vehicle generated has the same physical dimensions and performance. A code sketch of the generation process follows the scenario list below.

Fig. 4. Traffic generation distribution over a single episode.

The traffic distribution described provides the exact step of the episode at which a vehicle will be generated. For every vehicle scheduled, its source arm and destination arm are determined using a random number generator which has a different seed in every episode, so it is not possible to have two equivalent episodes. In order to obtain a truly adaptive agent, the simulation should include a significant variety of traffic flows and patterns [13]. Therefore, four different scenarios are defined, as follows.

• High-traffic scenario. 4000 cars approach the intersection from every arm, evenly distributed. Then, 75% of the generated cars go straight and 25% turn left or right at the intersection.
• Low-traffic scenario. 600 cars approach the intersection from every arm, evenly distributed. Then, 75% of the generated cars go straight and 25% turn left or right at the intersection.
• NS-traffic scenario. 2000 cars approach the intersection, with 90% of them coming from the North or South arm. Then, 75% of the generated cars go straight and 25% turn left or right at the intersection.
• EW-traffic scenario. 2000 cars approach the intersection, with 90% of them coming from the East or West arm. Then, 75% of the generated cars go straight and 25% turn left or right at the intersection.

Each scenario corresponds to one single episode, and the scenarios cycle during the training always in the same order.
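The following sketch shows one way the Weibull-shaped departure schedule and the routing split could be produced; the scaling of the sampled values onto the 5400-step episode and the routing helper are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

# Sketch of Weibull-based departure timings (shape = 2), as described above.
def generate_departure_steps(n_cars=1000, max_steps=5400, seed=0):
    rng = np.random.default_rng(seed)           # a different seed is used in every episode
    timings = np.sort(rng.weibull(2, n_cars))   # shape parameter = 2
    # rescale the sampled values onto the [0, max_steps] interval
    steps = np.rint(timings / timings.max() * max_steps).astype(int)
    return steps, rng

def assign_route(rng, straight_prob=0.75):
    # 75% of the generated vehicles go straight, 25% turn left or right
    return "straight" if rng.random() < straight_prob else rng.choice(["left", "right"])
```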
IV. DESCRIPTION OF THE REINFORCEMENT LEARNING APPROACH

In order to design a system based on the reinforcement learning framework, it is necessary to define the state representation, the action set, the reward function and the agent learning techniques involved. It should be noted that the agent's elements in this paper could be easily replaced by a traffic monitoring system in a real-world deployment, compared to other relevant studies on this topic, which have higher requirements in terms of technical feasibility.

A. State representation

The state of the agent describes a representation of the situation of the environment at a given agentstep t, and it is usually denoted with s_t. To allow the agent to effectively learn to optimize the traffic, the state should provide sufficient information about the distribution of cars on each road. The objective of the chosen representation is to let the agent know the position of the vehicles inside the environment at agentstep t. For this purpose, the approach proposed in this paper is inspired by the DTSE [8], with the difference that less information is encoded in the state. In particular, this state design includes only spatial information about the vehicles inside the environment, and the cells used to discretize the continuous environment are not regular. The chosen design for the state representation is focused on realism: recent works on traffic signal control proposed information-rich states, but in reality they are hard to implement, since the information used in that kind of representation is difficult to gather. Therefore, this paper investigates the possibility of obtaining good results with a simple and easy-to-apply state representation.

Technically, in each arm of the intersection the incoming lanes are discretized in cells that identify the presence or absence of a vehicle inside them. Figure 5 shows the state representation for the west arm of the intersection. Between the beginning of the road and the intersection's stop line there are 20 cells: 10 of them are located along the left-only lane, while the other 10 cover the other three lanes. Therefore, in the whole intersection there are 80 cells. Not every cell has the same size: the further the cell is from the stop line, the longer it is, so more lane length is covered. The choice of the length of every cell is not trivial: if the cells were too long, some cars approaching the stop line might not be detected; if the cells were too short, the number of cells required to cover the length of the lane would increase, leading to higher computational complexity. In this paper, the length of the shortest cells, which are also the closest to the stop line, is exactly 2 meters longer than the length of a car. In summary, whenever the agent observes the state of the environment, it obtains the set of cells that describes the presence or absence of vehicles in every area of the incoming roads.

Fig. 5. Design of the state representation in the west arm of the intersection, with cell lengths.
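A possible way to build the 80-cell boolean state through SUMO's TraCI API is sketched below. The lane identifiers, the cell boundary distances and the mapping of lanes to cell indices are hypothetical placeholders: the actual values depend on the network file and on the irregular cell lengths shown in Figure 5.

```python
import numpy as np
import traci

# Assumed distance thresholds (meters from the stop line) delimiting the 10 cells
# of a lane group; these are illustrative values, not the paper's exact ones.
CELL_BOUNDARIES = [7, 14, 21, 28, 40, 60, 100, 160, 400, 750]

def get_state(incoming_lanes, lane_lengths, cell_offsets):
    """cell_offsets maps each incoming lane (or lane group) to the first of its 10 cells."""
    state = np.zeros(80, dtype=np.float32)
    for veh_id in traci.vehicle.getIDList():
        lane_id = traci.vehicle.getLaneID(veh_id)
        if lane_id not in incoming_lanes:
            continue
        # distance from the stop line = lane length - position along the lane
        dist = lane_lengths[lane_id] - traci.vehicle.getLanePosition(veh_id)
        for i, boundary in enumerate(CELL_BOUNDARIES):
            if dist < boundary:
                state[cell_offsets[lane_id] + i] = 1.0  # mark the cell as occupied
                break
    return state
```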
B. Action set

The action set identifies the possible actions that the agent can take. The agent is the traffic light system, so taking an action translates to activating a green phase for a set of lanes for a fixed amount of time, chosen from a predefined set of green phases. In this paper, the green time is set at 10 seconds and the yellow time is set at 4 seconds. Formally, the action space is defined in the set (1), which includes every possible action that the agent can take.

A = {NSA, NSLA, EWA, EWLA}    (1)

Every action of set (1) is described below, and Figure 6 shows a graphical representation of the four possible actions.

• North-South Advance (NSA): the green phase is active for vehicles that are in the north and south arms and want to proceed straight or turn right.
• North-South Left Advance (NSLA): the green phase is active for vehicles that are in the north and south arms and want to turn left.
• East-West Advance (EWA): the green phase is active for vehicles that are in the east and west arms and want to proceed straight or turn right.
• East-West Left Advance (EWLA): the green phase is active for vehicles that are in the east and west arms and want to turn left.

Fig. 6. Graphical representation of the four possible actions.

If the action chosen in agentstep t is the same as the action taken in the last agentstep t − 1 (i.e. the traffic light combination is the same), there is no yellow phase and the current green phase persists. On the contrary, if the action chosen in agentstep t is not equal to the previous action, a 4-second yellow phase is initiated between the two actions. This means that the number of simulation steps between two identical consecutive actions is 10, since 1 simulation step is equal to 1 second in SUMO. When the two consecutive actions are different, the yellow phase counts as 4 extra simulation steps, and therefore the total number of simulation steps between actions is 14. Figure 7 shows a brief scheme of this process.

Fig. 7. Possible differences of simulation steps between actions.
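The phase-switching logic described above could be applied as in the following sketch; the traffic light ID "TL" and the green/yellow phase index mappings are assumed placeholders that depend on the SUMO network definition.

```python
import traci

GREEN_TIME, YELLOW_TIME = 10, 4  # seconds, as defined above

# Sketch of applying the chosen action with the intermediate yellow phase.
def apply_action(action, previous_action, green_phase_index, yellow_phase_index):
    steps = 0
    if previous_action is not None and action != previous_action:
        # a different action was chosen: insert the 4 s yellow phase first
        traci.trafficlight.setPhase("TL", yellow_phase_index[previous_action])
        for _ in range(YELLOW_TIME):
            traci.simulationStep()
            steps += 1
    traci.trafficlight.setPhase("TL", green_phase_index[action])
    for _ in range(GREEN_TIME):
        traci.simulationStep()
        steps += 1
    return steps  # 10 steps for a repeated action, 14 when a yellow phase is added
```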
C. Reward function

In reinforcement learning, the reward represents the feedback from the environment after the agent has chosen an action. The agent uses the reward to understand the result of the taken action and to improve the model for future action choices; therefore, the reward is a crucial aspect of the learning process. The reward usually has two possible signs: positive or negative. A positive reward is generated as a consequence of good actions, a negative reward from bad actions. In this application, the objective is to maximize the traffic flow through the intersection over time. In order to achieve this goal, the reward should be derived from some performance measure of traffic efficiency, so that the agent is able to understand whether the taken action reduces or increases the intersection efficiency. In traffic analysis, several measures are used [14], such as throughput, mean delay and travel time. In this paper, two reward functions are presented, which use two slightly different traffic measures; they are described below.

1) Literature reward function: The first reward function is called "literature" because it is inspired by similar studies on this topic. The literature reward function uses as a metric the total waiting time, defined as in equation (2).

twt_t = Σ_{veh=1..n} wt(veh,t)    (2)

where wt(veh,t) is the amount of time, in seconds, a vehicle veh has had a speed of less than 0.1 m/s at agentstep t, and n represents the total number of vehicles in the environment at agentstep t. Therefore, twt_t is the total waiting time at agentstep t. From this metric, the literature reward function can be defined as a function of twt_t, as shown in (3).

r_t = 0.9 · twt_{t−1} − twt_t    (3)

where r_t represents the reward at agentstep t, and twt_t and twt_{t−1} represent the total waiting time of all the cars in the intersection captured respectively at agentsteps t and t − 1. The 0.9 coefficient helps with the stability of the training process. In a reinforcement learning application the reward can usually be positive or negative, and this implementation is no exception: equation (3) is designed in such a way that when the agent chooses a bad action it returns a negative value, and when it chooses a good action it returns a positive value. A bad action can be described as an action that, in the current agentstep t, adds more vehicles to the queues compared to the situation in the previous agentstep t − 1, resulting in higher waiting times compared to the previous agentstep. This behavior increases the twt for the current agentstep t, and consequently equation (3) assumes a negative value. The more vehicles are added to the queues at agentstep t, the more negative r_t will be, and therefore the worse the action will be evaluated by the agent. The same concept applies to good actions.

The problem with this reward function lies in the choice of the metric, and it emerges when the following situation arises. During the High-traffic scenario, very long queues appear. When the agent activates the green phase for a long queue, the departure of cars creates a wave of movement that traverses the entire queue. The reward associated with this phase activation is received not only in the next agentstep, as it should be, but also in the very next ones. That is because the movement wave persists longer than the interval between agentsteps, and the wave resets the waiting times of the cars in the queue as they briefly move, misleading the agent about the received reward.

2) Alternative reward function: The alternative reward function uses a metric that is slightly different from the former one, namely the accumulated total waiting time, defined in equation (4).

atwt_t = Σ_{veh=1..n} awt(veh,t)    (4)

where awt(veh,t) is the amount of time, in seconds, a vehicle veh has had a speed of less than 0.1 m/s since its spawn into the environment, measured at agentstep t, and n represents the total number of vehicles in the environment at agentstep t. Therefore, atwt_t is the accumulated total waiting time at agentstep t. With this metric, when a vehicle departs but does not manage to cross the intersection, the value of atwt_t does not reset (unlike the value of twt_t), avoiding the misleading reward associated with the literature reward function when a long queue builds up at the intersection. Once the metric is set, the alternative reward function is defined as in equation (5).

r_t = atwt_{t−1} − atwt_t    (5)

where r_t represents the reward at agentstep t, and atwt_t and atwt_{t−1} represent the accumulated total waiting time of all the cars in the intersection captured respectively at agentsteps t and t − 1.
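The two metrics and rewards could be gathered through TraCI as sketched below. Note that getWaitingTime() counts the seconds below 0.1 m/s since the last stop, matching twt, while getAccumulatedWaitingTime() accumulates waiting time only over SUMO's configurable interval (the --waiting-time-memory option), which would have to cover the whole episode for atwt; this is an illustrative assumption, not necessarily the authors' exact implementation.

```python
import traci

def total_waiting_time():
    # twt_t: waiting time since the last stop, summed over all vehicles
    return sum(traci.vehicle.getWaitingTime(v) for v in traci.vehicle.getIDList())

def accumulated_total_waiting_time():
    # atwt_t: accumulated waiting time over the configured interval, summed over all vehicles
    return sum(traci.vehicle.getAccumulatedWaitingTime(v) for v in traci.vehicle.getIDList())

def literature_reward(prev_twt, twt):
    return 0.9 * prev_twt - twt          # equation (3)

def alternative_reward(prev_atwt, atwt):
    return prev_atwt - atwt              # equation (5)
```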
D. Deep Q-Learning

The learning mechanism adopted in this paper is called Deep Q-Learning, which is a combination of two aspects widely adopted in the field of reinforcement learning: deep neural networks and Q-Learning. Q-Learning [15] is a form of model-free reinforcement learning [16]. It consists of assigning a value, called the Q-value, to an action taken from a given state of the environment. Formally, in the literature, the Q-value update is defined as in equation (6).

Q(s_t, a_t) = Q(s_t, a_t) + α · (r_{t+1} + γ · max_A Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t))    (6)

where Q(s_t, a_t) is the value of the action a_t taken from state s_t. The equation consists of updating the current Q-value with a quantity discounted by the learning rate α. Inside the parentheses, the term r_{t+1} represents the reward associated with taking action a_t from state s_t; the subscript t + 1 is used to emphasize the temporal relationship between taking the action a_t and receiving the consequent reward. The term Q(s_{t+1}, a_{t+1}) represents the immediate future's Q-value, where s_{t+1} is the next state into which the environment has evolved after taking action a_t in state s_t. The expression max_A means that, among the possible actions in state s_{t+1}, the most valuable one is selected. The term γ is the discount factor, which assumes a value between 0 and 1, lowering the importance of the future reward compared to the immediate reward.

In this paper, a slightly different version of equation (6) is used, presented in equation (7); from this point on it will be called the Q-learning function.

Q(s_t, a_t) = r_{t+1} + γ · max_A Q′(s_{t+1}, a_{t+1})    (7)

where r_{t+1} is the reward received after taking action a_t in state s_t. The term Q′(s_{t+1}, a_{t+1}) is the Q-value associated with taking action a_{t+1} in state s_{t+1}, i.e. the next state after taking action a_t in state s_t. As seen in equation (6), the discount factor γ denotes a small penalization of the future reward compared to the immediate reward. Once the agent is trained, the best action a_t taken from state s_t will be the one that maximizes the function Q(s_t, a_t); in other words, maximizing the Q-learning function means following the best strategy that the agent has learned.

In a reinforcement learning application, the state space is often so large that it is impractical to discover and save every state-action pair. Therefore, the Q-learning function is approximated using a neural network. In this paper, a fully connected deep neural network is used, composed of an input layer of 80 neurons, 5 hidden layers of 400 neurons each with rectified linear unit (ReLU) activations [17], and an output layer of 4 neurons with a linear activation function, each one representing the value of an action given a state. A graphical representation of the deep neural network is shown in Figure 8.

Fig. 8. Scheme of the deep neural network.
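A sketch of the described 80-400x5-4 fully connected network follows. The paper does not state which deep learning framework was used, so Keras is assumed here purely for illustration; the optimizer and learning rate are also assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_q_network(input_dim=80, hidden_units=400, num_actions=4):
    model = keras.Sequential()
    model.add(layers.Input(shape=(input_dim,)))           # 80 state cells
    for _ in range(5):                                     # 5 hidden layers with ReLU
        model.add(layers.Dense(hidden_units, activation="relu"))
    model.add(layers.Dense(num_actions, activation="linear"))  # one Q-value per action
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3),  # assumed settings
                  loss="mse")                              # regression on Q-value targets
    return model
```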
E. The training process

Experience replay [18] is a technique adopted during the training phase in order to improve the performance of the agent and the learning efficiency. It consists of submitting to the agent the information needed for learning in the form of a randomized group of samples called a batch, instead of immediately submitting the information that the agent gathers during the simulation (commonly called online learning). The batch is taken from a data structure intuitively called memory, which stores every sample collected during the training phase. A sample m is formally defined as the quadruple (8).

m = {s_t, a_t, r_{t+1}, s_{t+1}}    (8)

where r_{t+1} is the reward received after taking action a_t from state s_t, which evolves the environment into the next state s_{t+1}. This technique is adopted to remove correlations in the observation sequence: since the state s_{t+1} is a direct evolution of the state s_t, this correlation can decrease the training capability of the agent. Figure 9 shows a representation of the data collection task.

Fig. 9. Scheme of the data collection.

As stated earlier, the experience replay technique needs a memory, which is characterized by a memory size and a batch size. The memory size represents how many samples the memory can store and is set at 50000 samples. The batch size is defined as the number of samples that are retrieved from the memory in one training instance and is set at 100. If at a certain agentstep the memory is full, the oldest sample is removed to make space for the new one.

A training instance consists of learning the Q-value function iteratively, using the information contained in the batch of extracted samples. Every sample in the batch is used for training. From the standpoint of a single sample, which contains the elements {s_t, a_t, r_{t+1}, s_{t+1}}, the following operations are executed:

1) Prediction of the Q-values Q(s_t), which represent the current knowledge that the agent has about the action values from s_t.
2) Prediction of the Q-values Q′(s_{t+1}), which represent the knowledge of the agent about the action values starting from the state s_{t+1}.
3) Update of Q(s_t, a_t), which represents the value of the particular action a_t selected by the agent during the simulation. This value is overwritten using the Q-learning function described in equation (7): the element r_{t+1} is the reward associated with the action a_t, while max_A Q′(s_{t+1}, a_{t+1}) is obtained using the prediction of Q′(s_{t+1}) and represents the maximum expected future reward, i.e. the highest action value expected by the agent starting from state s_{t+1}; it is discounted by the factor γ, which gives more importance to the immediate reward.
4) Training of the neural network. The input is the state s_t, while the desired output is the updated set of Q-values Q(s_t, a_t), which now includes the maximum expected future reward thanks to the Q-value update.

A code sketch of this procedure, together with the exploration policy described below, is given at the end of this subsection.

Once the deep neural network has sufficiently approximated the Q-learning function, the best traffic efficiency is achieved by selecting the action with the highest value given the current state. A major problem in any reinforcement learning task is the action-selection policy while learning: whether to take an exploratory action and potentially learn more, or to take an exploitative action and attempt to optimize the current knowledge about the environment evolution. In this paper the ε-greedy exploration policy is chosen, represented by equation (9). It defines a probability ε for the current episode h of choosing an explorative action, and consequently a probability 1 − ε of choosing an exploitative action.

ε_h = 1 − h/H    (9)

where h is the current training episode and H is the total number of episodes. Initially, ε = 1, meaning that the agent exclusively explores. However, as training progresses, the agent increasingly exploits what it has learned, until it exclusively exploits.
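The replay memory, the batch update of steps 1)–4) following equation (7), and the ε-greedy selection of equation (9) could be organized as in the following sketch. The structure is illustrative: the "model" is assumed to expose Keras-style predict/fit methods as in the earlier network sketch, and the value of the discount factor γ is an assumption, since the paper does not report it.

```python
import random
import numpy as np

class Memory:
    def __init__(self, max_size=50000):
        self.samples, self.max_size = [], max_size

    def add(self, sample):                       # sample = (s_t, a_t, r_{t+1}, s_{t+1})
        self.samples.append(sample)
        if len(self.samples) > self.max_size:
            self.samples.pop(0)                  # drop the oldest sample

    def get_batch(self, batch_size=100):
        return random.sample(self.samples, min(batch_size, len(self.samples)))

def train_batch(model, batch, gamma=0.75):       # gamma is an assumed value
    states = np.array([s for s, _, _, _ in batch])
    next_states = np.array([s2 for _, _, _, s2 in batch])
    q = model.predict(states, verbose=0)             # step 1: Q(s_t)
    q_next = model.predict(next_states, verbose=0)   # step 2: Q'(s_{t+1})
    for i, (_, action, reward, _) in enumerate(batch):
        q[i][action] = reward + gamma * np.max(q_next[i])  # step 3: equation (7)
    model.fit(states, q, epochs=1, verbose=0)        # step 4: train on the updated targets

def choose_action(model, state, h, H, num_actions=4):
    epsilon = 1.0 - h / H                            # equation (9)
    if random.random() < epsilon:
        return random.randrange(num_actions)         # explore
    return int(np.argmax(model.predict(state[np.newaxis, :], verbose=0)[0]))  # exploit
```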
V. SIMULATION RESULTS

The performance of the agents is assessed in two parts: initially, the reward trend during the training is analyzed; then, a comparison between the agents and a static traffic light is discussed, with respect to common traffic metrics such as the cumulative wait time and the average wait time per vehicle. One agent is trained using the literature reward function, while the other one adopts the alternative reward function.

Figure 10 shows the learning improvement during the training in the Low-traffic scenario for both agents, in terms of cumulative negative reward, i.e. the magnitude of the actions' negative outcomes during each episode. As can be seen, each agent has learned a sufficiently correct policy in the Low-traffic scenario. As the training proceeds, both agents efficiently explore the environment and learn an adequate approximation of the Q-values; then, towards the end of the training, they try to optimize the Q-values by exploiting the knowledge learned so far. The fact that the agent with the alternative reward function has a better reward curve overall is not strong evidence of a better performance, since having two different reward functions means that different reward values are produced. The performance difference will be discussed later in the static traffic light benchmark.

Fig. 10. Cumulative negative reward of both agents per episode during the training in the Low-traffic scenario.

Figure 11 shows the same training data as Figure 10, but referred to the High-traffic scenario. In this scenario, the agent with the literature reward shows a significantly unstable reward curve, while the other agent's trend is stable. This behavior is caused by the choice of using the waiting time of vehicles as the metric for the reward function, which in situations with long queues causes the acquisition of misleading rewards. In fact, by using the accumulated waiting time as in the alternative reward function, vehicles do not reset their waiting times by simply advancing through the queue. As Figure 11 shows, the alternative reward function produces a more stable policy. In the NS-traffic and EW-traffic scenarios both agents perform well, since these are simpler situations to exploit.

Fig. 11. Cumulative negative reward of both agents per episode during the training in the High-traffic scenario.

In order to properly analyze which agent achieves better performance, a comparison between the agents and a Static Traffic Light (STL) is presented. The STL has the same layout as the agents and it cycles through the 4 phases always in the following order: [NSA − NSLA − EWA − EWLA]. Moreover, every phase has a fixed duration, inspired by real-world static traffic lights [19]: the phases NSA and EWA last 30 seconds, the phases NSLA and EWLA last 15 seconds, and the yellow phase is the same as the agent's, which is 4 seconds.

Table I shows the performance of the two agents compared to the STL. The metrics used to measure the performance difference are the cumulative wait time (cwt) and the average wait time per vehicle (awt/v). The cumulative wait time is defined as the sum of the waiting times of every car during the episode, while the average wait time per vehicle is defined as the average number of seconds spent by a vehicle in a steady position during the episode. These measures are gathered across 5 episodes and then averaged.

TABLE I
AGENTS PERFORMANCE OVERVIEW, PERCENTAGE VARIATIONS COMPARED TO STL (LOWER IS BETTER)

                               Literature reward agent   Alternative reward agent
Low-traffic scenario    cwt            -30%                       -47%
                        awt/v          -29%                       -45%
High-traffic scenario   cwt           +145%                       +26%
                        awt/v         +136%                       +25%
NS-traffic scenario     cwt            -50%                       -62%
                        awt/v          -47%                       -56%
EW-traffic scenario     cwt            -65%                       -65%
                        awt/v          -59%                       -58%

In general, the alternative reward agent achieves a better traffic efficiency compared to the literature agent: this is a consequence of the adoption of a reward function (based on accumulated waiting time) that more properly accounts for waiting times exceeding a single traffic light cycle. Considering just the waiting time starting from the last stop of the vehicles does not sufficiently emphasize the usefulness of keeping longer light cycles, and introduces too many yellow-light situations and phase changes, which are effective only in low or medium traffic situations. The fact that the agent is more effective in low to medium traffic situations suggests that an easy and almost immediate opportunity would be to separately develop agents devoted to different traffic situations, with a sort of controller that monitors the traffic flow and selects the most appropriate agent configuration. This experimentation also suggests that additional improvements would be possible by (i) improving the learning approach to achieve a more stable and faster convergence, and (ii) further improving the reward function to better describe the desired behaviour and to influence the average cycle length, which is more fruitfully short in low-traffic situations and long whenever the traffic conditions worsen.
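The two evaluation metrics defined above, and the way Table I relates them to the STL baseline, can be summarized by the following small sketch; the variable names are illustrative and the computation is an assumption consistent with the definitions given in the text.

```python
def cumulative_wait_time(per_vehicle_wait_times):
    # cwt: sum of the waiting time accumulated by every car during the episode
    return sum(per_vehicle_wait_times)

def average_wait_time_per_vehicle(per_vehicle_wait_times):
    # awt/v: average seconds spent in a steady position per vehicle
    return sum(per_vehicle_wait_times) / max(len(per_vehicle_wait_times), 1)

def percentage_variation(agent_value, stl_value):
    # negative values mean the agent makes vehicles wait less than the STL
    return 100.0 * (agent_value - stl_value) / stl_value
```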
VI. CONCLUSIONS AND FUTURE DEVELOPMENTS

This work has presented an exploration of the plausibility of a RL approach to the problem of traffic lights adaptation and management. The work has employed a realistic and validated traffic simulator to provide an environment in which to train and evaluate a RL agent. Two metrics for the reward of the agent's actions have been investigated, clarifying that a proper description of the application context is just as important as the competence in the proper application of machine learning approaches for achieving good results.

Future works are aimed at further improving the achieved results but also, in the longer term, at investigating the implications of introducing multiple RL agents within a road network, the possibility to coordinate their efforts for achieving global improvements over local ones, and also the implications on the vehicle population, which could perceive the change in the infrastructure and adapt in turn to exploit additional opportunities, potentially negating the achieved improvements due to an additional traffic demand on the improved intersections. It is important to perform analyses along this line of work to understand the plausibility, potential advantages or even unintended negative implications of the introduction in the real world of this form of self-adaptive system.

REFERENCES

[1] R. S. Sutton, A. G. Barto et al., Introduction to Reinforcement Learning. MIT Press, Cambridge, 1998, vol. 135.
[2] M. Behrisch, L. Bieker, J. Erdmann, and D. Krajzewicz, "SUMO – Simulation of Urban MObility: An overview," in SIMUL 2011. ThinkMind, October 2011. [Online]. Available: https://elib.dlr.de/71460/
[3] O. Vinyals, I. Babuschkin, J. Chung, M. Mathieu, M. Jaderberg, W. M. Czarnecki, A. Dudzik, A. Huang, P. Georgiev, R. Powell, T. Ewalds, D. Horgan, M. Kroiss, I. Danihelka, J. Agapiou, J. Oh, V. Dalibard, D. Choi, L. Sifre, Y. Sulsky, S. Vezhnevets, J. Molloy, T. Cai, D. Budden, T. Paine, C. Gulcehre, Z. Wang, T. Pfaff, T. Pohlen, Y. Wu, D. Yogatama, J. Cohen, K. McKinney, O. Smith, T. Schaul, T. Lillicrap, C. Apps, K. Kavukcuoglu, D. Hassabis, and D. Silver, "AlphaStar: Mastering the Real-Time Strategy Game StarCraft II," https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/, 2019.
[4] D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke et al., "QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation," arXiv preprint arXiv:1806.10293, 2018.
[5] K.-L. A. Yau, J. Qadir, H. L. Khoo, M. H. Ling, and P. Komisarczuk, "A survey on reinforcement learning models and algorithms for traffic signal control," ACM Computing Surveys (CSUR), vol. 50, no. 3, p. 34, 2017.
[6] W. Genders and S. Razavi, "Evaluating reinforcement learning state representations for adaptive traffic signal control," Procedia Computer Science, vol. 130, pp. 26–33, 2018.
[7] J. Gao, Y. Shen, J. Liu, M. Ito, and N. Shiratori, "Adaptive traffic signal control: Deep reinforcement learning algorithm with experience replay and target network," arXiv preprint arXiv:1705.02755, 2017.
[8] W. Genders and S. Razavi, "Using a deep reinforcement learning agent for traffic signal control," arXiv preprint arXiv:1611.01142, 2016.
[9] S. S. Mousavi, M. Schukat, and E. Howley, "Traffic light control using deep policy-gradient and value-function-based reinforcement learning," IET Intelligent Transport Systems, vol. 11, no. 7, pp. 417–423, 2017.
[10] L. Li, Y. Lv, and F.-Y. Wang, "Traffic signal timing via deep reinforcement learning," IEEE/CAA Journal of Automatica Sinica, vol. 3, no. 3, pp. 247–254, 2016.
[11] H. Wei, G. Zheng, H. Yao, and Z. Li, "IntelliLight: A reinforcement learning approach for intelligent traffic light control," in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2018, pp. 2496–2505.
[12] D. Krajzewicz, G. Hertkorn, C. Rössel, and P. Wagner, "SUMO (Simulation of Urban MObility) – an open-source traffic simulation," in Proceedings of the 4th Middle East Symposium on Simulation and Modelling (MESM2002), 2002, pp. 183–187.
[13] L. A. Rodegerdts, B. Nevers, B. Robinson, J. Ringert, P. Koonce, J. Bansen, T. Nguyen, J. McGill, D. Stewart, J. Suggett et al., "Signalized intersections: informational guide," Tech. Rep., 2004.
[14] R. Dowling, "Traffic analysis toolbox volume VI: Definition, interpretation, and calculation of traffic analysis tools measures of effectiveness," Tech. Rep., 2007.
[15] C. J. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.
[16] C. J. C. H. Watkins, "Learning from delayed rewards," Ph.D. dissertation, King's College, Cambridge, 1989.
[17] J. N. Tsitsiklis and B. Van Roy, "Analysis of temporal-difference learning with function approximation," in Advances in Neural Information Processing Systems, 1997, pp. 1075–1081.
[18] L.-J. Lin, "Reinforcement learning for robots using neural networks," Ph.D. thesis, Carnegie Mellon University, 1993.
[19] P. Koonce and L. Rodegerdts, "Traffic signal timing manual," United States Federal Highway Administration, Tech. Rep., 2008.