<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Reinforcement Learning-based Control of Traffic Lights in Non-stationary Environments: A Case Study in a Microscopic Simulator</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Denise de Oliveira</string-name>
          <email>edenise@inf.ufrgs.br</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ana L. C. Bazzan</string-name>
          <email>bazzan@inf.ufrgs.br</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bruno C. da Silva</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eduardo W. Basso</string-name>
          <email>ewbasso@inf.ufrgs.br</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luis Nunes</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rosaldo Rossetti</string-name>
          <email>r.rossetti@acm.org</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eugénio de Oliveira</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberto da Silva</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luis Lamb</string-name>
          <email>lamb@inf.ufrgs.br</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Depto. de Ciências e Tecnologias da Informação - ISCTE, Av. das Forças Armadas</institution>
          ,
          <addr-line>1649-026, Lisboa</addr-line>
          ,
          <country country="PT">Portugal</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Instituto de Informática - UFRGS, C. P. 15064</institution>
          <addr-line>CEP 91.501-970, P. Alegre, RS</addr-line>
          ,
          <country country="BR">Brazil</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>NIAD&amp;R - Faculdade de Engenharia da Universidade do Porto</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Coping with dynamic changes in traffic volume has been the object of recent publications. Recently, a method was proposed which is capable of learning in non-stationary scenarios via an approach that detects context changes. For particular scenarios such as traffic control, the performance of that method is better than a greedy strategy, as well as other reinforcement learning approaches such as Q-learning and Prioritized Sweeping. The goal of the present paper is to assess the feasibility of applying the above-mentioned approach in a more realistic scenario, implemented by means of a microscopic traffic simulator. We intend to show that the use of context detection is suitable to deal with noisy scenarios where non-stationarity occurs not only due to the changing volume of vehicles, but also because of the random behavior of drivers with regard to the operational task of driving (e.g. deceleration probability). The results confirm the tendencies already detected in the previous paper, although here the increase in noise makes the learning task much more difficult, and the correct separation of contexts harder.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Traffic in urban areas is getting increasingly more complex. Since the expansion of the traffic
network is no longer a socially attainable solution, the existing control systems have to be used in
a more intelligent way in order to increase the traffic throughput and decrease total travel times.
The most common way to control traffic is to deploy traffic lights at street junctions. This can solve
safety problems, but at the same time causes a decrease of flow and an increase of travel times. In
order to minimize the delays, several methods of traffic light control have been proposed.</p>
      <p>Classic traffic engineering approaches for controling traffic work well only in traffic networks with
very well-defined traffic volume patterns, like for instance morning and afternoon peaks. However,
in cities where these patterns are not clear, this approach may not be effective. This is the case in
big cities where business centers are no longer located exclusively downtown. For these situations,
more flexible approaches are needed.</p>
      <p>Classical traffic engineering approaches usually rely on linear programming. This is so because
the alternative of using totally decentralized approaches could impose communication bottlenecks
for the negotiation, and/or would require a traffic expert to mediate possible conflicts. Thus, more
robust approaches than linear programming are not only attractive, but necessary.</p>
      <p>Computer science approaches, on the other hand, sometimes make unrealistic assumptions which
make them difficult to deploy. This paper seeks to come closer to the reality by i) modeling drivers
in a more realistic way, and ii) using a microscopic traffic simulator less dependent on abstract
assumptions, such as those required by simulators based on queues.</p>
      <p>
        The goal of this paper is to extend the approach presented in [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ], which uses different
flavors of reinforcement learning to control traffic lights in isolated junctions. Specifically, our work
intends to assess the feasibility of applying that approach in microscopic simulations, since these
are more realistic than the macroscopic approach used before. The use of microscopic simulations
allows us to change parameters associated with specific types of drivers, such as their deceleration
probability. These types of parameters are important since they are a known source of noise in traffic
scenarios. In the previous work, a macroscopic, queue-based modeling was used, and thus many of
the fine-grained properties of drivers could not be modeled.
      </p>
      <p>
        In order to address a finer-grained description of drivers, behavioral models need to possess
deliberative components which implement goal-driven behavior. These components are not
compatible with the basic Nagel-Schreckenberg model, which is based on cellular automata and is further
detailed in Section 3. A scenario in which drivers are capable of deliberative decision-taking is
presented in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. However, that work uses a macroscopic simulator and hence it is not possible
to cope with drivers decelerating, for instance. On the other hand, in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] drivers are not particles
as in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and in the present paper. Rather, drivers know their routes and learn how to select them
given the traffic state. Traffic state is mostly influenced by traffic lights, which also learn how to
select signal plans to cope with the flow of drivers in their areas (co–evolution).
      </p>
      <p>The present paper is organized as follows. In Section 2 we review related approaches to traffic
control (Section 2.1) and basic concepts of reinforcement learning and their use in non-stationary
environments (Section 2.2). Section 3 introduces some concepts about cellular automata-based
microscopic simulation. Section 4 presents the formulation of the problem as one of
reinforcement learning while Section 5 discusses the simulations and the results obtained. We present the
conclusions in Section 6.
</p>
    </sec>
    <sec id="sec-2">
      <title>Background and Related Work</title>
      <p>Signalized intersections are controlled by signal-timing plans which are implemented at traffic lights.
A signal-timing plan (we use signal plan for short) is a unique set of timing parameters comprising
the cycle length L (the length of time for the complete sequence of the phase changes), and the
split (the division of the cycle length among the various movements or phases). Several plans are
normally required for an intersection to deal with changes in traffic volume.
</p>
      <sec id="sec-2-1">
        <title>MAS based approaches to traffic light control</title>
        <p>So far, several multi-agent approaches to traffic modelling have been proposed. However, usually
the focus of these approaches has been on the management level. On the other hand, our work
focuses on a fine-grained level of traffic flow control.</p>
        <p>
          In [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], a MAS based approach is described in which each traffic light is modeled as an agent,
each having a set of pre-defined signal plans to coordinate with neighbors. Different signal plans
can be selected in order to coordinate in a given traffic direction or during a pre-defined period of
the day. This approach uses techniques of evolutionary game theory: self-interested agents receive
a reward or a penalty given by the environment. Moreover, each agent possesses only information
about its local traffic state. However, payoff matrices (or at least the utilities and preferences of
the agents) are required, i.e. these figures have to be explicitly formalized by the designer of the
system.
        </p>
        <p>
          In [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], the formation of groups of traffic lights was considered in an application of a technique
of distributed constraint optimization, called cooperative mediation. However, in this case the
mediation was not decentralized: group mediators communicated their decisions to the mediated
agents in their groups and these agents just carried the tasks out. Also, the mediation process could
require a long time in highly constrained scenarios, imposing a negative impact in the coordination
mechanism. For these reasons, a decentralized and swarm-based model of task allocation was
developed in [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], in which dynamic group formation, without mediation, combined the advantages of
those two previous works (decentralization via swarm intelligence and dynamic group formation).
        </p>
        <p>
          Camponogara and Kraus [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] have studied a simple scenario with only two intersections, using
stochastic game-theory and reinforcement learning. Their results with this approach were better
than a best-effort (greedy) technique, better than a random policy, and also better than Q-learning.
Also, in [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] a set of different techniques were applied in order to try to improve the learning ability
of the agents in a simple scenario.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Reinforcement Learning</title>
        <p>
          Usually, Reinforcement Learning (RL) problems are modeled as Markov Decision Processes (MDPs).
These are described by a set of states, S, a set of actions, A, a reward function R(s, a) → ℝ, and
a probabilistic state transition function T(s, a, s′) → [0, 1]. An experience tuple ⟨s, a, s′, r⟩ denotes
the fact that the agent was in state s, performed action a, and ended up in s′ with reward r.
        </p>
        <p>Q-Learning
Reinforcement learning methods can be divided into two categories: model-free and model-based.
Model-based methods assume that the transition function T and the reward function R are available.
Model-free systems, such as Q-learning, on the other hand, do not require that the agent have access
to information about how the environment works. Q-Learning works by estimating good state–
action values, the Q-values, which are a numerical estimator of quality for a given pair of state and
action.</p>
        <p>The Q-Learning algorithm approximates the Q-values Q(s, a) as the agent acts in a given
environment. The update rule for each experience tuple ⟨s, a, s′, r⟩ is:</p>
        <p>Q(s, a) ← Q(s, a) + α (r + γ max_a′ Q(s′, a′) − Q(s, a))
where α is the learning rate and γ is the discount for future rewards. When the Q-values have nearly
converged to their optimal values, it is appropriate for the agent to act greedily, that is, to always
choose the action with the highest Q-value for the current state.
</p>
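        <p>As an illustration, the tabular update rule above can be sketched in a few lines of Python. The state and action counts, α and γ below are illustrative assumptions, not values taken from the paper.</p>
        <preformat>
```python
# Minimal sketch of the tabular Q-learning update described above.
# n_states, n_actions, alpha and gamma are illustrative assumptions.
n_states, n_actions = 9, 3
alpha, gamma = 0.1, 0.9
Q = [[0.0] * n_actions for _ in range(n_states)]

def q_update(s, a, s_next, r):
    """Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (td_target - Q[s][a])

def greedy_action(s):
    """Near convergence, act greedily on the Q-values."""
    return max(range(n_actions), key=lambda a: Q[s][a])

# one experience tuple (s, a, s', r)
q_update(0, 1, 3, -5.0)
```
        </preformat>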
        <p>Prioritized Sweeping
Prioritized Sweeping (PS) is somewhat similar to Q-Learning, except for the fact that it continuously
estimates a single model of the environment. Also, it updates more than one state value per iteration
and makes direct use of state values, instead of Q-values. The states whose values should be updated
after each iteration are determined by a priority queue, which stores the states' priorities, initially set
to zero. Also, each state remembers its predecessors, defined as all states with a non-zero transition
probability to it under some action. At each iteration, new estimates T̂ and R̂ of the dynamics are
made. The usual manner to update T̂ is to calculate the maximum-likelihood probability. Instead
of storing the transition probabilities in the form T(s, a, s′), PS stores the number of times the
transition (s, a, s′) occurred and the total number of times the situation (s, a) has been reached.
This way, it is easy to update these parameters and to compute T̂ as |(s, a, s′)| / |(s, a)|.</p>
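        <p>The count-based maximum-likelihood estimate can be sketched as follows; this is an illustrative reading of the description above, not the authors' implementation.</p>
        <preformat>
```python
from collections import defaultdict

# Sketch of the count-based maximum-likelihood transition estimate used
# by Prioritized Sweeping: store counts instead of probabilities.
sa_count = defaultdict(int)    # times (s, a) was experienced
sas_count = defaultdict(int)   # times (s, a, s') occurred

def record(s, a, s_next):
    sa_count[(s, a)] += 1
    sas_count[(s, a, s_next)] += 1

def t_hat(s, a, s_next):
    """Estimate T(s, a, s') as |(s, a, s')| / |(s, a)|."""
    if sa_count[(s, a)] == 0:
        return 0.0
    return sas_count[(s, a, s_next)] / sa_count[(s, a)]

record(0, 1, 2)
record(0, 1, 2)
record(0, 1, 3)
```
        </preformat>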
        <p>RL in non-stationary environments
When dealing with non-stationary environments, both the model-free and the model-based RL
approaches need to continuously relearn everything from scratch, since the policy which was
calculated for a given environment is no longer valid after a change in dynamics. This causes a
performance drop during the readjustment phase, and also forces the algorithm to relearn policies
even for environment dynamics which have been previously experienced.</p>
        <p>
          In order to cope with non-stationary environments, alternative RL methods have been proposed,
e.g. the mechanisms proposed by Choi and colleagues [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] and Doya and colleagues [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
Unfortunately, their approaches require a fixed number of models, and thus implicitly assume that the
approximate number of different environment dynamics is known a priori. Since this assumption
is not always realistic, our method tries to overcome it by incrementally building new models.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Microscopic Simulation</title>
      <p>There are two main approaches to the simulation of traffic: macroscopic and microscopic. The latter
allows the description of each road user as detailed as desired, given computational restrictions, thus
permitting the modeling of several types of drivers’ behaviors. Multi-agent simulation is a promising
technique for microscopic traffic models as the drivers’ behavior can be described in a way that
incorporates complex and individual decision-making.</p>
      <p>
        In general, microscopic traffic flow models describe the act of driving on a road, that is, the
perception and reaction of a driver on a short time-scale. In these models, the drivers are the basic
entities, and their behaviors are described according to several different types of mathematical
formulations (e.g. continuous models such as car-following). Other models are not continuous,
but instead use cellular automata techniques. In particular, we use the Nagel-Schreckenberg model
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] due to its simplicity. The Nagel-Schreckenberg cellular automaton model represents a minimal
model in the sense that it is capable of reproducing several basic features of real traffic using only
very few behavioral rules.
      </p>
      <p>In the following, the Nagel-Schreckenberg model for single-lane traffic is briefly reviewed. The
road is considered to be subdivided into cells with a length of around 7.5 or 5 meters (for
highways or urban traffic, respectively). Each cell is either empty or occupied by exactly one
vehicle, which has an integer speed vi ∈ {0, . . . , vmax}, and a maximum speed vmax. The dynamics
of vehicle motion is described by the following rules (parallel dynamics):</p>
      <sec id="sec-3-1">
        <title>Update rules</title>
        <p>R1 Acceleration: vi ← min(vi + 1, vmax),</p>
        <p>R2 Deceleration: in order to avoid accidents, vi′ ← min(vi, gap),</p>
        <p>R3 Randomizing: with a certain probability p, do vi′′ ← max(vi′ − 1, 0),</p>
        <p>R4 Movement: xi ← xi + vi′′.</p>
        <p>The variable gap denotes the number of empty cells in front of the vehicle at cell i. A time-step
corresponds to Δt ≈ 1 sec, the typical time a driver needs to react. Every driver described by
the Nagel-Schreckenberg model can be seen as a reactive agent: autonomous, situated in a discrete
environment, and having (potentially individual) characteristics such as its maximum speed vmax
and deceleration probability p. During the process of driving, it perceives its distance to the
predecessor (gap), its own current speed v, etc. This information is processed using rules
R1–R3, and changes in the environment are made using R4. The first rule describes one goal of the
agent (to move at the maximum speed possible, vmax); its other goal is to drive safely, i.e.
not to collide with its predecessor (R2). These first two rules describe deterministic behavior, which
implies that the stationary state of the system is determined only by the initial conditions. However,
real drivers do not react in this optimal way. Instead, they sometimes vary their driving behavior
without any obvious reason. This uncertainty in behavior is reflected by the braking noise p (R3).
Braking noise mimics the complex interactions with the other agents, such as overreaction in braking
and delayed reaction while accelerating. Finally, carrying the last rule out means that the agent will
act on the environment and move according to its current speed.</p>
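        <p>One parallel update step of the single-lane model can be sketched as below; the lane length, vmax and the braking probability are illustrative choices, not parameters from the paper.</p>
        <preformat>
```python
import random

# Sketch of one update step of the Nagel-Schreckenberg single-lane model
# (rules R1-R4) on a circular lane. LANE_LEN, VMAX and P_DECEL are
# illustrative assumptions.
LANE_LEN = 20
VMAX = 3
P_DECEL = 0.0  # set above 0.0 to enable the randomizing rule R3

def step(cars, rng=random):
    """cars: dict position -> speed. Returns the updated dict."""
    ordered = sorted(cars)
    new_cars = {}
    for idx, x in enumerate(ordered):
        v = cars[x]
        # gap: empty cells to the next car ahead (periodic boundary)
        nxt = ordered[(idx + 1) % len(ordered)]
        gap = (nxt - x - 1) % LANE_LEN
        v = min(v + 1, VMAX)               # R1 acceleration
        v = min(v, gap)                    # R2 deceleration (no collisions)
        if rng.random() > 1.0 - P_DECEL:   # R3 randomizing, probability p
            v = max(v - 1, 0)
        new_cars[(x + v) % LANE_LEN] = v   # R4 movement
    return new_cars

cars = {0: 0, 5: 0}
for _ in range(4):
    cars = step(cars)
```
        </preformat>
        <p>With p = 0 the behavior is deterministic: both vehicles accelerate to vmax and keep their headway, as rules R1 and R2 predict.</p>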
        <p>
          In the next section, we discuss the use of the microscopic modeling in a scenario in which the
traffic lights learn to cope with the different traffic patterns. All experiments were made using the
ITSUMO tool [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] which implements the Nagel-Schreckenberg model.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Using RL-CD for Traffic Lights Control</title>
      <p>
        Traffic optimization is a hard problem because it deals with highly dynamic and non-stationary
flow patterns. Thus, it is important to state our assumptions regarding the type of dynamics we
expect to happen. Specifically, the class of non-stationary environments that we deal with is similar
to the one studied by Hidden-Mode MDPs researchers [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. We assume that the following properties
hold: 1) environmental changes are confined to a small number of contexts, which are stationary
environments with distinct dynamics; 2) the current context cannot be directly observed, but can
be estimated according to the types of transitions and rewards observed; 3) environmental context
changes are independent of the agent’s actions; and 4) context changes are relatively infrequent.
      </p>
      <p>In a traffic scenario, these assumptions mean that flow patterns are non-stationary but they
can be nearly divided into stationary dynamics. We do not assume, however, that this division must
be known a priori. In fact, one of the interesting aspects of our method is exactly its capability of
automatically partitioning the environment dynamics into relevant partial models. Moreover, we
assume that the current flow pattern is not given or directly perceivable, but can be estimated by
observing attributes such as the queue of vehicles, street densities, etc.</p>
      <p>
        Our approach relies on RL methods because they provide online learning capabilities and do
not rely on offline analysis. In the next sections we review the RL method called RL-CD or
Reinforcement Learning with Context Detection [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The RL-CD mechanism assumes that the
use of multiple partial models of the environment is a good approach for dealing with non-stationary
scenarios such as traffic control. The use of multiple models makes the learning system capable
of partitioning the knowledge in a way that each model automatically assumes for itself the
responsibility for “understanding” one kind of flow pattern. To each model, we assign an optimal
policy, which is a mapping from traffic conditions to signal plans, and a trace of prediction error of
transitions and rewards, responsible for estimating the quality of a given partial model. Moreover,
the creation of new models is controlled by a continuous evaluation of the prediction errors
generated by each partial model. In the following subsections we first describe how to learn contexts
(i.e., estimate models for traffic patterns), and then we show how to detect and switch to the most
adequate model given a sequence of observations.
      </p>
      <sec id="sec-4-1">
        <title>Learning traffic patterns</title>
        <p>Henceforth we use the terms traffic pattern and context interchangeably. A context consists of
a nearly stable set of traffic flow characteristics. In terms of RL, a context consists of a given
environment dynamics, which is experienced by the agent as a class of transitions and rewards.
The mechanism for detecting context changes relies on a set of partial models for predicting the
environment dynamics. A partial model m contains a function Tˆm, which estimates transition
probabilities, and also a function Rˆm, which estimates the rewards to be received.</p>
        <p>
          For each partial model, classic model-based RL methods such as Prioritized Sweeping and Dyna
[
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] may be used to compute a locally optimal policy. The policy of a partial model is described
by the function πm(s) and it is said to be locally optimal because it describes the optimal actions
for each state in a specific context. For example, if the dynamics of a non-stationary environment
Θ can be described by m1, then πm1 will be the associated optimal policy. If the non-stationarity
of Θ makes itself noticeable by making πm1 suboptimal, then the system creates a new model, m2,
which would predict with a higher degree of confidence the transitions of the newly arrived context.
Associated with m2, a locally optimal policy πm2 would be used to estimate the best actions in m2.
Whenever possible, the system reuses existing models instead of creating new ones.
        </p>
        <p>Given an experience tuple ϕ ≡ ⟨s, a, s′, r⟩, we update the current partial model m by adjusting its
model of transitions and rewards by ΔT̂m,ϕ and ΔR̂m,ϕ, respectively. These adjustments are computed
as follows:</p>
        <p>ΔT̂m,ϕ(κ) = (1 / (Nm(s, a) + 1)) (τκs′ − T̂m(s, a, κ)), ∀κ ∈ S</p>
        <p>ΔR̂m,ϕ = (1 / (Nm(s, a) + 1)) (r − R̂m(s, a))</p>
        <p>where τ is the Kronecker delta: τκs′ = 1 if κ = s′, and τκs′ = 0 otherwise.</p>
        <p>The effect of τ is to update the transition probability T (s, a, s′) towards 1 and all other
transitions T (s, a, κ), for all κ ∈ S, towards zero. The quantity Nm(s, a) reflects the number of times, in
model m, action a was executed in state s. We compute Nm considering only a truncated (finite)
memory of the past M experiences.</p>
        <p>A truncated value of N acts like a learning coefficient for T̂m and R̂m, causing transitions to
be updated faster in the initial observations and slower as the agent gains more experience. Having the
values for ΔT̂m,ϕ and ΔR̂m,ϕ, we update the transition probabilities:</p>
        <p>T̂m(s, a, κ) = T̂m(s, a, κ) + ΔT̂m,ϕ(κ), ∀κ ∈ S</p>
        <p>and also the model of expected rewards:</p>
        <p>R̂m(s, a) = R̂m(s, a) + ΔR̂m,ϕ</p>
      </sec>
      <sec id="sec-4-2">
        <title>Detecting changes in traffic patterns</title>
        <p>In order to detect changes in the traffic patterns (contexts), the system must be able to evaluate
how well the current partial model can predict the environment. Thus, an error signal is computed
for each partial model. The instantaneous error is proportional to a confidence value, which reflects
the number of times the agent tried an action in a state. Given a model m and an experience tuple
ϕ = ⟨s, a, s′, r⟩, we calculate the instantaneous error em,ϕ and the confidence cm(s, a) as follows:</p>
        <p>cm(s, a) = (Nm(s, a) / M)²</p>
        <p>em,ϕ = cm(s, a) (Ω (ΔR̂m,ϕ)² + (1 − Ω) Σκ∈S ΔT̂m,ϕ(κ)²)</p>
        <p>where Ω specifies the relative importance of the reward and transition prediction errors for the
assessment of the model’s quality. Once the instantaneous error has been computed, the trace of
prediction error Em for each partial model is updated:</p>
        <p>Em = Em + ρ (em,ϕ − Em),
where ρ is the adjustment coefficient for the error.</p>
        <p>The error Em is updated after each iteration for every partial model m, but only the active
model is corrected according to the update equations for T̂m and R̂m given in Section 4.1. A plasticity
threshold λ is used to specify for how long a partial model should keep being adjusted. When Em becomes higher than λ, the predictions made
by the model are considered sufficiently different from the real observations. In this case, a context
change is detected and the model with the lowest error is activated. A new partial model is created
when no existing model has a trace error smaller than the plasticity threshold. The mechanism starts with
only one model and then incrementally creates new partial models as they become necessary. In
other words, whenever the model of predictions is considered sufficiently inadequate, we assume
that the dynamics of the environment have changed, and then we either select among some other
good model, or create a new one from scratch.</p>
        <p>The parameter λ must be adjusted according to how non-stationary the environment seems to
be. Among the factors that influence this we can name i) how coarse the state discretization is; ii)
the number of Markov transitions whose probabilities change when the context is altered; iii) the
quantity of aliased or partially-observable states. After λ is adjusted and exploration takes place,
during learning, we expect a finite number of models to be created. The total number of models
should stabilise, and these models are expected to correctly describe the stationary dynamics that
compose the overall non-stationary scenario. In our experiments, we have seen no relation between
λ and the RL learning rate α, since the first measures the maximum acceptable prediction error
(related to the environment model), and the second measures how fast the Q-values are updated
(related to the policy).</p>
        <p>Pseudo-code for RL-CD is presented in Algorithm 1.</p>
        <p>The newmodel() routine is used to create a new partial model; it initializes all estimates and
variables to zero, except T̂m, which is initialized with equally probable transitions. The values of the
parameters M, ρ, Ω and λ must be tuned according to the problem. Small values of ρ are appropriate for noisy
environments; higher values of M define systems which require more experience in order to gain
confidence in their predictions; in general applications, Ω may be set to 0.5; the plasticity threshold
λ should be set to higher values according to the need for learning relevant (non-noisy) but rare
transitions.
</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Scenario and Experiments</title>
      <sec id="sec-5-1">
        <title>Scenario</title>
        <p>Our scenario consists of a traffic network which is a 3x3 Manhattan-like grid, with a traffic light
in each junction. Figure 1 depicts this network, where the 9 nodes correspond to traffic lights and
the 24 edges are directed (one-way) links, each having a specific maximum allowed velocity. In
Figure 1, the links depicted in black, dotted, and gray have their maximum allowed velocities set
to 54 km/h, 36 km/h, and 18 km/h, respectively. These velocities correspond to 3, 2, and 1 cell per
time-step.</p>
        <p>Each link has capacity for 50 vehicles. Vehicles are inserted by sources and removed by sinks,
depicted as diamonds and squares, respectively, in Figure 1. In order to model the
non-stationarity of the traffic behavior, our scenario assumes three different traffic patterns (contexts).
Each consists of a different vehicle insertion distribution. These three contexts are:
• Low: low insertion rate in both the North and East sources, allowing the traffic network to
perform relatively well even if the policies are not optimal (i.e., the network is undersaturated);
• Vertical: high insertion rate in the North sources (S0, S1, and S2), and average insertion rate
in the East sources (S3, S4, and S5);
• Horizontal: high insertion rate in the East sources (S3, S4, and S5), and average insertion
rate in the North sources (S0, S1, and S2).</p>
        <p>Vehicles do not change directions during the simulation, and upon arriving at sinks they are
immediately removed.</p>
        <p>We modeled the scenario in a way that each traffic light is controlled by exactly one agent,
each agent making only local decisions. Even though decisions are local, we assess how well the
mechanism is performing by measuring global performance values. By using reinforcement learning
to optimize isolated junctions, we implement decentralized controllers and avoid expensive offline
processing.
</p>
        <p>[Figure 1: the 3x3 grid traffic network. Nodes TL0–TL8 are traffic lights; diamonds and squares are sources and sinks; black, dotted, and gray links have maximum allowed velocities of 54 km/h, 36 km/h, and 18 km/h, respectively.]</p>
        <p>As a general measure of performance or effectiveness of traffic control systems, usually one seeks
to optimize a weighted combination of number of stopped vehicles and total travel time. Here we
measure the total number of stopped vehicles, since this is an attribute which can be easily delivered
by real inductive loop detectors.</p>
        <p>At each time-step, we discretize the number of cars in a link so that its state is either
empty, regular or full, based on its occupation. The state of an agent is given by the states of
the links arriving in its corresponding traffic light. Since there are two one-way links arriving at
each traffic light (one from north and one from east), there are 9 possible states for each agent.
The reward for each agent is given locally by the squared sum of the incoming links’ queues.
Performance, however, is evaluated for the whole traffic network by summing the queue length of
all links, including external queues (queues containing vehicles trying to enter the scenario).</p>
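        <p>This discretization can be sketched as follows. The occupation thresholds (one and two thirds of the 50-vehicle capacity) and the negative sign of the reward are illustrative assumptions; the paper specifies neither.</p>
        <preformat>
```python
# Sketch of the state discretization described above: each incoming link
# is classified as empty, regular or full from its occupation, and the
# agent state combines its two incoming links (3 x 3 = 9 states).
CAPACITY = 50

def link_state(n_vehicles):
    if n_vehicles * 3 > 2 * CAPACITY:   # above 2/3 of capacity (assumed)
        return 2  # full
    if n_vehicles * 3 > CAPACITY:       # above 1/3 of capacity (assumed)
        return 1  # regular
    return 0      # empty

def agent_state(north_link, east_link):
    """Combine the two incoming links into one of 9 states."""
    return 3 * link_state(north_link) + link_state(east_link)

def local_reward(north_link, east_link):
    """Squared sum of the incoming links' queues; the negative sign
    (larger queues are worse) is an assumption."""
    return -((north_link + east_link) ** 2)
```
        </preformat>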
        <p>We have already discussed that traffic lights usually need a set of signal plans in order to deal
with different traffic conditions and/or times of day. We consider here three plans, each with
two phases: one giving green time to direction north-south (NS), and the other to direction east-west
(EW). Each of the three signal plans uses different green times for the phases: signal plan 1 gives
equal green time to both phases; signal plan 2 gives priority to the vertical direction; and signal
plan 3 gives priority to the horizontal direction. All signal plans have a cycle time of 60 seconds and
phases of 42, 30, or 18 seconds (70% of the cycle time for the preferential direction, 50% of the cycle
time for the equal split, and 30% of the cycle time for the non-preferential direction). The signal plan
with equal phase times gives 30 seconds to each direction (50% of the cycle time); the signal plan which
prioritizes the vertical direction gives 42 seconds to phase NS and 18 seconds to phase EW; and the
signal plan which prioritizes the horizontal direction gives 42 seconds to phase EW and 18 seconds to
phase NS.</p>
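The three plans can be written down directly from the description above (a minimal sketch; the dictionary layout and names are our own):

```python
CYCLE_TIME = 60  # seconds, common to all three signal plans

# Green time (in seconds) given to each phase by each signal plan.
SIGNAL_PLANS = {
    1: {"NS": 30, "EW": 30},  # equal split: 50% / 50% of the cycle
    2: {"NS": 42, "EW": 18},  # vertical priority: 70% / 30%
    3: {"NS": 18, "EW": 42},  # horizontal priority: 30% / 70%
}
```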
        <p>In our simulation, one time-step consists of two entire cycles of the signal plan, that is, 120 seconds
of simulated traffic. The agent’s action consists of selecting one of the three available signal plans
at each simulation step.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Experiments</title>
        <p>In the following experiments we compare our method against a greedy policy and against classic
model-free and model-based reinforcement learning algorithms.</p>
        <p>We change the context (traffic pattern) every 100 time-steps, which corresponds to 12000
seconds. The probability of a vehicle being inserted at each second, during each traffic pattern, by
the sources is given in Table 1. Moreover, all comparisons of the control methods’ performance use
the metric described in Section 5.1, that is, the total number of stopped vehicles in all links,
including external queues, during the simulation time. The use of this metric implies that the lower
the value in the graph, the better the performance.</p>
        <sec id="sec-5-2-1">
          <title>Table 1: Vehicle insertion probability per second, per traffic pattern</title>
          <p>Low: North sources (S0, S1, S2) 0.05; East sources (S3, S4, S5) 0.05.
Vertical: North sources 0.20; East sources 0.05.
Horizontal: North sources 0.05; East sources 0.20.</p>
        </sec>
        <sec id="sec-5-2-2">
          <title>Results</title>
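The insertion process of Table 1 can be sketched as one Bernoulli trial per source, per simulated second (the function and key names are ours):

```python
import random

# Insertion probability, per source and per second, for each
# traffic pattern (Table 1). Group names are our own shorthand.
INSERTION_PROB = {
    "low":        {"north": 0.05, "east": 0.05},
    "vertical":   {"north": 0.20, "east": 0.05},
    "horizontal": {"north": 0.05, "east": 0.20},
}

def try_insert_vehicle(pattern, group, rng=random.random):
    """One Bernoulli trial: does this source emit a vehicle this second?"""
    return rng() < INSERTION_PROB[pattern][group]
```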
          <p>We first implemented a greedy policy, for the sake of comparison. The greedy solution
is a standard decentralized solution for traffic-responsive networks in which there is no
coordination. Each agent makes decisions based solely on the state of its input links, and
selects the signal plan which gives priority to the direction with the higher occupation. If the state of
both links is the same, then the greedy agent selects the signal plan with equal time distribution.
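A minimal sketch of this greedy rule, using the three occupation levels of Section 5.1 and the plan numbering above (names are ours):

```python
_ORDER = {"empty": 0, "regular": 1, "full": 2}

def greedy_plan(north_level, east_level):
    """Select the signal plan prioritizing the more occupied direction."""
    if _ORDER[north_level] > _ORDER[east_level]:
        return 2  # plan 2 prioritizes the vertical (NS) direction
    if _ORDER[east_level] > _ORDER[north_level]:
        return 3  # plan 3 prioritizes the horizontal (EW) direction
    return 1      # same state on both links: equal time distribution
```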
</p>
          <p>[Figure 2: total number of stopped vehicles over simulated time (12000 to 84000 seconds) for Fixed Plans, Greedy, QL, PS, and RL-CD.]</p>
          <p>
Figure 2 shows the comparison between our method, fixed signal plans, greedy selection, and
two standard RL methods, namely Q-Learning and Prioritized Sweeping, in a scenario where the
vehicles have no probability of deceleration (see rule R3 in Section 3). We can see in this graph
that the performance of all learning methods is very similar to that of the greedy selection mechanism.
Using no learning, i.e. using fixed plans, is of course not efficient, because fixed plans do not
respond to changes in the number of vehicles: they give the same priority to both directions.</p>
          <p>That figure also shows that, in an ideal scenario with no noise (no deceleration), our mechanism
performs similarly to the greedy and to the other learning methods. Although in this case the
greedy strategy is the best, so that learning brings no advantage, we can imagine that better solutions
exist in which the system learns to identify and to deal with the different types of traffic patterns.</p>
          <p>Next, in Figure 3 we show the performance of all methods in a scenario in which vehicles have a
deceleration probability of 0.1. In this case, the braking noise makes the greedy method and
Q-Learning perform worse than our method and PS. Q-Learning performs poorly
because it learns to cope with the equal flow during the first context and then has trouble relearning
the subsequent contexts.</p>
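For reference, a minimal tabular Q-Learning agent for one traffic light looks as follows; the hyperparameter values are illustrative, not taken from the paper:

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1  # illustrative hyperparameters
ACTIONS = (1, 2, 3)                    # the three signal plans
Q = defaultdict(float)                 # Q[(state, action)] -> value

def choose_plan(state, rng=random):
    # Epsilon-greedy exploration: occasionally tries every plan,
    # including ones that make no sense for the current state.
    if rng.random() < EPSILON:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state):
    # Standard one-step Q-Learning backup.
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
```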
          <p>[Figure: total number of stopped vehicles over simulated time (12000 to 84000 seconds) for Fixed Plans, Greedy, QL, PS, and RL-CD.]</p>
          <p>The results in the other two scenarios (with deceleration probabilities 0.2 and 0.3) are shown
in Figure 4 and Figure 5, respectively. As we can observe, the deceleration probability causes the traffic
lights to always observe the same state (in this case, high occupation in all input lanes), which
means that none of the learning methods is able to cope with the over-saturated network. For
this same reason, RL-CD is not able to build interesting models of the relevant attributes of the
dynamics. However, since Q-Learning is model-free, it is less prone to wrong bias caused by the
non-stationarity. Prioritized Sweeping, on the other hand, tries to build a single model for the
environment and ends up with a model which mixes properties of different traffic patterns. For this
reason, it can at most calculate a policy which is a compromise for several different (and sometimes
opposite) traffic patterns. The fixed plans and the learning mechanisms show the same performance,
indicating that the learning mechanisms end up using only the fixed plan.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusions</title>
      <p>Centralized approaches to traffic signal control cannot cope with the increasing complexity of urban
traffic networks. A trend towards decentralized control was already pointed out by traffic experts
in the 1980s, and traffic-responsive systems for the control of traffic lights have been implemented.</p>
      <p>
        Here we have used reinforcement learning methods capable of dealing with non-stationary traffic
patterns in a microscopic simulator. This kind of simulator allows us to test several sources of
non-stationarity. In this paper, we considered specifically the deceleration probability, since it is an
important characteristic of drivers that is not present in macroscopic simulation models. The results
show that, in the face of a higher degree of non-stationarity, the learning mechanisms have more difficulty
in recognizing and discriminating the states than when a macroscopic model was used (for example,
in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]). Moreover, we have presented empirical results which show that RL-CD is more efficient
than classical Q-Learning and a greedy method in a non-stationary, noisy scenario, provided that
the real traffic state is not aliased by the coarse discretization. In other words, the use of reinforcement
learning techniques, such as RL-CD, Q-Learning and Prioritized Sweeping, requires either that the
network is not over-saturated at all times, or that states are discretized in a finer-grained manner.
      </p>
      <p>The conclusion regarding the poor performance of Q-Learning can be generalized to any scenario
involving learning in traffic lights, and was in fact expected. This is so because Q-Learning relies
on experimentation over the whole table of state-action pairs. However, it is clear that some of
these pairs make no sense and should not have been tried. For example, if TL4 in Figure 1 has
state empty in the vertical direction and full in the horizontal direction, then Q-Learning should
not even try the action signal plan 2 (which prioritizes the vertical direction), but it does so for
the sake of exploration, thus increasing the congestion level in the horizontal link. This means that
one must be cautious when using methods which require real-time learning (such as RL) in online
scenarios like traffic control.</p>
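One way to make this concrete: exploration could be restricted to state-consistent actions, for example by pruning plans that prioritize an empty direction (a sketch of our own, not something implemented in the paper):

```python
def sensible_actions(north_level, east_level):
    """Prune actions that contradict the observed state, as argued in the
    text: never prioritize a direction whose incoming link is empty."""
    actions = {1, 2, 3}
    if north_level == "empty" and east_level != "empty":
        actions.discard(2)  # plan 2 prioritizes the (empty) vertical direction
    if east_level == "empty" and north_level != "empty":
        actions.discard(3)  # plan 3 prioritizes the (empty) horizontal direction
    return sorted(actions)
```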
      <p>We are extending the ITSUMO tool to make possible the simulation of goal-driven drivers, in
order to test scenarios with co-learning of traffic lights and drivers in a microscopic simulation
model.</p>
      <p>The joint project “Caracterização de Estratégias de Controle em Sistemas Multiagentes
Heterogêneos” is supported by the bilateral agreement between Brazil and Portugal (CAPES-GRICES).
Ana Bazzan is partially supported by the Alexander v. Humboldt Foundation and by CNPq; Bruno
C. da Silva is supported by CNPq; and Denise de Oliveira is supported by CAPES.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Bazzan</surname>
            ,
            <given-names>A. L. C.</given-names>
          </string-name>
          <article-title>A distributed approach for coordination of traffic signal agents</article-title>
          .
          <source>Autonomous Agents and Multiagent Systems 10, 1 (March</source>
          <year>2005</year>
          ),
          <fpage>131</fpage>
          -
          <lpage>164</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Bazzan</surname>
            ,
            <given-names>A. L. C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>de Oliveira</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , Klügl, F., and
          <string-name>
            <surname>Nagel</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <article-title>Effects of co-evolution in a complex traffic network. (submitted).</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Camponogara</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Kraus Jr.</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          <article-title>Distributed learning agents in urban traffic control</article-title>
          .
          <source>In EPIA</source>
          (
          <year>2003</year>
          ),
          <string-name>
            <given-names>F.</given-names>
            <surname>Moura-Pires</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Abreu</surname>
          </string-name>
          , Eds., pp.
          <fpage>324</fpage>
          -
          <lpage>335</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Choi</surname>
            ,
            <given-names>S. P. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yeung</surname>
            , D.-Y., and Zhang,
            <given-names>N. L.</given-names>
          </string-name>
          <article-title>Hidden-mode Markov decision processes for nonstationary sequential decision making</article-title>
          .
          <source>In Sequence Learning - Paradigms, Algorithms, and Applications</source>
          (London, UK,
          <year>2001</year>
          ), Springer-Verlag, pp.
          <fpage>264</fpage>
          -
          <lpage>287</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>de Oliveira</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Bazzan</surname>
            ,
            <given-names>A. L. C.</given-names>
          </string-name>
          <article-title>Traffic lights control with adaptive group formation based on swarm intelligence</article-title>
          .
          <source>In Proceedings of the 5th International Workshop on Ant Colony Optimization and Swarm Intelligence</source>
          , ANTS
          <year>2006</year>
          (Brussels,
          <year>September 2006</year>
          ),
          <string-name>
            <given-names>M.</given-names>
            <surname>Dorigo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. M.</given-names>
            <surname>Gambardella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Birattari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Martinoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Poli</surname>
          </string-name>
          , and T. Stuetzle, Eds., Lecture Notes in Computer Science, Springer, pp.
          <fpage>520</fpage>
          -
          <lpage>521</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Doya</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Samejima</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Katagiri</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Kawato</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <article-title>Multiple model-based reinforcement learning</article-title>
          .
          <source>Neural Computation</source>
          <volume>14</volume>
          ,
          <issue>6</issue>
          (
          <year>2002</year>
          ),
          <fpage>1347</fpage>
          -
          <lpage>1369</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Nagel</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Schreckenberg</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <article-title>A cellular automaton model for freeway traffic</article-title>
          .
          <source>Journal de Physique I</source>
          <volume>2</volume>
          (
          <year>1992</year>
          ),
          <fpage>2221</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Nunes</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Oliveira</surname>
          </string-name>
          , E. C.
          <article-title>Learning from multiple sources</article-title>
          .
          <source>In Proceedings of the 3rd International Joint Conference on Autonomous Agents and Multi Agent Systems</source>
          , AAMAS (New York, USA,
          <year>July 2004</year>
          ), vol.
          <volume>3</volume>
          , New York, IEEE Computer Society, pp.
          <fpage>1106</fpage>
          -
          <lpage>1113</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Oliveira</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bazzan</surname>
            ,
            <given-names>A. L. C.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Lesser</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <article-title>Using cooperative mediation to coordinate traffic lights: a case study</article-title>
          .
          <source>In Proceedings of the 4th International Joint Conference on Autonomous Agents and Multi Agent Systems (AAMAS)</source>
          (
          <year>July 2005</year>
          ), New York, IEEE Computer Society, pp.
          <fpage>463</fpage>
          -
          <lpage>470</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Silva</surname>
            ,
            <given-names>B. C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oliveira</surname>
          </string-name>
          , D. d.,
          <string-name>
            <surname>Bazzan</surname>
            ,
            <given-names>A. L. C.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Basso</surname>
            ,
            <given-names>E. W.</given-names>
          </string-name>
          <article-title>Adaptive traffic control with reinforcement learning</article-title>
          .
          <source>In Proceedings of the 4th Workshop on Agents in Traffic and Transportation (AAMAS</source>
          <year>2006</year>
          ) (May
          <year>2006</year>
          ),
          <string-name>
            <given-names>A. L. C.</given-names>
            <surname>Bazzan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chaib-draa</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Klügl</surname>
          </string-name>
          , Eds., pp.
          <fpage>80</fpage>
          -
          <lpage>86</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Silva</surname>
            ,
            <given-names>B. C. d.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Basso</surname>
            ,
            <given-names>E. W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bazzan</surname>
            ,
            <given-names>A. L. C.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Engel</surname>
            ,
            <given-names>P. M.</given-names>
          </string-name>
          <article-title>Dealing with non-stationary environments using context detection</article-title>
          .
          <source>In Proceedings of the 23rd International Conference on Machine Learning (ICML</source>
          <year>2006</year>
          ) (
          <year>June 2006</year>
          ),
          <string-name>
            <given-names>W. W.</given-names>
            <surname>Cohen</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Moore</surname>
          </string-name>
          , Eds., ACM Press, pp.
          <fpage>217</fpage>
          -
          <lpage>224</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Silva</surname>
            ,
            <given-names>B. C. d.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Junges</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oliveira</surname>
          </string-name>
          , D. d., and
          <string-name>
            <surname>Bazzan</surname>
            ,
            <given-names>A. L. C.</given-names>
          </string-name>
          <article-title>ITSUMO: an intelligent transportation system for urban mobility</article-title>
          .
          <source>In Proceedings of the 5th International Joint Conference On Autonomous Agents And Multiagent Systems (AAMAS</source>
          <year>2006</year>
          ) - Demonstration Track
          (May
          <year>2006</year>
          ),
          <string-name>
            <given-names>P.</given-names>
            <surname>Stone</surname>
          </string-name>
          and G. Weiss, Eds., ACM Press, pp.
          <fpage>1471</fpage>
          -
          <lpage>1472</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Sutton</surname>
            ,
            <given-names>R. S.</given-names>
          </string-name>
          <article-title>Planning by incremental dynamic programming</article-title>
          .
          <source>In Proceedings of the Eighth International Workshop on Machine Learning</source>
          (
          <year>1991</year>
          ), Morgan Kaufmann, pp.
          <fpage>353</fpage>
          -
          <lpage>357</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>