     A reinforcement learning approach with Fourier basis
     linear function approximation for traffic signal control
                                 Theresa Ziemke¹ and Lucas N. Alegre² and Ana L. C. Bazzan²


Abstract. Reinforcement learning is an efficient, widely used machine learning technique that performs well when the state and action spaces have a reasonable size. This is rarely the case regarding control-related problems, as for instance controlling traffic signals, where the state space can be very large. In order to deal with the curse of dimensionality, a rough discretization of such a space can be used, but this is effective only up to a certain point. A way to mitigate this is to use techniques that generalize the state space, such as function approximation. In this paper, a linear function approximation is used. Specifically, SARSA(λ) with Fourier basis features is implemented to control traffic signals in the agent-based transport simulation MATSim. The results are compared not only to trivial controllers such as fixed-time, but also to state-of-the-art rule-based adaptive methods. It is concluded that SARSA(λ) with Fourier basis features is able to outperform such methods, especially in scenarios with varying traffic demands.

1     Introduction

Traffic signal control is a challenging real-world problem. Current solutions to this problem, such as adaptive systems like SCOOT [13], are often centralized, or at least partially centralized if each controller is in charge of a portion of the urban network. Alternatives are manual interventions from traffic operators or the use of fixed-time signal plans. However, in the era of big data and advanced computing power, other paradigms are becoming more and more prominent. Among these we find those derived from machine learning in general, and reinforcement learning (RL) in particular. In RL, traffic signal controllers located at intersections can be seen as autonomous agents that learn while interacting with the environment.

The use of RL is associated with challenging issues: the environment is dynamic (and thus agents must be highly adaptive), and agents must react to changes in the environment at the individual level while also causing an unpredictable collective pattern, as they act in a coupled environment. Therefore, traffic signal control poses many challenges for standard techniques of multiagent learning.

To understand these challenges, let us first discuss the single agent case, where one agent performs an action in a given state and learns by getting a signal (reward) from the environment. To put it simply, RL techniques are based on estimates of values for state-action pairs (the so-called Q-values). These values may be represented as a table with one entry for each state-action pair. This works well in single agent problems and/or when the number of states and actions is small. However, in [22] Sutton and Barto discuss two drawbacks of this approach: First, a lot of memory is necessary to keep large tables when the number of state-action pairs is huge, which tends to be the case in real-world applications. Second, a long exploration time is required to fill such tables accurately. Those authors then suggest that generalization techniques may help in addressing this so-called curse of dimensionality.

An efficient representation of the states is a key factor that may limit the use of standard RL algorithms in problems that involve several agents. Moreover, in scenarios in which the states are represented as continuous values, estimating the state value by means of tabular Q-values may not be feasible. To deal with this problem, in this paper a true online SARSA(λ) algorithm with Fourier basis linear function approximation is used. As discussed ahead, this option is based on the fact that non-linear function approximation has several drawbacks.

The RL-based adaptive signal control algorithm was implemented in the open-source agent-based transport simulation MATSim [12]. In MATSim, it is possible to investigate the impact of the RL-based adaptive signal control algorithm and compare it to other fixed-time or adaptive signal control methods. For comparison, we run our approach against a rule-based adaptive signal control algorithm based on Lämmer and Helbing [16], which was implemented in MATSim in a previous study [15, 23]. The results show that the RL-based approach is able to outperform these approaches in a single intersection scenario. This is especially notable, as these approaches were designed specifically for dealing with the control of signals, whereas the RL-based approach needs no domain knowledge. To the authors' best knowledge, virtually no other work in the literature (especially those stemming from the RL area) includes such a comparison. More often than not, comparison of RL approaches is made only to a fixed-time scheme.

The remainder of this paper is organized as follows. The next section discusses background and related work. Sect. 3 describes the two adaptive signal control approaches used in this study: the rule-based signal control based on Lämmer and the RL-based approach. Experiments and results are presented in Sect. 4, whereas Sect. 5 contains a discussion of the results and future work.

¹ Technische Universität Berlin, Germany, email: tziemke@vsp.tu-berlin.de
² Universidade Federal do Rio Grande do Sul, Brazil, email: {lnalegre,bazzan}@inf.ufrgs.br

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2     Background and related work

2.1    Traffic signal control

In contrast to fixed-time signals that cyclically repeat a given signal plan, traffic-responsive signals react to current traffic by adjusting signal states based on sensor data (e.g., from upstream induction loops or cameras). They can, therefore, react to changes in demand and reduce emissions and waiting times more efficiently.
A variety of traffic-responsive signal control algorithms have been developed. An overview is given, e.g., by Friedrich [6]. Different levels of adjustment are distinguished: actuated signals use a fixed-time base plan and adjust parameters like green split, cycle time or offset. (Fully) adaptive signals decide about the signal states on the fly. They can modify phase orders or even combine signals into different phases over time. With this, the flexibility of the signal optimization is augmented, which increases the possible improvement, but makes the optimization problem more complex. In order to reduce complexity and communication effort between sensors and a central computation unit (which controls signal states system-wide), decentralized (also called self-controlled) methods decide locally about signal states without complete knowledge of the system. Usually, every signalized intersection has its own processing unit that accounts for upstream (and sometimes downstream) sensor data of all approaches. A challenge of decentralized systems is to still ensure system-wide stability, especially when dealing with oversaturated conditions. A number of methods have been developed that tackle these challenges.

Examples of traffic-responsive approaches from various generations and technological bases are: SCOOT [13]; SCATS [17]; TUC (Traffic-responsive Urban Traffic Control) [5]; and TUC combined with predictive control [4]. Some can be considered rule-based (as, for example, Lämmer and Helbing [16]), while others use techniques from RL and model signals as learning agents (see Sect. 2.2).

2.2    Reinforcement learning

In RL, an agent's goal is to learn an optimal control policy π*, which maps a given state to the best appropriate action by means of a value function. We can model RL as a Markov decision process (MDP) composed of a tuple (S, A, T, R), where S is a set of states; A is a set of actions; T is the transition function that models the probability of the system moving from a state s ∈ S to a state s′ ∈ S upon performing action a ∈ A; and R is the reward function that yields a real number associated with performing an action a ∈ A when one is in state s ∈ S. An experience tuple ⟨s, a, s′, r⟩ denotes the fact that the agent was in state s, performed action a and ended up in s′ with reward r. Let t denote the t-th step of the policy π. In an infinite horizon MDP, the cumulative reward in the future under policy π is defined by the Q-function, Eq. 1, where γ ∈ [0, 1] is the discount factor for future rewards.

              Q^π(s, a) = E[ Σ_{τ=0}^{∞} γ^τ r_{t+τ} | s_t = s, a_t = a, π ]                (1)

Since the agent's objective is to maximize the cumulative reward, if it has learned the optimal Q-values Q*(s, a) for all state-action pairs, then the optimal control policy π* is as follows³:

              π*(s) = argmax_a Q*(s, a)   ∀s ∈ S, a ∈ A.                (2)

³ For convergence guarantees, in the case of QL, please see [27].
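For illustration, the sketch below shows how Eq. 2 translates into code for the tabular case: Q-values are stored per state-action pair and the greedy policy simply picks the action with the highest estimate. This is a minimal illustrative sketch, not the implementation used in this paper; all names are hypothetical.

```python
from collections import defaultdict
import random

# Tabular Q-values: one entry per (state, action) pair, as discussed above.
Q = defaultdict(float)  # Q[(state, action)] -> estimated cumulative reward

def greedy_action(state, actions):
    """Eq. 2: pi*(s) = argmax_a Q(s, a)."""
    return max(actions, key=lambda a: Q[(state, a)])

def epsilon_greedy_action(state, actions, epsilon=0.01):
    """Explore with probability epsilon, otherwise act greedily."""
    if random.random() < epsilon:
        return random.choice(actions)
    return greedy_action(state, actions)
```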
RL methods can be divided into two categories: model-based methods assume that the transition function T and the reward function R are available, or instead try to learn them; model-free methods, on the other hand, do not require that the agent have access to information about how the environment works.

There are many studies that use RL to improve traffic signal performance. Due to space restrictions, we refer the reader to some survey papers (with different purposes): [3, 18, 29, 30].

Using RL for traffic signal control is especially promising, as one does not need a lot of domain knowledge (as opposed to, e.g., rule-based approaches); rather, the controller learns a policy by itself. However, issues may arise with the aforementioned curse of dimensionality. In fact, depending on the specific formulation (e.g., how state and action spaces are defined), the search space can be very large. For instance, consider an intersection with four incoming approaches and three lanes per approach. If we define the state as the queue length of each lane discretized into 10 levels, we would have 10^(4×3) = 10^12 distinct possible states. The reader is referred to [30] for several variants of such formulations.

In [18, 20, 21], RL is used by traffic signals in order to learn a policy that maps states (normally queues at junctions) to actions (normally keeping/changing the current split of green times among the lights of each phase). In [21] the approach is centralized (a single entity holds the MDP for all traffic signals); a central authority receives information about the length of the queues and the elapsed time from various lanes to make a decision about timings at each signal. On the other hand, the approaches in [18] and [20] are decentralized: each junction learns independently (normally using QL).

Since most of these works use QL, and thus approximate the Q-function as a table, they may fall prey to the curse of dimensionality. This arises when one deals with realistic scenarios, e.g., those beyond the 2-phase intersections that are common in the literature.

In order to address this, a few works have used function approximation. For instance, [1] uses tile coding for function approximation. However, the definition of states considers only the queue length.

Recently, many studies have achieved impressive results using deep neural networks to approximate the Q-function (e.g., DQN [19, 24, 31]). However, linear function approximation has guaranteed convergence and error bounds, whereas non-linear function approximation is known to diverge in multiple cases [2, 22]. Moreover, linear function approximation is less computation-intensive, as it relies on significantly fewer parameters. Thus, if the Q-function can be linearly approximated with sufficient precision, linear function approximation methods are preferable.

2.3    Transport simulation

As deployment, operations, and maintenance costs of traffic-responsive signals are in general high, transport simulation tools provide a perfect environment to systematically test and evaluate new signal control methods before applying them in the field.

The agent-based transport simulation MATSim [12], which is used in this study, is especially suitable in this regard, as it is able to run large-scale real-world simulations in reasonable time. Simulations can be built based on open data (see, e.g., the open Berlin scenario [32]) such that the impact of new signal control approaches can be easily analyzed for arbitrary scenarios⁴ and compared to other control methods. Because of its agent-based structure, agent-specific waiting times and varying queue lengths over time at traffic lights can be directly analyzed and compared.

⁴ An example of how to start a MATSim simulation using the RL signal control presented in this paper can be found at http://matsim.org/javadoc → signals → RunSarsaLambdaSignalsExample.

In MATSim, traffic is modeled by agents (i.e., persons) that follow a daily plan of activities and trips. Traffic flow is modelled mesoscopically by spatial first-in-first-out (FIFO) queues. Vehicles at the head of a queue can leave a link when the following criteria are fulfilled: (1) the link's free-flow travel time has passed, (2) the flow capacity of the link is not exceeded in the given time step, and (3) there is enough space on the next link. Despite this simplistic modeling approach, congestion as well as spillback can be modeled.
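To make the queue model concrete, the following sketch checks the three exit criteria for the vehicle at the head of a link's FIFO queue. It is a simplified illustration of the mesoscopic queue model described above, not MATSim's actual implementation; all class and field names are hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Link:
    free_flow_tt: float        # free-flow travel time (s)
    flow_capacity: float       # max. vehicles allowed to leave per time step
    storage_capacity: int      # max. vehicles the link can hold
    queue: deque = field(default_factory=deque)  # FIFO queue of (veh_id, entry_time)
    outflow_this_step: int = 0

def can_leave(link: Link, next_link: Link, now: float) -> bool:
    """Check the three exit criteria for the head of the FIFO queue."""
    if not link.queue:
        return False
    veh_id, entry_time = link.queue[0]
    # (1) the link's free-flow travel time has passed
    if now - entry_time < link.free_flow_tt:
        return False
    # (2) the flow capacity of the link is not exceeded in this time step
    if link.outflow_this_step >= link.flow_capacity:
        return False
    # (3) there is enough space on the next link
    if len(next_link.queue) >= next_link.storage_capacity:
        return False
    return True
```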
(a) Graph structure of a link with multiple lanes. (b) Multiple queues; spillback is captured correctly.
Figure 1: Links with multiple lanes in MATSim. Each lane is represented by its own FIFO queue. Traffic signal control for different turning moves is captured. Vehicles on different lanes can pass each other, unless the queue spills over. Source: [9]

The traffic signal control module was developed by Grether as an extension to MATSim [10]. If a signal exists on a link, leaving the link is not possible while it shows red. First studies focused on fixed-time signals, but approaches for traffic-responsive signal control have also been implemented [8, 15, 23]. Separate waiting queues at intersections can be modeled in MATSim by lanes (see Fig. 1), which is especially useful to model protected left turns. Signals and lanes in MATSim are described more extensively by Grether and Thunig [9].

Events of vehicles entering or leaving links and lanes are thrown at a second-by-second time resolution in the simulation. Sensors on links or lanes that detect single vehicles can easily be modeled by listening to these events. As in reality, the maximum forecast period of such sensors is limited – vehicles can only be detected once they have entered the link. If a link is short, forecasts might not be accurate. In the simulation, responsive signals use these sensor data to react dynamically to approaching vehicles. For every signalized intersection, the control unit is called every second to decide about the current signal states. With that, RL-based signal control approaches can also be easily installed into the simulation framework.

In general, MATSim can model user reactions such as route, mode or departure time changes. For this paper, however, only the traffic flow simulation of MATSim is used. Readers interested in the evolutionary part of MATSim – i.e., how agents adapt their plans and how long-term effects can be analyzed – are referred to [12].

3     Methods

In this section, the two adaptive signal control approaches used in this study are presented: the rule-based signal control algorithm based on Lämmer and the RL approach based on true online SARSA(λ) with linear function approximation.

3.1    Lämmer's rule-based adaptive traffic signal control algorithm

The idea of the self-controlled signals proposed by Lämmer and Helbing [16] is to minimize waiting times and queue lengths at decentralized intersections while also granting stability through minimal service intervals. The algorithm combines two strategies. The optimizing strategy selects the signal phase i to be served next as the one with the highest priority index π_i (see Eq. 3), which takes into account outflow rates and queue lengths of waiting and approaching vehicles that are registered by sensors. Given a prediction of the expected queue length n̂_i(t, τ) at time τ > t and the maximum outflow rate q_i^max for phase i, one can derive the expected required green time ĝ_i(t, τ) for clearing the queue at time t using ĝ_i(t, τ) = n̂_i(t, τ) / q_i^max. With this, the priority index is calculated as follows:

              π_i(t) = max_{τ_i(t) ≤ τ ≤ τ_i^0}  n̂_i(t, τ) / (τ + ĝ_i(t, τ)),                         if i = σ(t)
              π_i(t) = n̂_i(t, τ_i^0) / (τ_σ(t)^pen(t) + τ_i^0 + ĝ_i(t, τ_i^0)),                       if i ≠ σ(t)          (3)

Two cases are distinguished depending on whether the phase i is already active or not. In either case, the equation basically divides the number of vehicles by the time needed to clear the queue, including the (remaining) intergreen time. The priority index can, therefore, be interpreted as a clearance efficiency rate. τ includes either the effect of the remaining intergreen time for the selected phase (when it has not yet switched to green), or a lookahead beyond the end of the current queue. It is bounded from below by the remaining intergreen time τ_i(t), since that time, if larger than zero, will be incurred before traffic can flow, and from above by the full intergreen time τ_i^0, since beyond that it is possible to just switch back from some other state. For a non-active phase (i.e., i ≠ σ(t)), the priority index is reduced by a canceling penalty τ_σ(t)^pen(t). This prevents the optimizing regime from frequently switching signal phases. The penalty can be interpreted as the average additional waiting time for vehicles at the previously served links that would occur upon cancellation. The priority index as defined in Eq. 3 assumes that each signal phase serves only one link – which is why phases and links are both denoted by i here. The algorithm was further extended to be able to deal with realistic traffic situations like lanes, phase combinations with opposing traffic, minimum green times, and overload. Since these extensions make the equation less readable while the main method stays the same, the authors refer to Thunig et al. [23] for more details.

An enclosing stabilizing strategy ensures that each link is served at least once during a specified minimal service interval to prevent spillbacks. Links that have to be stabilized are added to a stabilization queue. If the queue is non-empty, the phase corresponding to the first element of the queue is switched to green for a guaranteed green time g_i^s depending on the average capacity utilization. If the stabilization queue is empty, the optimizing strategy takes over. Lämmer's control claims to provide intrinsic green waves and locally optimal service, which also results in system-wide optimal service.

An assumption of Lämmer's algorithm is the queue representation of traffic flow: if a link i is served, vehicles can leave the link with a constant outflow rate q_i^max, which is assumed to be known. Additionally, queues are assumed to be non-spatial, i.e., the algorithm does not account for vehicles spilling back to upstream lanes or links. Demand is supposed to be manageable on average with the desired cycle time T to ensure stability.

Two sensors are used to predict the number of waiting vehicles per link and time. One is positioned at the end of the link to detect waiting and outflowing vehicles; the second one is located further upstream to detect approaching vehicles. Assuming free flow conditions at link i, one can estimate the length of the queue n_i(t) at time t and predict the expected queue length n̂_i(t, τ) at a time τ > t. While the estimation of queue lengths allows for uncertainty, the mere presence of a queue is definite.
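As an illustration of Eq. 3, the sketch below computes the priority index of one phase from predicted queue lengths and the maximum outflow rate, distinguishing the currently active phase from the others. It is a simplified reading of the optimizing strategy under the single-link-per-phase assumption stated above, not the extended implementation of Thunig et al. [23]; all function and variable names are hypothetical, and the queue prediction n̂ is passed in as a callable.

```python
def required_green_time(n_hat, q_max):
    """g_i(t, tau) = n_i(t, tau) / q_i^max: time needed to clear the predicted queue."""
    return n_hat / q_max

def priority_index(phase, active_phase, n_hat_fn, q_max,
                   tau_remaining, tau_intergreen, cancel_penalty, t, step=1.0):
    """Priority index pi_i(t) of Eq. 3 for one phase.

    n_hat_fn(t, tau) -> predicted queue length of the phase at time tau > t
    tau_remaining    -> remaining intergreen time tau_i(t) of the phase
    tau_intergreen   -> full intergreen time tau_i^0 of the phase
    cancel_penalty   -> canceling penalty tau^pen of the currently active phase
    """
    if phase == active_phase:
        # Active phase: search tau in [tau_i(t), tau_i^0] for the best clearance rate.
        best = 0.0
        tau = max(tau_remaining, step)
        while tau <= tau_intergreen:
            n_hat = n_hat_fn(t, tau)
            best = max(best, n_hat / (tau + required_green_time(n_hat, q_max)))
            tau += step
        return best
    # Non-active phase: fixed lookahead tau_i^0 plus the canceling penalty.
    n_hat = n_hat_fn(t, tau_intergreen)
    return n_hat / (cancel_penalty + tau_intergreen + required_green_time(n_hat, q_max))

# The optimizing strategy then serves the phase with the highest index:
# next_phase = max(phases, key=lambda i: priority_index(i, active_phase, ...))
```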
3.2    True online SARSA(λ) traffic signal controller

The proposed RL traffic signal controller implements the true online SARSA(λ) algorithm [26], a modification of the traditional SARSA(λ) that has been demonstrated to have better theoretical properties and to outperform the original method [25]. As detailed later, we use two kinds of features, thus impacting the state space. In order to deal with high-dimensional state spaces, the Q-function was linearly approximated using the Fourier basis scheme [14].

When linear approximation is used, the Q-values Q(s, a) for each discrete action a are approximated as a weighted sum of a set of m basis functions φ_1, ..., φ_m, as in Eq. 4, where θ is the vector of weights.

              Q(s, a) = θ · φ(s, a) = Σ_{i=1}^{m} θ_i φ_i(s, a)                (4)

The Fourier series is one of the most commonly used continuous function approximation methods, presenting solid theoretical foundations. In [14], it was empirically shown that the Fourier basis outperforms other commonly used approximation methods, such as polynomial and radial basis functions, in continuous RL domains.

When applying Fourier series to the RL setting, it is possible to drop the sin terms of the series⁵. Then, for an n-th order Fourier approximation, each basis function φ_i is defined as in Eq. 5, where c_i = [c_1, ..., c_k] is a vector that attaches an integer coefficient c_j ∈ {0, ..., n} (1 ≤ j ≤ k) to each feature in s, and k is the dimension of the state space.

              φ_i(s, a) = cos(π c_i · s)  if a = a_t;    0  if a ≠ a_t                (5)

The basis set of functions φ_1, ..., φ_m is obtained by systematically varying these coefficients. Each coefficient determines the basis function's frequency along its dimension. Furthermore, as we increase the order n of the approximation, more frequencies are used.

⁵ For a detailed explanation, please see [14].

After the execution of action a_t, the weights θ are updated via gradient descent, following the true online SARSA(λ) with linear function approximation update rule, as in Eq. 6, where δ = r_t + γQ(s_{t+1}, a_{t+1}) − Q(s_t, a_t) is the temporal difference error and Q_old is a scalar temporary variable initialized with zero and set to Q_old ← Q(s_{t+1}, a_{t+1}) after every step.

              θ ← θ + α(δ + Q − Q_old)e − α(Q − Q_old)φ                (6)

The eligibility traces vector e – which is used to address the credit assignment problem – is updated as in Eq. 7. Each weight update also takes into account previously visited states, which are credited according to the values accumulated in the vector e. The parameter λ controls the decay of the eligibility traces at each time step.

              e ← γλe + φ − αγλ(eᵀφ)φ                (7)

Given the base learning rate α, each weight θ_i is updated with the scaled learning rate α_i = α/||c_i||₂, as proposed in [14]. Both the weights and eligibility traces vectors are initialized with zeros.

In order to address the exploration–exploitation dilemma, the ε-greedy exploration strategy is used to choose actions: the action with the highest Q-value is selected with probability 1 − ε and a random action is selected with probability ε.
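To make Eqs. 4–7 concrete, the sketch below implements the Fourier basis features and the true online SARSA(λ) update with per-weight scaled learning rates, following [14, 26] and assuming state features scaled to [0, 1]. It is a compact illustrative sketch, not the code used in the experiments reported here; all names are hypothetical.

```python
import numpy as np

class FourierSarsaLambda:
    def __init__(self, coeffs, n_actions, alpha=1e-6, gamma=0.95, lam=0.1):
        self.C = np.asarray(coeffs, dtype=float)   # (m, k) integer coefficient vectors c_i
        self.gamma, self.lam = gamma, lam
        # Scaled learning rates alpha_i = alpha / ||c_i||_2 (plain alpha for c_i = 0).
        norms = np.linalg.norm(self.C, axis=1)
        self.alpha = np.where(norms > 0, alpha / np.maximum(norms, 1e-12), alpha)
        m = self.C.shape[0]
        self.theta = np.zeros((n_actions, m))      # weights, one block per discrete action
        self.e = np.zeros((n_actions, m))          # eligibility traces
        self.q_old = 0.0

    def features(self, s):
        """Eq. 5: phi_i(s) = cos(pi * c_i . s) for a state s scaled to [0, 1]^k."""
        return np.cos(np.pi * self.C @ np.asarray(s, dtype=float))

    def q_value(self, phi, a):
        """Eq. 4: Q(s, a) = theta_a . phi(s)."""
        return float(self.theta[a] @ phi)

    def update(self, s, a, r, s_next, a_next):
        """True online SARSA(lambda) update (Eqs. 6 and 7)."""
        phi, phi_next = self.features(s), self.features(s_next)
        q, q_next = self.q_value(phi, a), self.q_value(phi_next, a_next)
        delta = r + self.gamma * q_next - q
        # Eq. 7: trace update (only the taken action has non-zero phi(s, a)).
        dot = float(self.e[a] @ phi)
        self.e *= self.gamma * self.lam
        self.e[a] += phi - self.alpha * self.gamma * self.lam * dot * phi
        # Eq. 6: weight update with scaled learning rates.
        self.theta += (self.alpha * (delta + q - self.q_old)) * self.e
        self.theta[a] -= self.alpha * (q - self.q_old) * phi
        self.q_old = q_next
```

An agent step would call features and q_value for ε-greedy action selection and then update once per three-second decision interval.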
Next we give the formulations that are specific to the domain of signal control.

State Space.   In RL problems, the definition of the state space strongly influences the agents' behavior and performance. In traffic signal control, for instance, information related to the level of congestion in the approaching lanes is fundamental in order to appropriately choose the next active signal phase.

In the present setting, the agent observes a vector s_t ∈ R^k at each time step t. This vector partially represents the true state of the controlled intersection and is defined as in Eq. 8, where E is the set of all links of the intersection and L is the set of all approaching lanes, ρ_i ∈ {0, 1} is a binary feature that is active when i is the currently selected signal phase, τ ∈ [0, 1] is the elapsed time of the current signal phase divided by the maximal green time g_max, the density ∆_e ∈ [0, 1] is defined as the number of vehicles on link e ∈ E divided by its storage capacity, and q_l ∈ [0, 1] is defined as the number of queued vehicles on lane l ∈ L divided by the storage capacity of the lane.

              s_t = [ρ_1, ..., ρ_|σ|, τ, q_1, ..., q_|L|, ∆_1, ..., ∆_|E|]                (8)

This state definition is inspired by [7], where the authors achieved similar performance levels even when using more complex state definitions (e.g., including the positions of each vehicle in the approaching lanes).

As is common in the literature, the proposed RL signal control is only called every three seconds. This means that one time step for the traffic signal agent corresponds to three seconds of simulation. This reduces the complexity and the size of the state space without significantly reducing the performance.
Action Space.   At each time step t (every three seconds), the traffic signal controller chooses a discrete action a_t ∈ A. In our setting, the number of actions is equal to the number of possible signal phases, therefore |A| = |σ|. There are two restrictions on the action selection: the agent can change the currently active signal phase only if the elapsed time is greater than or equal to the minimal green time g_min, and can keep it only if the elapsed time is less than the maximal green time g_max. These restrictions ensure the feasibility of the signal controller for real-world applications.
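The two restrictions above translate into a simple feasibility mask over the phase actions, as sketched below (hypothetical names, assuming one action per signal phase):

```python
def allowed_actions(active_phase, elapsed_time, num_phases, g_min, g_max):
    """Actions (phases) that respect the minimal and maximal green time."""
    if elapsed_time < g_min:
        return [active_phase]                                          # must keep the phase
    if elapsed_time >= g_max:
        return [a for a in range(num_phases) if a != active_phase]     # must switch
    return list(range(num_phases))                                     # free choice in between
```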
Reward.   After taking action a_t, the traffic signal controller receives a scalar reward r_t ∈ R. As in [7], the reward is defined as the change in cumulative delay, as given in Eq. 9, where D_{a_t} and D_{a_{t+1}} represent the cumulative delay at the intersection before and after executing the action a_t.

              r_t = D_{a_t} − D_{a_{t+1}}                (9)

In turn, the cumulative vehicle delay D_t, for any time t, is computed as in Eq. 10, where V_t is the set of vehicles on incoming approaches and d_t^v is the delay of vehicle v at time t.

              D_t = Σ_{v ∈ V_t} d_t^v                (10)
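In code, the reward of Eqs. 9 and 10 amounts to comparing the summed vehicle delay before and after the chosen action, as in the sketch below (the per-vehicle delay accessor is a hypothetical stand-in for the corresponding simulation quantity):

```python
def cumulative_delay(vehicles_on_approaches, now):
    """Eq. 10: D_t = sum of the current delay of every vehicle on an incoming approach."""
    return sum(v.delay(now) for v in vehicles_on_approaches)

def reward(delay_before_action, delay_after_action):
    """Eq. 9: r_t = D_{a_t} - D_{a_{t+1}}; positive when the action reduced total delay."""
    return delay_before_action - delay_after_action
```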
4     Experiments and results

4.1    Scenario

This study focuses on a single intersection scenario with two set-ups, each for a different demand level. In both, the RL control is compared to a fixed-time signal control and to a rule-based traffic-responsive signal control based on [16] (as introduced in Sect. 3.1).
Figure 2: Single intersection scenario.

Traffic signals.   The single intersection featured here (see Fig. 2) has four incoming approaches. In the horizontal direction, there is a dedicated left-turning lane in each approach, as well as three lanes for straight traffic. In the vertical direction, there are two lanes for straight traffic.

Traffic signals are grouped into three non-conflicting signal phases: straight traffic in the horizontal direction; left-turning traffic in the horizontal direction; and the vertical direction. While switching between two signal phases, there is an all-red period of one second. The minimum green time for a signal phase is five seconds.

The fixed-time control that is used for comparison purposes is optimized by Webster's method [28]. It has a cycle time of 40 seconds and distributes green times according to average flow rates. The traffic-responsive signal approaches do not have a fixed cycle time: for Lämmer's control algorithm, a desired and a maximal cycle time can be defined (for this scenario, 40 and 60 seconds are used, respectively). For the RL control, a maximal green time of 30 seconds per signal phase is used. As mentioned in Sect. 3.2, the RL control is only called every three seconds to decide about new signal states.

Demand.   In a first set-up, there is traffic going straight in the horizontal direction, with 1800 vehicles approaching on average per hour in each of the two approaches. In the vertical direction, there are 600 vehicles on average per hour from each side – all going straight. Additionally, there are 180 vehicles on average per hour from both sides in the horizontal direction that want to turn left at the intersection. A period of 86,400 seconds (i.e., one day) is simulated.

In a second set-up, the demand is doubled during five time periods over the day, of 2,000 seconds length each, in order to analyze the effect of fluctuating demand on the performance of the RL controller. To be more precise, in the time intervals [0, 2,000), [20,000, 22,000), [40,000, 42,000), [60,000, 62,000), and [80,000, 82,000) the average flow rates in the horizontal direction are 3600 vehicles per hour going straight and 360 vehicles per hour going left per approach, whereas in the vertical direction the average flow rate per approach is 1200 vehicles per hour. During the rest of the simulation, the average flow rates are the same as in the first scenario set-up.

In both set-ups, arrival rates are stochastic: vehicles are inserted as platoons with a platoon size that is exponentially distributed around an expected value of five. The time gap between vehicle platoons is also exponentially distributed; its expected value is the platoon size divided by the average flow value.
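As an illustration of this arrival process, the sketch below draws platoon sizes and inter-platoon gaps from exponential distributions with the stated means; it is a simplified stand-in for the demand generation, not the actual MATSim input, and all names are hypothetical.

```python
import random

def generate_platoon_arrivals(avg_flow_veh_per_s, horizon_s, mean_platoon_size=5.0, seed=None):
    """Yield (insertion_time, platoon_size) pairs for one approach.

    Platoon sizes are exponentially distributed around mean_platoon_size; the gap
    between platoons is exponentially distributed with mean platoon_size / flow,
    so the long-run average flow matches avg_flow_veh_per_s.
    """
    rng = random.Random(seed)
    t = 0.0
    while t < horizon_s:
        size = max(1, round(rng.expovariate(1.0 / mean_platoon_size)))
        yield t, size
        t += rng.expovariate(avg_flow_veh_per_s / size)  # mean gap = size / flow

# Example: 1800 veh/h = 0.5 veh/s on a horizontal approach, one simulated day.
# arrivals = list(generate_platoon_arrivals(0.5, 86_400, seed=42))
```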
4.2    Results

The proposed method of true online SARSA(λ) with Fourier basis linear function approximation for signal control is applied to the single intersection scenario presented in the previous section and compared to RL signal control methods with other configurations (in Sect. 4.2.1, where our method is compared to a tabular variant), as well as to a fixed-time and a rule-based adaptive signal control approach (in Sect. 4.2.2).

Due to the stochastic arrival rates, the results presented here are averaged over 20 runs with different random seeds, whereby the random seed influences the platoon structure of approaching vehicles (the average flow rate stays the same).

The shaded area in the plots shown ahead depicts the standard deviation of the average delay or total queue length, respectively. The lines were smoothed with a moving average window of 300 seconds (i.e., 5 minutes) for better clarity.

4.2.1    Comparison with other RL signal control methods

Here we compare the proposed method with the traditional tabular SARSA(λ) [22], using the first set-up of the scenario presented in Sect. 4.1. We also discuss optimal settings regarding the order of the Fourier basis approximation, the state, and the reward.

Tabular vs. linear SARSA(λ).   In order to transform the continuous state space defined in Sect. 3.2 into a discrete state space for the tabular SARSA(λ), the queue q and density ∆ attributes were discretized into equally spaced bins/intervals. The binary features ρ_i for each phase are already discrete, and the feature τ has a finite number of possible values as the elapsed time increases in steps of five seconds; therefore, they did not need to be discretized.
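A sketch of this discretization is shown below: the continuous queue and density features in [0, 1] are mapped to bin indices so that each observation becomes a hashable key for a tabular Q-function (hypothetical helper, with the bin count as a parameter):

```python
def discretize_state(state, num_continuous, n_bins=8):
    """Map the last `num_continuous` entries of the state vector (the q and density
    features in [0, 1]) to bin indices; the remaining features are kept as they are."""
    discrete_part = tuple(state[:-num_continuous])
    binned = tuple(min(int(x * n_bins), n_bins - 1) for x in state[-num_continuous:])
    return discrete_part + binned  # hashable key for a tabular Q dictionary
```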
In order to allow a fair comparison, the same discount factor, value of λ, and exploration rate were used for both methods. The discount factor was set to γ = 0.95, λ = 0.1, and the exploration rate was set to ε = 0.01 (the latter means that the agent mostly takes the action with the highest Q-value, but still explores with a fixed low chance). For the tabular SARSA(λ), a learning rate of α = 0.1 was used, while for true online SARSA(λ) with linear function approximation, α = 10⁻⁶ was used. These values are common in the literature and produced the best results for each method after extensive experimentation with different values.

As the state space in this case is large, and the number of Fourier basis functions grows exponentially with the number of dimensions of the state space, it is necessary to restrict the number of basis functions. We can meet this condition by placing constraints on the coefficient vectors c_i. However, in this setting, adding coefficients with more than two non-zero elements did not improve the results. Thus, we further limited each coefficient vector c_i to have at most two non-zero elements.
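The sketch below shows one way to enumerate such restricted coefficient vectors: all vectors in {0, ..., n}^k with at most two non-zero entries, which keeps the number of basis functions quadratic rather than exponential in the state dimension k. It is an illustrative helper, not the original implementation.

```python
from itertools import combinations, product
import numpy as np

def restricted_fourier_coefficients(k, order):
    """Coefficient vectors c in {0,...,order}^k with at most two non-zero entries."""
    coeffs = {tuple([0] * k)}                      # the constant basis function
    for dims in combinations(range(k), 2):         # choose up to two active dimensions
        for values in product(range(order + 1), repeat=2):
            c = [0] * k
            c[dims[0]], c[dims[1]] = values
            coeffs.add(tuple(c))
    return np.array(sorted(coeffs))

# Example: a 7th-order basis over a 20-dimensional state; the result could feed the
# `coeffs` argument of the earlier FourierSarsaLambda sketch.
# C = restricted_fourier_coefficients(20, 7)
```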
In Fig. 3, the average delay per vehicle at each second of the simulation is depicted for true online SARSA(λ) with Fourier basis linear function approximation and for tabular SARSA(λ) with 8 vs. 10 discretization bins for the q and ∆ features.

With 10 bins, the learning is very slow, as the number of discretization bins exponentially increases the size of the state space. Reducing the number of bins to 8 significantly speeds up the learning and reduces the delay. However, by reducing the number of bins, different states (in which different actions are optimal) are perceived as the same, thus leading to sub-optimal performance in the long run.

The usage of function approximation not only avoids the curse of dimensionality, but also introduces generalization, i.e., when updating the Q-function after taking an action in a given state, similar states are also affected and have their Q-values changed. With that, the true online SARSA(λ) with Fourier basis linear function approximation results in a much faster learning curve and overall lower delay values.
Figure 3: Average delay for tabular and linear function approximation RL implementations.

Figure 4: Impact of different values for the order n of the Fourier basis approximation.

Order of the Fourier basis approximation.   Fig. 4 shows the impact that the value of the Fourier approximation order n has on the agent's performance. As expected, the higher the value of n, the more accurate is the approximation of the Q-function. Changing the order from n = 3 to n = 9 results in a notable reduction in average delay; however, when n is sufficiently high (n = 7 and n = 9), there is no further improvement. For this reason, the Fourier approximation order is fixed to n = 7 for all following experiments.

Figure 5: Impact of state definition.

State definition.   Although the q (queue) features provide the traffic signal control agent with the queue on each lane, the ∆ (density) features are also important, as they inform how many vehicles (that may be queued in the following seconds) there are on each link. Fig. 5 shows that, by removing the ∆ features from the state definition, the average delay increases. This difference might be even higher in scenarios with very high demand, where a high number of vehicles are moving and approaching the queues.

Figure 6: Impact of reward definition.

Reward definition.   The definition of the reward function has a high impact on the performance of the RL controller [11]: in Fig. 6, the reward function defined in Sect. 3.2 is compared to another reward function found in the literature [18], defined as the change in total queue length between successive actions. The traffic signal controller using the change in cumulative delay as reward not only converges to a better performance, but also produces a learning curve that decreases orders of magnitude faster. This result shows that the choice of reward function is one of the most critical implementation decisions when designing a reinforcement learning controller.

4.2.2    Comparison with fixed-time and rule-based signals

In this section, the true online SARSA(λ) with Fourier basis linear function approximation is compared to fixed-time and rule-based adaptive signals in both set-ups of the single intersection scenario.

Fig. 7 shows the performance regarding average delay and total queue length for the first set-up (constant average flow rates). It can be seen that, for this somewhat homogeneous set-up, RL and Lämmer perform much better than the Webster fixed-time control in terms of average delay and queue length. They also produce less variation in these measures, demonstrating robustness against traffic fluctuations. Note that for constant average flow rates, the fixed-time control used here (optimized by Webster's method) is already quite good. RL is able to outperform the fixed scheme because it appears to be more stable with regard to platoon variations. This can be seen in both plots in Fig. 7, with the standard deviation (shown as the shaded area in the plots) being lower for the RL.

Fig. 8 depicts the effects of more heterogeneous demand (second set-up), where the flow rates are doubled during five time periods over the day (see Sect. 4.1). The RL controller improves its performance from the second peak onwards, as in the first peak it was experiencing an overload situation for the first time. In this more difficult set-up, the difference in performance between RL and Lämmer becomes less visible, with both presenting the same amount of queue length when there is low demand. Additionally, RL decreases the queue lengths faster than Lämmer after the peaks, which indicates that RL better adapts to different vehicle demands.
                                                                                                                                         400
                               150               RL           Laemmer             Webster fixed-time                                                           RL           Laemmer             Webster fixed-time
    Average delay (s/veh)




                                                                                                              Average delay (s/veh)
                               125
                                                                                                                                         300
                               100

                                75                                                                                                       200

                                50
                                                                                                                                         100
                                25

                                 0                                                                                                         0
                                      0           20000        40000           60000       80000                                                    0           20000        40000           60000       80000
                                                             Time of day (s)                                                                                               Time of day (s)

                                     (a) Average delay per vehicle during the simulation.                                                          (a) Average delay per vehicle during the simulation.

                                                 RL           Laemmer             Webster fixed-time                                                           RL           Laemmer             Webster fixed-time
                               200                                                                                                       1200
    Total queue length (veh)




                                                                                                              Total queue length (veh)
                                                                                                                                         1000
                               150
                                                                                                                                          800
                               100                                                                                                        600

                                                                                                                                          400
                                50
                                                                                                                                          200

                                 0                                                                                                             0
                                      0           20000        40000           60000       80000                                                        0        20000        40000          60000       80000
                                                             Time of day (s)                                                                                               Time of day (s)

                                          (b) Total queue length during the simulation.                                                                 (b) Total queue length during the simulation.

Figure 7: Single intersection scenario with constant average flow                                          Figure 8: Single intersection scenario with periodically repeating
rates (first set up of Section 4.1)                                                                        time periods where the average flow rates are doubled (second set
                                                                                                           up of Sect. 4.1)
experiencing an overload situation for the first time. In this more difficult set-up, the difference in performance between RL and Lämmer becomes less visible, with both presenting the same amount of queue length when there is low demand. Additionally, RL decreases the queue lengths faster than Lämmer after the peaks, which indicates that RL better adapts to different vehicle demands.

In summary, despite its slightly lower performance in the first set-up, the RL control is able to handle overload situations and quickly reduces the queues afterwards. Recall that, in contrast to rule-based approaches, the RL control does not require domain knowledge.
5   Conclusion and future work

In the present paper it was shown that specific techniques from RL can help to improve the performance of traffic signal control and even outperform state-of-the-art rule-based adaptive signal control methods. It was argued that tabular RL methods may not be feasible due to the curse of dimensionality. Even when it is possible to employ them, they often need long learning times to converge for realistic intersections with more than two signal phases and when a more complex definition of state is used. Recall that the results presented here show that including more features (i.e., not only queue but also density) played a significant role in the performance.
To address the curse of dimensionality, we used Fourier basis linear function approximation alongside the true online SARSA(λ) algorithm, which, to the authors' best knowledge, had not been used for traffic signal control before. This method was implemented in MATSim and compared to optimal fixed-time and rule-based adaptive signal control in a single intersection scenario in which the demand was varied. It could be seen that our approach outperforms the fixed-time controller and is competitive with the rule-based adaptive controller in terms of average delay and queue length. This kind of comparison with approaches other than fixed-time control is rarely found in the RL literature and is, therefore, a key feature of this work.
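To make the feature construction referred to above concrete, the sketch below shows how Fourier basis features of a given order can be built and combined with per-action weight vectors. It is an illustrative Python fragment with names chosen here for exposition, not the implementation used in MATSim, and it assumes the state has already been normalised to the unit hypercube, as the Fourier basis requires [14].

    import itertools
    import numpy as np

    def fourier_features(state, order):
        # Fourier basis (cf. [14]): one cosine term per coefficient vector
        # c in {0, ..., order}^d, for a state normalised to [0, 1]^d.
        state = np.asarray(state, dtype=float)
        coeffs = np.array(list(itertools.product(range(order + 1),
                                                 repeat=int(state.size))))
        return np.cos(np.pi * coeffs.dot(state))

    def q_value(weights, state, action, order):
        # Linear action-value estimate: one weight vector per signal phase.
        return float(weights[action].dot(fourier_features(state, order)))

    # Example: a 2-dimensional state (e.g. queue and density of a lane),
    # order-3 basis -> (3 + 1)^2 = 16 features per action.
    phi = fourier_features([0.4, 0.7], order=3)

The number of features grows exponentially with the state dimension, which is why a compact state definition matters even with function approximation.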
As a next step, the signal control based on true online SARSA(λ) with Fourier basis linear function approximation will be applied to real-world scenarios using MATSim and compared to the signal control approaches employed here. Given ongoing experimentation, the performance of the RL control looks very promising, which would emphasise its advantage in scenarios that are more challenging. On the one hand, with multiple intersections, one can assess the effects of the interaction of adjacent intersections, e.g., when congestion spills back. On the other hand, this becomes a multiagent RL task, which is known to be more challenging. It will also be investigated whether the RL signal control can be further improved to handle situations of overload even better than the current implementation. Finally, since the choice of reward scheme appears to be a key issue, a possible extension of this work could consider methods for designing a reward function that fits this domain best.

ACKNOWLEDGEMENTS

The authors thank Prof. Kai Nagel for his support and supervision during the internship of Lucas N. Alegre at TU Berlin. Thanks also to Dr. Ricardo Grunitzki for a previous version of the tabular Q-learning code. The authors are grateful to the Deutsche Forschungsgemeinschaft (DFG) for funding the project Optimization and network-wide analysis of traffic signal control, as well as to the Brazilian Research Council (grant no. 307215/2017-2 for Ana Bazzan).

REFERENCES

 [1] Monireh Abdoos, Nasser Mozayani, and Ana L. C. Bazzan, ‘Hierarchical control of traffic signals using Q-learning with tile coding', Appl. Intell., 40(2), 201–213, (2014).
 [2] Leemon Baird, ‘Residual algorithms: Reinforcement learning with function approximation', in Machine Learning Proceedings 1995, 30–37, Morgan Kaufmann, (1995).
 [3] Ana L. C. Bazzan, ‘Opportunities for multiagent systems and multiagent reinforcement learning in traffic control', Autonomous Agents and Multiagent Systems, 18(3), 342–375, (June 2009).
 [4] Lucas Barcelos de Oliveira and Eduardo Camponogara, ‘Multi-agent model predictive control of signaling split in urban traffic networks', Transportation Research Part C: Emerging Technologies, 18(1), 120–139, (2010).
 [5] C. Diakaki, M. Papageorgiou, and K. Aboudolas, ‘A multivariable regulator approach to traffic-responsive network-wide signal control', Control Engineering Practice, 10(2), 183–195, (February 2002).
 [6] Bernhard Friedrich, ‘Adaptive signal control – an overview', in Proc. of the 9th Meeting of the Euro Working Group Transportation, Bari, Italy, (2002).
 [7] Wade Genders and Saiedeh Razavi, ‘Evaluating reinforcement learning state representations for adaptive traffic signal control', Procedia Computer Science, 130, 26–33, (2018). The 9th International Conference on Ambient Systems, Networks and Technologies (ANT 2018).
 [8] D. Grether, J. Bischoff, and K. Nagel, ‘Traffic-actuated signal control: Simulation of the user benefits in a big event real-world scenario', in 2nd International Conference on Models and Technologies for ITS, Leuven, Belgium, (2011).
 [9] D. Grether and T. Thunig, ‘Traffic signals and lanes', in Horni et al. [12], chapter 12.
[10] D. S. Grether, Extension of a Multi-Agent Transport Simulation for Traffic Signal Control and Air Transport Systems, Ph.D. dissertation, TU Berlin, Berlin, 2014.
[11] Ricardo Grunitzki, Bruno C. da Silva, and Ana L. C. Bazzan, ‘Towards designing optimal reward functions in multi-agent reinforcement learning problems', in Proc. of the 2018 International Joint Conference on Neural Networks (IJCNN 2018), Rio de Janeiro, (Jul 2018).
[12] The Multi-Agent Transport Simulation MATSim, eds., A. Horni, K. Nagel, and K. W. Axhausen, Ubiquity, London, 2016.
[13] P. B. Hunt, D. I. Robertson, R. D. Bretherton, and R. I. Winton, ‘SCOOT – A Traffic Responsive Method of Coordinating Signals', TRRL Laboratory Report 1014, TRRL, Crowthorne, Berkshire, UK, (1981).
[14] George Konidaris, Sarah Osentoski, and Philip Thomas, ‘Value function approximation in reinforcement learning using the Fourier basis', in Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, AAAI'11, pp. 380–385, AAAI Press, (2011).
[15] Nico Kühnel, Theresa Thunig, and Kai Nagel, ‘Implementing an adaptive traffic signal control algorithm in an agent-based transport simulation', Procedia Computer Science, 130, 894–899, (2018).
[16] S. Lämmer and D. Helbing, ‘Self-control of traffic lights and vehicle flows in urban road networks', Journal of Statistical Mechanics: Theory and Experiment, 2008(04), 04–019, (2008).
[17] P. Lowrie, ‘The Sydney coordinate adaptive traffic system – principles, methodology, algorithms', in Proceedings of the International Conference on Road Traffic Signalling, Sydney, Australia, (1982).
[18] Patrick Mannion, Jim Duggan, and Enda Howley, ‘An experimental review of reinforcement learning algorithms for adaptive traffic signal control', in Autonomic Road Transport Support Systems, eds., Thomas Leo McCluskey, Apostolos Kotsialos, Jörg P. Müller, Franziska Klügl, Omer Rana, and René Schumann, 47–66, Springer International Publishing, Cham, (May 2016).
[19] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller, ‘Playing Atari with deep reinforcement learning', in NIPS Deep Learning Workshop, (2013).
[20] K. J. Prabuchandran, A. N. Hemanth Kumar, and Shalabh Bhatnagar, ‘Decentralized learning for traffic signal control', in Proceedings of the 7th International Conference on Communication Systems and Networks (COMSNETS), pp. 1–6, (2015).
[21] L. A. Prashanth and Shalabh Bhatnagar, ‘Reinforcement learning with average cost for adaptive control of traffic lights at intersections', in Proceedings of the 14th International Conference on Intelligent Transportation Systems (ITSC), pp. 1640–1645, IEEE, (2011).
[22] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA, 1998.
[23] Theresa Thunig, Nico Kühnel, and Kai Nagel, ‘Adaptive traffic signal control for real-world scenarios in agent-based transport simulations', Transportation Research Procedia, 37, 481–488, (2019).
[24] Elise van der Pol, Deep Reinforcement Learning for Coordination in Traffic Light Control, Ph.D. dissertation, University of Amsterdam, 2016.
[25] Harm van Seijen, Ashique Rupam Mahmood, Patrick Pilarski, Marlos Machado, and Richard Sutton, ‘True online temporal-difference learning', Journal of Machine Learning Research, 17, 1–40, (2016).
[26] Harm van Seijen and Richard S. Sutton, ‘True online TD(λ)', in Proceedings of the 31st International Conference on Machine Learning (ICML'14), Volume 32, pp. I–692–I–700, JMLR.org, (2014).
[27] Christopher J. C. H. Watkins and Peter Dayan, ‘Q-learning', Machine Learning, 8(3), 279–292, (1992).
[28] F. V. Webster, ‘Traffic signal settings', Technical Report 39, Road Research Laboratory, London, (1958).
[29] Hua Wei, Guanjie Zheng, Vikash V. Gayah, and Zhenhui Li, ‘A survey on traffic signal control methods', CoRR, abs/1904.08117, (2019).
[30] Kok-Lim Alvin Yau, Junaid Qadir, Hooi Ling Khoo, Mee Hong Ling, and Peter Komisarczuk, ‘A survey on reinforcement learning models and algorithms for traffic signal control', ACM Comput. Surv., 50(3), (2017).
[31] Rusheng Zhang, Akihiro Ishikawa, Wenli Wang, Benjamin Striner, and Ozan K. Tonguz, ‘Partially observable reinforcement learning for intelligent transportation systems', CoRR, abs/1807.01628, (2018).
[32] D. Ziemke, I. Kaddoura, and K. Nagel, ‘The MATSim Open Berlin Scenario: A multimodal agent-based transport simulation scenario based on synthetic demand modeling and open data', Procedia Computer Science, 151, 870–877, (2019).