=Paper=
{{Paper
|id=Vol-2701/paper_2
|storemode=property
|title=A Reinforcement Learning Approach with Fourier Basis Linear Function Approximation for Traffic Signal Control
|pdfUrl=https://ceur-ws.org/Vol-2701/paper_2.pdf
|volume=Vol-2701
|authors=Theresa Ziemke,Lucas N. Alegre,Ana L. C. Bazzan
|dblpUrl=https://dblp.org/rec/conf/ecai/ZiemkeAB20
}}
==A Reinforcement Learning Approach with Fourier Basis Linear Function Approximation for Traffic Signal Control==
Theresa Ziemke (Technische Universität Berlin, Germany, email: tziemke@vsp.tu-berlin.de), Lucas N. Alegre and Ana L. C. Bazzan (Universidade Federal do Rio Grande do Sul, Brazil, email: {lnalegre,bazzan}@inf.ufrgs.br)

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. Reinforcement learning is an efficient, widely used machine learning technique that performs well when the state and action spaces are reasonable. This is rarely the case regarding control-related problems, as for instance controlling traffic signals. Here, the state space can be very large. In order to deal with the curse of dimensionality, a rough discretization of such space can be used, but this is effective just up to a certain point. A way to mitigate this is to use techniques that generalize the state space such as function approximation. In this paper, a linear function approximation is used. Specifically, SARSA(λ) with Fourier basis features is implemented to control traffic signals in the agent-based transport simulation MATSim. The results are compared not only to trivial controllers such as fixed-time, but also to state-of-the-art rule-based adaptive methods. It is concluded that SARSA(λ) with Fourier basis features is able to outperform such methods, especially in scenarios with varying traffic demands.

1 Introduction

Traffic signal control is a challenging real-world problem. Current solutions to this problem, such as adaptive systems like SCOOT [13], are often centralized, or at least partially centralized if each controller is in charge of a portion of the urban network. Alternatives are manual interventions from traffic operators or the use of fixed-time signal plans. However, in the era of big data and advanced computing power, other paradigms are becoming more and more prominent. Among these we find those derived from machine learning in general, and reinforcement learning (RL) in particular. In RL, traffic signal controllers located at intersections can be seen as autonomous agents that learn while interacting with the environment.

The use of RL is associated with challenging issues: the environment is dynamic (and thus agents must be highly adaptive), and agents must react to changes in the environment at the individual level while also causing an unpredictable collective pattern, as they act in a coupled environment. Therefore, traffic signal control poses many challenges for standard techniques of multiagent learning.

To understand these challenges, let us first discuss the single-agent case, where one agent performs an action once in a given state, and learns by getting a signal (reward) from the environment. To put it simply, RL techniques are based on estimates of values for state-action pairs (the so-called Q-values). These values may be represented as a table with one entry for each state-action pair. This works well in single-agent problems and/or when the number of states and actions is small. However, in [22] Sutton and Barto discuss two drawbacks of this approach: First, a lot of memory is necessary to keep large tables when the number of state-action pairs is huge, which tends to be the case in real-world applications. Second, a long exploration time is required to fill such tables accurately. Those authors then suggest that generalization techniques may help in addressing this so-called curse of dimensionality.
An efficient representation of the states is a key factor that may limit the use of standard RL algorithms in problems that involve several agents. Moreover, in scenarios in which the states are represented as continuous values, estimation of the state value by means of tabular Q-values may not be feasible. To deal with this problem, in this paper a true online SARSA(λ) algorithm with Fourier basis linear function approximation is used. As discussed ahead, this option is based on the fact that non-linear function approximation has several drawbacks.

The RL-based adaptive signal control algorithm was implemented in the open-source agent-based transport simulation MATSim [12]. In MATSim, it is possible to investigate the impact of the RL-based adaptive signal control algorithm and compare it to other fixed-time or adaptive signal control methods. For comparison, we run our approach against a rule-based adaptive signal control algorithm based on Lämmer and Helbing [16], which was implemented in MATSim in a previous study [15, 23]. The results show that the RL-based approach is able to outperform these approaches in a single intersection scenario. This is especially notable, as these approaches were designed specifically for dealing with the control of signals, whereas the RL-based approach needs no domain knowledge. To the authors' best knowledge, virtually no other work in the literature (especially those stemming from the RL area) includes this kind of comparison. More often than not, RL approaches are compared only to a fixed-time scheme.

The remainder of this paper is organized as follows. The next section discusses background and related work. Sect. 3 describes the two adaptive signal control approaches used in this study: the rule-based signal control based on Lämmer and the RL-based approach. Experiments and results are presented in Sect. 4, whereas Sect. 5 contains a discussion of the results and future work.

2 Background and related work

2.1 Traffic signal control

In contrast to fixed-time signals that cyclically repeat a given signal plan, traffic-responsive signals react to current traffic by adjusting signal states based on sensor data (e.g., from upstream induction loops or cameras). They can, therefore, react to changes in demand and reduce emissions and waiting times more efficiently.

A variety of traffic-responsive signal control algorithms have been developed. An overview is given, e.g., by Friedrich [6]. Different levels of adjustment are distinguished: actuated signals use a fixed-time base plan and adjust parameters like green split, cycle time or offset. (Fully) adaptive signals decide about the signal states on the fly. They can modify phase orders or even combine signals into different phases over time. With this, the flexibility of the signal optimization is augmented, which increases the possible improvement, but makes the optimization problem more complex. In order to reduce complexity and communication effort between sensors and a central computation unit (which controls signal states system-wide), decentralized (also called self-controlled) methods decide locally about signal states without complete knowledge of the system. Usually, every signalized intersection has its own processing unit that accounts for upstream (and sometimes downstream) sensor data of all approaches. A challenge of decentralized systems is to still ensure system-wide stability, especially when dealing with oversaturated conditions. A number of methods have been developed that tackle these challenges.

Examples of traffic-responsive approaches from various generations and technological bases are: SCOOT [13]; SCATS [17]; TUC (Traffic-responsive Urban Traffic Control) [5]; and TUC combined with predictive control [4]. Some can be considered rule-based (as for example Lämmer and Helbing [16]), while others use techniques from RL and model signals as learning agents (see Sect. 2.2).
2.2 Reinforcement learning

In RL, an agent's goal is to learn an optimal control policy π*, which maps a given state to the best appropriate action by means of a value function. We can model RL as a Markov decision process (MDP) composed of a tuple (S, A, T, R), where S is a set of states; A is a set of actions; T is the transition function that models the probability of the system moving from a state s ∈ S to a state s' ∈ S upon performing action a ∈ A; and R is the reward function that yields a real number associated with performing an action a ∈ A when one is in state s ∈ S. An experience tuple ⟨s, a, s', r⟩ denotes the fact that the agent was in state s, performed action a and ended up in s' with reward r. Let t denote the t-th step in the policy π. In an infinite-horizon MDP, the cumulative reward in the future under policy π is defined by the Q-function, Eq. 1, where γ ∈ [0, 1] is the discount factor for future rewards.

Q^π(s, a) = E[ Σ_{τ=0}^{∞} γ^τ r_{t+τ} | s_t = s, a_t = a, π ]    (1)

Since the agent's objective is to maximize the cumulative reward, if it has learned the optimal Q-values Q*(s, a) for all state-action pairs, then the optimal control policy π* is as follows (for convergence guarantees in the case of Q-learning (QL), see [27]):

π*(s) = argmax_a Q*(s, a)   ∀s ∈ S, a ∈ A.    (2)

RL methods can be divided into two categories: Model-based methods assume that the transition function T and the reward function R are available, or instead try to learn them. Model-free methods, on the other hand, do not require that the agent have access to information about how the environment works.

There are many studies that use RL to improve traffic signal performance. Due to space restrictions, we refer the reader to some survey papers (with different purposes): [3, 18, 29, 30].

Using RL for traffic signal control is especially promising, as one does not need a lot of domain knowledge (as opposed to, e.g., rule-based approaches); rather, the controller learns a policy by itself. However, issues may arise with the aforementioned curse of dimensionality. In fact, depending on the specific formulation (e.g., how state and action spaces are defined), the search space can be very large. For instance, consider an intersection with four incoming approaches with three lanes per approach. If we define the state as the queue length for each lane discretized into 10 levels, we would have 10^(4×3) distinct possible states. The reader is referred to [30] for several variants of such formulations.

In [18, 20, 21], RL is used by traffic signals in order to learn a policy that maps states (normally queues at junctions) to actions (normally keeping/changing the current split of green times among the lights of each phase). In [21] the approach is centralized (a single entity holds the MDP for all traffic signals); a central authority receives information about the length of the queues and elapsed time from various lanes to make a decision about timings at each signal. On the other hand, the approaches in [18] and [20] are decentralized: each junction learns independently (normally using QL).

Since most of these works use QL, and thus approximate the Q-function as a table, they may fall prey to the curse of dimensionality. This arises when one deals with realistic scenarios, e.g., those beyond the 2-phase intersections that are common in the literature.

In order to address this, a few works have used function approximation. For instance, [1] uses tile coding for function approximation. However, the definition of states only considers queue length.

Recently, many studies have achieved impressive results using deep neural networks to approximate the Q-function (e.g., DQN [19, 24, 31]). However, linear function approximation has guaranteed convergence and error bounds, whereas non-linear function approximation is known to diverge in multiple cases [2, 22]. Moreover, linear function approximation is less computation-intensive, as it relies on significantly fewer parameters. Thus, if the Q-function can be linearly approximated with sufficient precision, linear function approximation methods are preferable.
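To get a feeling for the numbers involved, the following back-of-the-envelope sketch (our illustration, not taken from the paper) compares the number of Q-table entries the 12-lane example above would require with the number of weights used by an order-n Fourier basis approximation of the kind introduced in Sect. 3.2, with the coefficient restriction applied in Sect. 4.2.1; the number of signal phases is an assumed value.

```python
# Rough size comparison (illustration, not from the paper): Q-table entries for
# the 12-lane example above vs. weights of an order-n Fourier basis approximation.
from math import comb

lanes = 4 * 3           # four incoming approaches, three lanes each
queue_levels = 10       # queue length per lane discretized into 10 levels
n_phases = 3            # assumed number of signal phases (= actions)

tabular_states = queue_levels ** lanes       # 10^12 distinct states
tabular_entries = tabular_states * n_phases  # one Q-value per state-action pair

order_n = 7
state_dim = lanes       # one continuous feature per lane, scaled to [0, 1]
# Coefficient vectors restricted to at most two non-zero entries (Sect. 4.2.1):
n_basis = 1 + state_dim * order_n + comb(state_dim, 2) * order_n**2
linear_weights = n_basis * n_phases          # one weight vector per action

print(f"tabular Q-table entries:      {tabular_entries:,}")  # 3,000,000,000,000
print(f"linear-approximation weights: {linear_weights:,}")   # 9,957
```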
2.3 Transport simulation

As deployment, operation, and maintenance costs of traffic-responsive signals are in general high, transport simulation tools provide a perfect environment to systematically test and evaluate new signal control methods before applying them in the field.

The agent-based transport simulation MATSim [12], which is used in this study, is especially suitable in this regard, as it is able to run large-scale real-world simulations in reasonable time. Simulations can be built based on open data (see, e.g., the Open Berlin scenario [32]) such that the impact of new signal control approaches can be easily analyzed for arbitrary scenarios and compared to other control methods. (An example of how to start a MATSim simulation using the RL signal control presented in this paper can be found at http://matsim.org/javadoc → signals → RunSarsaLambdaSignalsExample.) Because of its agent-based structure, agent-specific waiting times and varying queue lengths over time at traffic lights can be directly analyzed and compared.

In MATSim, traffic is modeled by agents (i.e., persons) that follow a daily plan of activities and trips. Traffic flow is modelled mesoscopically by spatial first-in-first-out (FIFO) queues. Vehicles at the head of a queue can leave a link when the following criteria are fulfilled: (1) the link's free-flow travel time has passed, (2) the flow capacity of the link is not exceeded in the given time step, and (3) there is enough space on the next link. Despite this simplistic modeling approach, congestion as well as spillback can be modeled.
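To make the queue model concrete, here is a minimal sketch of such a FIFO link with the three leave criteria. It is an illustration under simplified assumptions (names, units, and the green-signal flag are ours), not the actual MATSim (Java) implementation.

```python
# Minimal sketch of a MATSim-style link FIFO queue and its three leave criteria.
# Illustration only; names and units are simplified assumptions.
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Link:
    free_flow_tt: float        # free-flow travel time [s]
    flow_capacity: float       # max. vehicles that may leave per time step
    storage_capacity: int      # max. vehicles the link can hold
    queue: deque = field(default_factory=deque)   # entries: (vehicle_id, entry_time)

    def has_space(self) -> bool:
        return len(self.queue) < self.storage_capacity

def move_vehicles(link: Link, next_link: Link, now: float, signal_is_green: bool) -> list:
    """Move vehicles from the head of `link` to `next_link` within one time step."""
    moved, outflow = [], 0
    while link.queue and signal_is_green:
        veh, entry_time = link.queue[0]
        if now - entry_time < link.free_flow_tt:   # (1) free-flow travel time not yet passed
            break
        if outflow >= link.flow_capacity:          # (2) flow capacity exhausted this time step
            break
        if not next_link.has_space():              # (3) no space downstream (spillback)
            break
        link.queue.popleft()
        next_link.queue.append((veh, now))
        moved.append(veh)
        outflow += 1
    return moved
```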
The traffic signal control module was developed by Grether as an extension to MATSim [10]. If a signal exists on a link, leaving the link is not possible while it shows red. First studies focused on fixed-time signals, but approaches for traffic-responsive signal control have also been implemented [8, 15, 23]. Separate waiting queues at intersections can be modeled in MATSim by lanes (see Fig. 1), which is especially useful to model protected left turns. Signals and lanes in MATSim are described more extensively by Grether and Thunig [9].

Figure 1: Links with multiple lanes in MATSim: (a) graph structure of a link with multiple lanes; (b) multiple queues; spillback is captured correctly. Each lane is represented by its own FIFO queue. Traffic signal control for different turning moves is captured. Vehicles on different lanes can pass each other, unless the queue spills over. Source: [9]

Events of vehicles entering or leaving links and lanes are thrown at a second-by-second time resolution in the simulation. Sensors on links or lanes that detect single vehicles can be easily modeled by listening to these events. As in reality, the maximum forecast period of such sensors is limited – vehicles can only be detected when they have entered the link. If a link is short, forecasts might not be accurate. In the simulation, responsive signals use these sensor data to react dynamically to approaching vehicles. For every signalized intersection, the control unit is called every second to decide about current signal states. With that, RL-based signal control approaches can also be easily installed into the simulation framework.

In general, MATSim can model user reaction as route, mode or departure time changes. But for this paper, only the traffic flow simulation of MATSim is used. Readers interested in the evolutionary part of MATSim – i.e., how agents adapt their plans and how long-term effects can be analyzed – are referred to [12].

3 Methods

In this section, the two adaptive signal control approaches used in this study are presented: the rule-based signal control algorithm based on Lämmer and the RL approach based on true online SARSA(λ) with linear function approximation.

3.1 Lämmer's rule-based adaptive traffic signal control algorithm

The idea of the self-controlled signals proposed by Lämmer and Helbing [16] is to minimize waiting times and queue lengths at decentralized intersections while also granting stability through minimal service intervals. The algorithm combines two strategies. The optimizing strategy selects the signal phase i to be served next as the one with the highest priority index π_i (see Eq. 3), which takes into account outflow rates and queue lengths of waiting and approaching vehicles that are registered by sensors. Given a prediction of the expected queue length n̂_i(t, τ) at time τ > t and the maximum outflow rate q_i^max for phase i, one can derive the expected required green time ĝ_i(t, τ) for clearing the queue at time t using ĝ_i(t, τ) = n̂_i(t, τ) / q_i^max. With this, the priority index is calculated as follows:

π_i(t) = max_{τ_i(t) ≤ τ ≤ τ_i^0} [ n̂_i(t, τ) / (τ + ĝ_i(t, τ)) ],   if i = σ(t)
π_i(t) = n̂_i(t, τ_i^0) / (τ_σ(t)^pen(t) + τ_i^0 + ĝ_i(t, τ_i^0)),   if i ≠ σ(t)    (3)

Two cases are distinguished depending on whether phase i is already active or not. In either case, the equation basically divides the number of vehicles by the time needed to clear the queue including the (remaining) intergreen time. The priority index can, therefore, be interpreted as a clearance efficiency rate. τ includes either the effect of remaining intergreen time for the selected phase (when it has not yet switched to green), or a lookahead beyond the end of the current queue. It is bounded from below by the remaining intergreen time τ_i(t), since that time, if larger than zero, will be incurred before traffic can flow, and from above by the full intergreen time τ_i^0, since beyond that it is possible to just switch back from some other state. For a non-active phase (i.e., i ≠ σ(t)), the priority index is reduced by a canceling penalty τ_σ(t)^pen(t). This prevents the optimizing regime from frequently switching signal phases. The penalty can be interpreted as the average additional waiting time for vehicles at the previously served links that would occur upon cancellation. The priority index as defined in Eq. 3 assumes that each signal phase only serves one link – which is why phases and links are both denoted by i here.

The algorithm was further extended to be able to deal with realistic traffic situations like lanes, phase combinations with opposing traffic, minimum green times, and overload. Since these extensions make the equation less readable while the main method stays the same, the authors refer to Thunig et al. [23] for more details.

An enclosing stabilizing strategy ensures that each link is served at least once during a specified minimal service interval to prevent spillbacks. Links that have to be stabilized are added to a stabilization queue. If the queue is non-empty, the phase corresponding to the first element of the queue is switched to green for a guaranteed green time g_i^s depending on the average capacity utilization. If the stabilization queue is empty, the optimizing strategy takes over. Lämmer's control claims to provide intrinsic green waves and locally optimal service, which also results in system-wide optimal service.

An assumption of Lämmer's algorithm is the queue representation of traffic flow: if a link i is served, vehicles can leave the link with a constant outflow rate q_i^max, which is assumed to be known. Additionally, queues are assumed to be non-spatial, i.e., the algorithm does not account for vehicles spilling back to upstream lanes or links. Demand is supposed to be manageable on average with the desired cycle time T to ensure stability.

Two sensors are used to predict the number of waiting vehicles per link and time. One is positioned at the end of the link to detect waiting and outflowing vehicles; the second one is located further upstream to detect approaching vehicles. Assuming free-flow conditions at link i, one can estimate the length of the queue n_i(t) at time t and predict the expected queue length n̂_i(t, τ) at a time τ > t. While the estimation of queue lengths allows uncertainty, the mere presence of a queue is definite.
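The following sketch illustrates how the priority index of Eq. 3 could be evaluated under the single-link-per-phase assumption stated above; the queue forecast n_hat and all parameter values are illustrative stand-ins for the sensor-based prediction described in the text, not the authors' implementation.

```python
# Illustrative evaluation of the priority index pi_i(t) from Eq. 3, assuming
# each phase serves exactly one link. The queue forecast n_hat is a stand-in
# for the sensor-based prediction described above.
def expected_green_time(n_hat, t, tau, q_max):
    """g_i(t, tau): time needed to clear the predicted queue at outflow rate q_max."""
    return n_hat(t, tau) / q_max

def priority_index(i, t, n_hat, q_max, sigma, tau_remaining, tau_full, tau_penalty, step=1.0):
    """Clearance efficiency rate of phase i at time t (Eq. 3)."""
    if i == sigma:
        # active phase: search tau between the remaining and the full intergreen time
        best, tau = 0.0, tau_remaining
        while tau <= tau_full:
            g = expected_green_time(n_hat, t, tau, q_max)
            if tau + g > 0:
                best = max(best, n_hat(t, tau) / (tau + g))
            tau += step
        return best
    # non-active phase: full intergreen time plus the cancellation penalty
    g = expected_green_time(n_hat, t, tau_full, q_max)
    return n_hat(t, tau_full) / (tau_penalty + tau_full + g)

# Example: constant forecast of 8 waiting vehicles, outflow rate 0.5 veh/s,
# 5 s intergreen time, 3 s cancellation penalty for the currently served phase.
queue_forecast = lambda t, tau: 8.0
print(priority_index(0, 0.0, queue_forecast, 0.5, sigma=0,
                     tau_remaining=0.0, tau_full=5.0, tau_penalty=3.0))
print(priority_index(1, 0.0, queue_forecast, 0.5, sigma=0,
                     tau_remaining=0.0, tau_full=5.0, tau_penalty=3.0))
```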
3.2 True online SARSA(λ) traffic signal controller

The proposed RL traffic signal controller implements the true online SARSA(λ) algorithm [26], a modification of the traditional SARSA(λ) that was demonstrated to have better theoretical properties and to outperform the original method [25]. As detailed later, we use two kinds of features, which impacts the state space. In order to deal with high-dimensional state spaces, the Q-function is linearly approximated using the Fourier basis scheme [14].

When linear approximation is used, the Q-values Q(s, a) for each discrete action a are approximated as a weighted sum of a set of m basis functions φ_1, ..., φ_m, as in Eq. 4, where θ is the vector of weights.

Q(s, a) = θ · φ(s, a) = Σ_{i=1}^{m} θ_i φ_i(s, a)    (4)

The Fourier series is one of the most commonly used continuous function approximation methods, presenting solid theoretical foundations. In [14], it was empirically shown that the Fourier basis outperforms other commonly used approximation methods such as polynomial and radial basis functions in continuous RL domains.

When applying the Fourier series to the RL setting, it is possible to drop the sin terms of the series (for a detailed explanation, see [14]). Then, for an n-th order Fourier approximation, each basis function φ_i is defined as in Eq. 5, where c^i = [c_1, ..., c_k] is a vector that attaches an integer coefficient c_{1≤j≤k} ∈ [0, ..., n] to each feature in s, and k is the dimension of the state space.

φ_i(s, a) = cos(π c^i · s), if a = a_t;   0, if a ≠ a_t    (5)

The basis set of functions φ_1, ..., φ_m is obtained by systematically varying these coefficients. Each coefficient determines the basis function's frequency along its dimension. Furthermore, as we increase the order n of the approximation, more frequencies are used.
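The sketch below shows how such a basis can be generated and evaluated (cosine terms only). It already applies the restriction to coefficient vectors with at most two non-zero entries introduced later in Sect. 4.2.1; it is an illustration under our own assumptions, not the authors' implementation.

```python
# Sketch of the Fourier basis features of Eq. 5 (cosine terms only).
# Coefficient vectors are restricted to at most two non-zero entries, as in
# Sect. 4.2.1. Keeping one weight vector per action (see the next sketch) is
# equivalent to zeroing the features of non-selected actions.
import itertools
import numpy as np

def fourier_coefficients(state_dim: int, order: int, max_nonzero: int = 2) -> np.ndarray:
    """All integer coefficient vectors c in {0,...,order}^state_dim
    with at most `max_nonzero` non-zero entries."""
    coeffs = [np.zeros(state_dim)]
    for k in range(1, max_nonzero + 1):
        for dims in itertools.combinations(range(state_dim), k):
            for values in itertools.product(range(1, order + 1), repeat=k):
                c = np.zeros(state_dim)
                c[list(dims)] = values
                coeffs.append(c)
    return np.array(coeffs)

def fourier_features(state: np.ndarray, coeffs: np.ndarray) -> np.ndarray:
    """phi_i(s) = cos(pi * c^i . s) for a state scaled to [0, 1]^k."""
    return np.cos(np.pi * coeffs @ state)

# Example: 7th-order basis for a 5-dimensional (toy) state
C = fourier_coefficients(state_dim=5, order=7)
phi = fourier_features(np.array([0.2, 0.0, 0.7, 1.0, 0.5]), C)
print(C.shape[0], "basis functions")   # 1 + 5*7 + C(5,2)*7^2 = 526
```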
all links of the intersection and L is the set of all approaching lanes, When linear approximation is used, the Q-values Q(s, a) for each ρi ∈ {0, 1} is a binary feature active when i is the current selected discrete action a are approximated as a weighted sum of a set of signal phase, τ ∈ [0, 1] is the elapsed time of the current signal phase m basis functions φ1 , ..., φm , as in Eq. 4, where θ is the vector of divided by the maximal green time gmax , the density ∆e ∈ [0, 1] is weights. defined as the number of vehicles on link e ∈ E divided by it’s m storage capacity and ql ∈ [0, 1] is defined as the number of queued X Q(s, a) = θ · φ(s, a) = θi φi (s, a) (4) i=1 vehicles on lane l ∈ L divided by the storage capacity of the lane. The Fourier series is one of the most commonly used continuous function approximation methods, presenting solid theoretical foun- st = [ρ1 , ..., ρ|σ| , τ, q1 , ..., q|L| , ∆1 , ..., ∆|E| ] (8) dations. In [14], it was empirically shown that Fourier basis outper- This state definition is inspired by [7], where authors achieved similar forms other commonly used approximations methods such as poly- performance levels, even when using more complex state definitions nomial and radial basis functions in continuous RL domains. (e.g., including positions of each vehicle in the approaching lanes). When applying Fourier series to the RL setting, it is possible to As common in the literature, the proposed RL signal control is drop the sin terms of the series5 . Then, for a nth order Fourier ap- only called every three seconds. This means, that one time step for proximation, each basis function φi is defined as in Eq. 5, where the traffic signal agent corresponds to three seconds of simulation. ci = [c1 , ..., ck ] is a vector that attaches an integer coefficient This reduces the complexity and the size of the state space, without c1≤j≤k ∈ [0, ..., n] to each feature in s, and k is the dimension of significantly reducing the performance. the state space. ( cos(πci · s), if a = at Action Space At each time step t (every three seconds), the traffic φi (s, a) = (5) signal controller chooses a discrete action at ∈ A. In our setting, the 0, if a 6= at number of actions is equal to the number of possible signal phases, The basis set of functions φ1 , ..., φm is obtained by systematically therefore, |A| = |σ|. There are two restrictions in the action selec- varying these coefficients. Each coefficient determines the basis tion: the agent can change the current active signal phase only if the function’s frequency along its dimension. Furthermore, as we in- elapsed time is greater or equal than the minimal green time gmin crease the order n of the approximation, more frequencies are used. and keep it only if the elapsed time is less than the maximal green After the execution of action at , the weights θ are updated via time gmax . These restrictions ensure the feasibility of the signal con- gradient descent, following the true online SARSA(λ) with linear troller for real-world applications. function approximation update rule, as in Eq. 6, where δ = rt + γQ(st+1 , at+1 )−Q(st , at ) is the temporal difference error and Qold Reward After taking action at , the traffic signal controller re- is a scalar temporary variable initialized with zero and set to Qold ← ceives a scalar reward rt ∈ R. As in [7] the reward is defined as Q(st+1 , at+1 ) after every step. the change in cumulative delay, as given in Eq. 
Next we give the formulations that are specific to the domain of signal control.

State space. In RL problems, the definition of the state space strongly influences the agents' behavior and performance. In traffic signal control, for instance, information related to the level of congestion in the approaching lanes is fundamental in order to appropriately choose the next active signal phase.

In the present setting, the agent observes a vector s_t ∈ R^k at each time step t. This vector partially represents the true state of the controlled intersection and is defined as in Eq. 8, where E is the set of all links of the intersection and L is the set of all approaching lanes, ρ_i ∈ {0, 1} is a binary feature that is active when i is the currently selected signal phase, τ ∈ [0, 1] is the elapsed time of the current signal phase divided by the maximal green time g_max, the density Δ_e ∈ [0, 1] is defined as the number of vehicles on link e ∈ E divided by its storage capacity, and q_l ∈ [0, 1] is defined as the number of queued vehicles on lane l ∈ L divided by the storage capacity of the lane.

s_t = [ρ_1, ..., ρ_|σ|, τ, q_1, ..., q_|L|, Δ_1, ..., Δ_|E|]    (8)

This state definition is inspired by [7], where the authors achieved similar performance levels even when using more complex state definitions (e.g., including the positions of each vehicle in the approaching lanes).

As is common in the literature, the proposed RL signal control is only called every three seconds. This means that one time step for the traffic signal agent corresponds to three seconds of simulation. This reduces the complexity and the size of the state space without significantly reducing the performance.

Action space. At each time step t (every three seconds), the traffic signal controller chooses a discrete action a_t ∈ A. In our setting, the number of actions is equal to the number of possible signal phases, therefore |A| = |σ|. There are two restrictions in the action selection: the agent can change the currently active signal phase only if the elapsed time is greater than or equal to the minimal green time g_min, and can keep it only if the elapsed time is less than the maximal green time g_max. These restrictions ensure the feasibility of the signal controller for real-world applications.

Reward. After taking action a_t, the traffic signal controller receives a scalar reward r_t ∈ R. As in [7], the reward is defined as the change in cumulative delay, as given in Eq. 9, where D_{a_t} and D_{a_{t+1}} represent the cumulative delay at the intersection before and after executing the action a_t.

r_t = D_{a_t} − D_{a_{t+1}}    (9)

In turn, the cumulative vehicle delay D_t, for any time t, is computed as in Eq. 10, where V_t is the set of vehicles on incoming approaches and d_v^t is the delay of vehicle v at time t.

D_t = Σ_{v ∈ V_t} d_v^t    (10)
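As an illustration of how these definitions fit together, the sketch below assembles the observation of Eq. 8, enforces the g_min/g_max restrictions on the action choice, and computes the reward of Eq. 9; all variable names and the example numbers are hypothetical stand-ins, not taken from the MATSim implementation.

```python
# Illustrative assembly of the observation (Eq. 8), the action restrictions,
# and the delay-based reward (Eqs. 9-10). Lane/link fields are hypothetical
# stand-ins for the sensor data provided by the simulation.
import numpy as np

def build_state(current_phase, n_phases, elapsed_green, g_max, lane_queues,
                lane_capacities, link_counts, link_capacities):
    rho = np.eye(n_phases)[current_phase]                      # one-hot active phase
    tau = np.array([elapsed_green / g_max])                    # normalized elapsed green time
    q = np.array(lane_queues) / np.array(lane_capacities)      # queued vehicles per lane, in [0, 1]
    delta = np.array(link_counts) / np.array(link_capacities)  # density per link, in [0, 1]
    return np.concatenate([rho, tau, q, delta])

def allowed_actions(current_phase, n_phases, elapsed_green, g_min, g_max):
    if elapsed_green < g_min:      # must keep the phase until the minimal green time
        return [current_phase]
    if elapsed_green >= g_max:     # must switch after the maximal green time
        return [a for a in range(n_phases) if a != current_phase]
    return list(range(n_phases))

def reward(cum_delay_before, cum_delay_after):
    # Eq. 9: positive if the action reduced the cumulative delay (Eq. 10)
    return cum_delay_before - cum_delay_after

s = build_state(current_phase=0, n_phases=3, elapsed_green=9, g_max=30,
                lane_queues=[4, 0, 7], lane_capacities=[20, 20, 20],
                link_counts=[10, 3], link_capacities=[40, 40])
```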
4 Experiments and results

4.1 Scenario

This study focuses on a single intersection scenario with two set-ups, each for a different demand level. In both, the RL control is compared to a fixed-time signal control and a rule-based traffic-responsive signal control based on [16] (as introduced in Sect. 3.1).

Figure 2: Single intersection scenario.

Traffic signals. The single intersection featured here (see Fig. 2) has four incoming approaches. In the horizontal direction, there is a dedicated left-turning lane in each traffic approach, as well as three lanes for straight traffic. In the vertical direction, there are two lanes for straight traffic.

Traffic signals are grouped into three non-conflicting signal phases: straight traffic in the horizontal direction; left-turning traffic in the horizontal direction; and the vertical direction. While switching between two signal phases, there is an all-red period of one second. The minimum green time for a signal phase is five seconds.

The fixed-time control that is used for comparison purposes is optimized by Webster's method [28]. It has a cycle time of 40 seconds and distributes green times according to average flow rates. The traffic-responsive signal approaches do not have a fixed cycle time: for Lämmer's control algorithm, a desired and a maximal cycle time can be defined (for this scenario, 40 and 60 seconds are used, respectively). For the RL control, a maximal green time of 30 seconds per signal phase is used. As mentioned in Sect. 3.2, the RL control is only called every three seconds to decide about new signal states.

Demand. In a first set-up, there is traffic going straight in the horizontal direction, with 1800 vehicles approaching on average per hour in each of the two approaches. In the vertical direction, there are 600 vehicles on average per hour from each side – all going straight. Additionally, there are 180 vehicles on average per hour from both sides in the horizontal direction that want to turn left at the intersection. A period of 86,400 seconds (i.e., one day) is simulated.

In a second set-up, the demand is doubled during five time periods over the day of 2,000 seconds length each, in order to analyze the effect of fluctuating demand on the performance of the RL controller. To be more precise, in the time intervals [0, 2,000), [20,000, 22,000), [40,000, 42,000), [60,000, 62,000), and [80,000, 82,000) the average flow rates in the horizontal direction are 3600 vehicles per hour going straight and 360 vehicles per hour going left per approach, whereas in the vertical direction the average flow rate per approach is 1200 vehicles per hour. During the rest of the simulation, the average flow rates are the same as in the first scenario set-up.

In both set-ups, arrival rates are stochastic: vehicles are inserted as platoons with a platoon size that is exponentially distributed around an expected value of five. The time gap between vehicle platoons is also exponentially distributed; its expected value is the platoon size divided by the average flow value.
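A sketch of how such a stochastic arrival process can be generated is shown below; rounding and seeding details are our assumptions, as the paper does not specify them.

```python
# Sketch of the stochastic arrival process described above: vehicles enter in
# platoons whose size and inter-platoon gap are exponentially distributed.
# Rounding/clipping details are assumptions, not taken from the paper.
import numpy as np

def generate_platoon_arrivals(avg_flow_veh_per_h, mean_platoon_size=5.0,
                              horizon_s=86_400, seed=0):
    rng = np.random.default_rng(seed)
    mean_gap_s = mean_platoon_size / (avg_flow_veh_per_h / 3600.0)  # platoon size / flow
    arrivals, t = [], 0.0
    while t < horizon_s:
        t += rng.exponential(mean_gap_s)                  # gap until the next platoon
        size = max(1, round(rng.exponential(mean_platoon_size)))
        arrivals.extend([t] * size)                       # the platoon enters together
    return arrivals

arrivals = generate_platoon_arrivals(avg_flow_veh_per_h=1800)
print(len(arrivals), "vehicles over one simulated day")   # roughly 43,200 on average
```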
4.2 Results

The proposed method of true online SARSA(λ) with Fourier basis linear function approximation for signal control is applied to the single intersection scenario presented in the previous section and compared to RL signal control methods with other configurations (in Sect. 4.2.1, where our method is compared to a tabular variant), as well as to a fixed-time and a rule-based adaptive signal control approach (in Sect. 4.2.2).

Due to the stochastic arrival rates, the results presented here are averaged over 20 runs with different random seeds, whereby the random seed influences the platoon structure of approaching vehicles (the average flow rate stays the same). The shadowed area in the plots shown ahead depicts the standard deviation regarding average delay or total queue length, accordingly. The lines were smoothed with a moving average window of 300 seconds (i.e., 5 minutes) for better clarity.

4.2.1 Comparison with other RL signal control methods

Here we compare the proposed method with the traditional tabular SARSA(λ) [22], using the first set-up of the scenario presented in Sect. 4.1. We also discuss optimal settings regarding the order of the Fourier basis approximation, state and reward.

Tabular vs. linear SARSA(λ). In order to transform the continuous state space defined in Sect. 3.2 into a discrete state space for the tabular SARSA(λ), the queue q and density Δ attributes were discretized into equally distributed bins/intervals. The binary features ρ_i for each phase are already discrete, and the feature τ has a finite number of possible values, as the elapsed time increases in steps of five seconds; therefore, they did not need to be discretized.

In order to allow a fair comparison, the same discount factor, value of λ and exploration rate were used for both methods. The discount factor was set to γ = 0.95, λ = 0.1, and the exploration rate was set to ε = 0.01 (the latter means that the agent is mostly taking the action with the highest Q-value, but still exploring with a fixed low chance). For the tabular SARSA(λ), a learning rate of α = 0.1 was used, while for true online SARSA(λ) with linear function approximation, α = 10^-6 was used. These values are common in the literature and produced the best results for each method after extensive experimentation with different values.

As the state space in this case is large, and the number of Fourier basis functions grows exponentially with the number of dimensions of the state space, it is necessary to restrict the number of basis functions. We can meet this condition by placing constraints on the coefficient vectors c^i. In this setting, adding coefficients with more than two non-zero elements did not improve the results. Thus, we limited each coefficient vector c^i to have at most two non-zero elements.

In Fig. 3, the average delay per vehicle at each second of the simulation is depicted for true online SARSA(λ) with Fourier basis linear function approximation and for tabular SARSA(λ) with 8 vs. 10 discretization bins of the q and Δ features.

Figure 3: Average delay for tabular and linear function approximation RL implementations.

With 10 bins, the learning is very slow, as the number of discretization bins exponentially increases the size of the state space. Reducing the number of bins to 8 significantly speeds up the learning and reduces the delay. However, by reducing the number of bins, different states (in which different actions are optimal) are perceived as the same, thus leading to sub-optimal performance in the long run.

The usage of function approximation not only avoids the curse of dimensionality, but also introduces generalization, i.e., when updating the Q-function after taking an action in a given state, similar states are also affected and have their Q-values changed. With that, the true online SARSA(λ) with Fourier basis linear function approximation results in a much faster learning curve and overall lower delay values.
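For reference, the following sketch shows the kind of discretization used for the tabular baseline: the continuous q and Δ features are mapped to equally sized bins, while the phase one-hot and the stepwise elapsed-time feature are kept as they are. The exact binning scheme shown here is an assumption for illustration.

```python
# Sketch of the discretization for the tabular SARSA(lambda) baseline: the
# continuous queue (q) and density (delta) features are mapped to equally
# sized bins. The binning scheme is an illustrative assumption.
import numpy as np

def discretize_state(state_vec, n_phases, n_bins=8):
    rho = tuple(int(x) for x in state_vec[:n_phases])           # already discrete (one-hot phase)
    tau = round(float(state_vec[n_phases]), 2)                   # finite set of values (5 s steps)
    continuous = state_vec[n_phases + 1:]                        # q and delta features in [0, 1]
    bins = tuple(int(min(x * n_bins, n_bins - 1)) for x in continuous)
    return rho + (tau,) + bins                                   # hashable key for a Q-table

q_table = {}                                                     # maps (state key, action) -> Q-value
key = discretize_state(np.array([1, 0, 0, 0.33, 0.2, 0.05, 0.9, 0.4, 0.1]), n_phases=3)
q_table[(key, 0)] = 0.0
```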
Order of the Fourier basis approximation. Fig. 4 shows the impact that the value of the Fourier approximation order n has on the agent's performance. As expected, the higher the value of n, the more accurate the approximation of the Q-function. Changing the order from n = 3 to n = 9 results in a notable reduction in average delay; however, when n is sufficiently high (n = 7 and n = 9), there is no further improvement. For this reason, the Fourier approximation order is fixed to n = 7 for all following experiments.

Figure 4: Impact of different values for the order n of the Fourier basis approximation.

State definition. Although the q (queue) features provide the traffic signal control agent with the queue on each lane, the Δ (density) features are also important, as they inform how many vehicles (that may be queued in the following seconds) there are on each link. Fig. 5 shows that, by removing the Δ features from the state definition, the average delay increases. This difference might be even higher in scenarios with very high demand, where a high number of vehicles are moving and approaching the queues.

Figure 5: Impact of state definition.

Reward definition. The definition of the reward function has a high impact on the performance of the RL controller [11]. In Fig. 6, the reward function defined in Sect. 3.2 is compared to another reward function found in the literature [18], defined as the change in total queue length between successive actions. The traffic signal controller using the change in cumulative delay as reward not only converges to a better performance, but also produces a learning curve that decreases orders of magnitude faster. This result shows that the choice of reward function is one of the most critical implementation decisions when designing a reinforcement learning controller.

Figure 6: Impact of reward definition.

4.2.2 Comparison with fixed-time and rule-based signals

In this section, the true online SARSA(λ) with Fourier basis linear function approximation is compared to fixed-time and rule-based adaptive signals in both set-ups of the single intersection scenario.

Fig. 7 shows the performance regarding average delay and total queue length for the first set-up (constant average flow rates). It can be seen that for this somewhat homogeneous set-up, RL and Lämmer perform much better than the Webster fixed-time control in terms of average delay and queue length. They also produce less variation in these measures, demonstrating robustness against traffic fluctuations. Note that for constant average flow rates, the fixed-time control used here (optimized by Webster's method) is already quite good. RL is able to outperform the fixed scheme because it seems to be more stable regarding platoon variations. This can be seen in both plots in Fig. 7, with the standard deviation (shown as the shadowed area in the plots) being lower for the RL.

Figure 7: Single intersection scenario with constant average flow rates (first set-up of Sect. 4.1): (a) average delay per vehicle during the simulation; (b) total queue length during the simulation.

Fig. 8 depicts the effects of more heterogeneous demand (second set-up), where the flow rates are doubled during five time periods over the day (see Sect. 4.1). The RL controller improves its performance from the second peak onwards, as in the first peak it was experiencing an overload situation for the first time. In this more difficult set-up, the difference in performance between RL and Lämmer becomes less visible, with both presenting the same amount of queue length when there is low demand. Additionally, RL decreases the queue lengths faster than Lämmer after the peaks, which indicates that RL adapts better to different vehicle demands.

Figure 8: Single intersection scenario with periodically repeating time periods where the average flow rates are doubled (second set-up of Sect. 4.1): (a) average delay per vehicle during the simulation; (b) total queue length during the simulation.

In summary, despite its slightly lower performance in the first set-up, the RL control is able to handle overload situations and quickly reduces the queues afterwards. Recall that, contrary to rule-based approaches, the RL control does not require domain knowledge.

5 Conclusion and future work

In the present paper it was shown that specific techniques from RL can help to improve the performance of traffic signal control, and even outperform state-of-the-art rule-based adaptive signal control methods. It was argued that tabular RL methods may not be feasible due to the curse of dimensionality. When it is possible to employ them, it is often the case that they need long learning times before convergence in the case of realistic intersections with more than two signal phases and when a more complex definition of state is used. Recall that the results presented here show that including more features (i.e., not only queue but also density) played a significant role in the performance.

To address the curse of dimensionality, we used Fourier basis linear function approximation alongside the true online SARSA(λ) algorithm, which to the authors' best knowledge had not been used for traffic signal control before. This method was implemented in MATSim and compared to optimal fixed-time and rule-based adaptive signal control in a single intersection scenario, in which the demand was varied. It could be seen that our approach outperforms the fixed-time controller and is competitive with the rule-based adaptive controller in terms of average delay and queue length. This kind of comparison with other than fixed-time approaches is rarely found in the RL literature and is, therefore, a key feature of this work.
As a next step, the signal control based on true online SARSA(λ) with Fourier basis linear function approximation will be applied to real-world scenarios using MATSim and compared to the signal control approaches employed here. Given ongoing experimentation, the performance of the RL control looks very promising, which would emphasize its advantage in scenarios that are more challenging. On the one hand, with multiple intersections, one can assess the effects of the interaction of adjacent intersections, e.g., when congestion spills back. From another perspective, this is a multiagent RL task, which is known to be more challenging. Also, it will be investigated whether the RL signal control can be further improved to handle situations of overload even better than the current implementation. Finally, since the issue of which reward scheme to use seems to be a key one, a possible extension of this work could consider using methods for designing a reward function that fits this domain best.

ACKNOWLEDGEMENTS

The authors thank Prof. Kai Nagel for his support and supervision during the internship of Lucas N. Alegre at TU Berlin. Thanks also to Dr. Ricardo Grunitzki for a previous version of the tabular Q-learning code. The authors are grateful to the Deutsche Forschungsgemeinschaft (DFG) for funding the project Optimization and network wide analysis of traffic signal control, as well as to the Brazilian Research Council (grant no. 307215/2017-2 for Ana Bazzan).

REFERENCES

[1] Monireh Abdoos, Nasser Mozayani, and Ana L. C. Bazzan, 'Hierarchical control of traffic signals using Q-learning with tile coding', Appl. Intell., 40(2), 201–213, (2014).
[2] Leemon Baird, 'Residual algorithms: Reinforcement learning with function approximation', in Machine Learning Proceedings 1995, 30–37, Morgan Kaufmann, (1995).
[3] Ana L. C. Bazzan, 'Opportunities for multiagent systems and multiagent reinforcement learning in traffic control', Autonomous Agents and Multiagent Systems, 18(3), 342–375, (June 2009).
[4] Lucas Barcelos de Oliveira and Eduardo Camponogara, 'Multi-agent model predictive control of signaling split in urban traffic networks', Transportation Research Part C: Emerging Technologies, 18(1), 120–139, (2010).
[5] C. Diakaki, M. Papageorgiou, and K. Aboudolas, 'A multivariable regulator approach to traffic-responsive network-wide signal control', Control Engineering Practice, 10(2), 183–195, (February 2002).
[6] Bernhard Friedrich, 'Adaptive signal control – an overview', in Proc. of the 9th Meeting of the Euro Working Group Transportation, Bari, Italy, (2002).
[7] Wade Genders and Saiedeh Razavi, 'Evaluating reinforcement learning state representations for adaptive traffic signal control', Procedia Computer Science, 130, 26–33, (2018). The 9th International Conference on Ambient Systems, Networks and Technologies (ANT 2018).
[8] D. Grether, J. Bischoff, and K. Nagel, 'Traffic-actuated signal control: Simulation of the user benefits in a big event real-world scenario', in 2nd International Conference on Models and Technologies for ITS, Leuven, Belgium, (2011).
[9] D. Grether and T. Thunig, 'Traffic signals and lanes', in Horni et al. [12], chapter 12.
[10] D. S. Grether, Extension of a Multi-Agent Transport Simulation for Traffic Signal Control and Air Transport Systems, Ph.D. dissertation, TU Berlin, Berlin, 2014.
[11] Ricardo Grunitzki, Bruno C. da Silva, and Ana L. C. Bazzan, 'Towards designing optimal reward functions in multi-agent reinforcement learning problems', in Proc. of the 2018 International Joint Conference on Neural Networks (IJCNN 2018), Rio de Janeiro, (Jul 2018).
[12] A. Horni, K. Nagel, and K. W. Axhausen, eds., The Multi-Agent Transport Simulation MATSim, Ubiquity, London, 2016.
[13] P. B. Hunt, D. I. Robertson, R. D. Bretherton, and R. I. Winton, 'SCOOT – A Traffic Responsive Method of Coordinating Signals', TRRL Laboratory Report 1014, TRRL, Crowthorne, Berkshire, UK, (1981).
[14] George Konidaris, Sarah Osentoski, and Philip Thomas, 'Value function approximation in reinforcement learning using the Fourier basis', in Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, AAAI'11, pp. 380–385, AAAI Press, (2011).
[15] Nico Kühnel, Theresa Thunig, and Kai Nagel, 'Implementing an adaptive traffic signal control algorithm in an agent-based transport simulation', Procedia Computer Science, 130, 894–899, (2018).
[16] S. Lämmer and D. Helbing, 'Self-control of traffic lights and vehicle flows in urban road networks', Journal of Statistical Mechanics: Theory and Experiment, 2008(04), 04–019, (2008).
[17] P. Lowrie, 'The Sydney coordinate adaptive traffic system - principles, methodology, algorithms', in Proceedings of the International Conference on Road Traffic Signalling, Sydney, Australia, (1982).
[18] Patrick Mannion, Jim Duggan, and Enda Howley, 'An experimental review of reinforcement learning algorithms for adaptive traffic signal control', in Autonomic Road Transport Support Systems, eds., Thomas Leo McCluskey, Apostolos Kotsialos, P. Jörg Müller, Franziska Klügl, Omer Rana, and René Schumann, 47–66, Springer International Publishing, Cham, (May 2016).
[19] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller, 'Playing Atari with deep reinforcement learning', in NIPS Deep Learning Workshop, (2013).
[20] K. J. Prabuchandran, A. N. Hemanth Kumar, and Shalabh Bhatnagar, 'Decentralized learning for traffic signal control', in Proceedings of the 7th International Conference on Communication Systems and Networks (COMSNETS), pp. 1–6, (2015).
[21] L. A. Prashanth and Shalabh Bhatnagar, 'Reinforcement learning with average cost for adaptive control of traffic lights at intersections', in Proceedings of the 14th International Conference on Intelligent Transportation Systems (ITSC), pp. 1640–1645, IEEE, (2011).
[22] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA, 1998.
[23] Theresa Thunig, Nico Kühnel, and Kai Nagel, 'Adaptive traffic signal control for real-world scenarios in agent-based transport simulations', Transportation Research Procedia, 37, 481–488, (2019).
[24] Elise van der Pol, Deep Reinforcement Learning for Coordination in Traffic Light Control, Ph.D. dissertation, University of Amsterdam, 2016.
[25] Harm van Seijen, Ashique Rupam Mahmood, Patrick Pilarski, Marlos Machado, and Richard Sutton, 'True online temporal-difference learning', Journal of Machine Learning Research, 17, 1–, (09 2016).
[26] Harm van Seijen and Richard S. Sutton, 'True online TD(λ)', in Proceedings of the 31st International Conference on Machine Learning - Volume 32, ICML'14, pp. I-692–I-700, JMLR.org, (2014).
[27] Christopher J. C. H. Watkins and Peter Dayan, 'Q-learning', Machine Learning, 8(3), 279–292, (1992).
[28] F. V. Webster, 'Traffic signal setting', Technical Report 39, Road Research Laboratory, London, (1958).
[29] Hua Wei, Guanjie Zheng, Vikash V. Gayah, and Zhenhui Li, 'A survey on traffic signal control methods', CoRR, abs/1904.08117, (2019).
[30] Kok-Lim Alvin Yau, Junaid Qadir, Hooi Ling Khoo, Mee Hong Ling, and Peter Komisarczuk, 'A survey on reinforcement learning models and algorithms for traffic signal control', ACM Comput. Surv., 50(3), (2017).
[31] Rusheng Zhang, Akihiro Ishikawa, Wenli Wang, Benjamin Striner, and Ozan K. Tonguz, 'Partially observable reinforcement learning for intelligent transportation systems', CoRR, abs/1807.01628, (2018).
[32] D. Ziemke, I. Kaddoura, and K. Nagel, 'The MATSim Open Berlin Scenario: A multimodal agent-based transport simulation scenario based on synthetic demand modeling and open data', Procedia Computer Science, 151, 870–877, (2019).