<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Journal of Information and Data Management 15 (2024) 186-195. doi:10.5753/jidm.2024.3328</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1007/978-3-319-25808-9_4</article-id>
      <title-group>
        <article-title>Learning From Your Virtual Neighbors: A Reinforcement Learning Approach to Traffic Signal Control</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ana L. C. Bazzan</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Henrique U. Gobbi</string-name>
        </contrib>
        <aff>Computer Science, Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre, Brazil</aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <volume>3173</volume>
      <fpage>47</fpage>
      <lpage>66</lpage>
      <abstract>
        <p>Many problems in traffic management and control are inherently distributed and/or require adaptation to the traffic situation. Hence, there is a close relationship to multiagent reinforcement learning. However, using reinforcement learning poses challenges when the state space is large. For instance, the learning task may take a long time. In order to accelerate this process, we allow agents that are similar to transfer experiences. We run experiments using a traffic network in which we vary the traffic situation along time. We compare our approach to the use of Q-learning without transfer of experiences, and show that there is an overall improvement in the number of stopped vehicles.</p>
      </abstract>
      <kwd-group>
        <kwd>Traffic Signal Control</kwd>
        <kwd>Multiagent Reinforcement Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>2.1. Reinforcement learning</title>
        <p>to be followed by the agent. One of these strategies is ε-greedy, where an action is randomly chosen
(exploration) with a probability ε, or, with probability 1−ε, the best known action is chosen, i.e., the one
with the highest value so far (exploitation).</p>
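        <p>As a minimal sketch (in Python, with illustrative action names; not the authors' implementation), the ε-greedy strategy described above can be written as:</p>

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Pick a random action with probability epsilon (exploration),
    otherwise the action with the highest Q-value so far (exploitation)."""
    actions = list(q_values)
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=q_values.get)

# hypothetical Q-values for two signal actions; epsilon = 0 gives pure exploitation
choice = epsilon_greedy({"keep_phase": 0.4, "switch_phase": 0.9}, epsilon=0.0)
```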
        <p>It is assumed that the agent has sensors to determine its current state and can then decide on an
action. The reward is then used to update its policy, i.e., a mapping from states to actions. This policy
can be generated or computed in several ways.</p>
        <p>
          For the sake of the present discussion, we concentrate on a model-free, off-policy algorithm called
Q-learning [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], which estimates so-called Q-values using a table to store the experienced values of
performing a given action when in a given state. Hence Q-learning is a tabular method, where the state
space and the action space need to be discretized.
        </p>
        <p>In RL, the learning task is usually formulated as a Markov decision process (MDP), which defines the
sets of states and actions, a transition function, and a reward function. Since the transition and the
reward functions are unknown to the agent, its task is precisely to learn them, or at least a model for
them.</p>
        <p>In Q-learning, the value of a state s and action a at time t is updated based on Eq. 1, where α ∈ [0, 1]
is the learning rate, γ ∈ [0, 1] is the discount factor, s(t+1) is the next state, and r is the reward received
when the agent moves from s(t) to s(t+1) after selecting action a in state s(t).</p>
        <p>Q(s(t), a) ← Q(s(t), a) + α (r + γ max<sub>a′</sub> Q(s(t+1), a′) − Q(s(t), a))   (1)</p>
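        <p>A minimal sketch of the tabular update in Eq. 1 (Python; state and action names, and the α and γ values, are illustrative placeholders):</p>

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.05, gamma=0.95):
    """One Q-learning step (Eq. 1): move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    best_next = max(Q[s_next][a2] for a2 in actions)
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])

# Q-table as a nested dict defaulting to 0 for unseen state-action pairs
Q = defaultdict(lambda: defaultdict(float))
q_update(Q, s="s0", a="go", r=-3.0, s_next="s1", actions=["go", "stop"])
# Q["s0"]["go"] is now 0.05 * (-3.0 + 0.95 * 0.0 - 0.0) = -0.15
```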
        <p>When there are multiple agents interacting in a common environment, the RL task (thus, multiagent
RL) is inherently more complex because agents’ actions are highly coupled and agents are trying to
adapt to others that are also learning. Despite this drawback, in many real-world problems, where the
control is decentralized, it might not be possible to avoid a multiagent RL formulation. This is the case
regarding the scenario we deal with in the present paper, namely control of traffic signals, whose basic
concepts we briefly review next.</p>
      </sec>
      <sec id="sec-1-2">
        <title>2.2. RL-based traffic signal control</title>
        <p>
          Besides safety and other issues, one aim of a traffic signal controller is to decide on a split of green times
among the various phases that were designed to deal with geometry and flow issues at an intersection.
This can be done in several ways (for more details, please see a textbook such as [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]). In this paper, the
controller is given a set of phases and has to decide which one will receive right of way (green light).
        </p>
        <p>A phase is defined as a group of non-conflicting movements (e.g., flow in two opposite traffic
directions) that can have a green light at the same time.</p>
        <p>In its simplest form, the control is based on fixed times, whose split of the green time among the
various phases can be computed based on historical data on traffic flow, if available. The problem with
this approach is that it may have difficulties adapting to changes in the traffic demand. This may lead
to an increase in the number of stopped vehicles. To mitigate this problem, it is possible to use an
adaptive scheme and thus give priority to lanes with longer queues (or other measures of performance).
Adaptive approaches based on RL were developed, as we discuss next.</p>
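        <p>As a toy illustration of such an adaptive scheme (not the controller used in this paper), a rule that gives right of way to the phase with the longest queue could look like:</p>

```python
def next_green_phase(queue_lengths):
    """Adaptive rule sketch: grant green to the phase with the longest queue."""
    return max(queue_lengths, key=queue_lengths.get)

# hypothetical queue counts per phase at one intersection
phase = next_green_phase({"north-south": 4, "east-west": 9})
```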
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Related work</title>
      <p>
        Traffic control techniques stem mainly from the areas of control theory and operations research. More
recently though, techniques from artificial intelligence and multiagent systems have also been employed,
especially in connection with RL. RL is used by traffic signals to learn a policy that maps states (usually
queues at intersections) to actions. Due to the number of works that employ RL for traffic control, the
reader is referred to surveys [
        <xref ref-type="bibr" rid="ref3">3, 4, 5, 6, 7</xref>
        ]. In any case, those surveys show that there has already been
a significant contribution of RL techniques to the control of traffic signals. However, some issues arise,
especially if tabular methods (e.g., the aforementioned Q-learning) are used. Tabular methods require
agents to visit all state-action pairs. In order to accelerate this process, in the present paper we propose
the use of transfer of experiences among similar agents.
      </p>
      <p>Other authors have addressed that issue in different ways. Next, we briefly review some of these
approaches.</p>
      <p>A first line of research does use tabular methods, with various levels of discretization of the state
space, thus depicting different levels of performance of the learning task. In this class, well-known
works include [8, 9, 10, 11], among others.</p>
      <p>Other works avoid using tabular methods, replacing them with various kinds of function approximation
techniques such as tile coding [12]; deep neural networks, such as DQN [13]; and linear function
approximation [14].</p>
      <p>Accelerating the RL task underlying traffic signal control was addressed using several techniques.
For example, in [15], agents communicate using a hierarchy of supervising agents; in [16], k-nearest
neighbors is employed to estimate the Q-values of a state by calculating the weighted average of the
Q-value estimates of the k nearest states.</p>
      <p>Finally, the idea of using a graph to establish the relationship among learning agents was used in traffic
networks in [17, 18, 19]; however, these works aimed at driver agents rather than junctions/intersections.</p>
    </sec>
    <sec id="sec-3">
      <title>4. Methodology</title>
      <p>As mentioned, our approach is based on using (potentially non-local) communication to augment the
information each traffic signal agent has. In the next sections, we detail the communication aspect
(Section 4.3). For this purpose we employ a graph of relationships. Because this graph does not
necessarily take only agents that are neighbors into account, we call it a virtual graph (henceforth
VG); this is explained in Section 4.1. The underlying MDP is explained in Section 4.2.</p>
      <sec id="sec-3-1">
        <title>4.1. The Road and the Virtual Graphs</title>
        <p>The road network, as conventionally used in traffic and transportation engineering, is a graph G =
(N, L), where N is the set of junctions (intersections), and L is the set of links. We use the term link,
since it is more commonly used in traffic engineering (and then reserve the term edge for another
graph, as described ahead). One instance of G is depicted in the next section (Fig. 2), where we discuss
that scenario in more detail. Here, it suffices to note that intersections B2 and C2 (for instance) are
geographical neighbors.</p>
        <p>We also define another graph – the virtual graph –, which accounts for non-local information. In
such a graph, we connect two junctions n1 ∈ N and n2 ∈ N, which are not necessarily physically close (as,
e.g., D2 and B2 in Fig. 2), but that may have similar patterns in terms of the learning task. We call this
a virtual graph, denoted by VG = (N, E), where N is as defined before, and E is the set of edges that
connect two junctions that have similar patterns.</p>
        <p>In order to define when two junctions are to be connected in the VG, historical information is collected
for a road network G. This information refers to several attributes: travel time over all links in an
intersection, fuel consumption, and several kinds of gas emissions such as CO, CO2, HC (hydrocarbon),
PMx (particulate matter), and NOx. This information is collected per lane, per time interval, and then
aggregated so that each junction has a single value associated with each of those attributes (in our case,
we collect such data from a microscopic simulator). Moreover, such information is aggregated over a
time window h. Then, values of all attributes are normalized between zero and one.</p>
        <p>After the normalization, the values of the attributes for each pair of junctions are compared. If
two junctions n1 and n2 have the same values for all attributes (given a tolerance value, i.e., ±τ), then an
edge connecting n1 and n2 is inserted in the VG.</p>
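        <p>The edge-insertion rule above can be sketched as follows (Python; the junction IDs and attribute values are made up, and the all-attributes-within-±τ comparison is our reading of the text):</p>

```python
from itertools import combinations

def build_virtual_edges(attributes, tau):
    """attributes: junction ID -> list of normalized attribute values in [0, 1].
    Insert an edge when every attribute of the two junctions agrees within +/- tau."""
    edges = []
    for (j1, v1), (j2, v2) in combinations(attributes.items(), 2):
        if all(abs(x - y) <= tau for x, y in zip(v1, v2)):
            edges.append((j1, j2))
    return edges

# hypothetical normalized attribute vectors (e.g., travel time and CO) per junction
attrs = {"B2": [0.400, 0.210], "D2": [0.401, 0.209], "C2": [0.700, 0.300]}
edges = build_virtual_edges(attrs, tau=0.002)  # connects B2 and D2 only
```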
        <p>Fig. 1a shows an instance of such a virtual graph, whereas Fig. 1b depicts a zoom of that graph, where
some relationships among similar junctions can be better seen. The labels of the vertices are formed by
the junction ID plus the timestamp in which their values were found to be similar.</p>
        <p>The agent keeps a record of each visited state-action pair, as well as an estimate of the Q-values
for each of these pairs. Such values depend on the rewards obtained by performing action a at state
s. Rewards are given by the number of vehicles that are queued in a given intersection. This value is
provided by the microscopic simulator. As explained next, an agent may also receive information from
other agents.</p>
      </sec>
      <sec id="sec-3-2">
        <title>4.3. How Communication Works</title>
        <p>Next we briefly explain how the communication is performed by the elements of the road network
G. We assume that every junction n ∈ N is equipped with a communication device (henceforth,
CommDev) that is able to send and receive messages to and from other junctions. For instance, in Fig. 2,
junctions D2 and B2 have a relationship in the VG that refers to a given time step t. Thus, they initiate a
communication via their respective communication devices in order for D2 to transfer its knowledge to
B2.</p>
        <p>This way, an agent that has received information perceives it as expected rewards for the actions
available to it in that particular state. This means that we extend the learning process vis-a-vis the
standard Q-learning algorithm. When using Q-learning, at any given step t the agents simply update
their Q-values based on the feedback from the action they have just taken at a given state. However, in
our case, agents also update their Q-values based on the expected rewards received by the CommDevs.
This means that at every time in which an agent i needs to make a decision, it asks and receives the
following tuple from each agent j with which it has a relationship in the VG: &lt;s, a, Q(s, a)&gt;, where
j is a virtual neighbor of i. Each of these tuples is then used to update the Q-table of i (in case i has
never seen a given pair (s, a), this pair is inserted in the table with the corresponding Q-value).</p>
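        <p>A minimal sketch of how the received tuples might be folded into the Q-table (Python; keeping the locally learned estimate for already-seen pairs is a simplifying assumption, since the text only specifies insertion of unseen pairs):</p>

```python
def apply_transfer(q_table, received):
    """Merge <s, a, Q(s, a)> tuples from virtual neighbors into agent i's Q-table.
    Unseen (s, a) pairs are inserted with the received value; pairs the agent has
    already visited keep the locally learned estimate (assumption)."""
    for s, a, q in received:
        q_table.setdefault((s, a), q)
    return q_table

q = {("s0", "go"): -0.15}
apply_transfer(q, [("s0", "go", -1.0), ("s1", "stop", -0.4)])
# ("s0", "go") keeps -0.15; ("s1", "stop") is inserted with -0.4
```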
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Experiments, Results, and Analysis</title>
      <sec id="sec-4-1">
        <title>5.1. Scenario</title>
        <p>Simulations were performed using SUMO [20], a microscopic simulator.</p>
        <p>The test scenario is the traffic network with 12 intersections depicted in Fig. 2. Given that the links
are one-way, not all intersections require a traffic signal controller. This is the case of the four corner
intersections, as well as intersections A2, C1, and B3. Other non-signalized intersections are B1 and
C3. These two are regulated by a priority mechanism defined in SUMO, namely all vehicles decelerate
before reaching the intersection and SUMO regulates the right of way at the intersection. We did this in
order to concentrate on the intersections defined by the arterial depicted in the middle of the figure,
namely the one that starts at intersection A2 and ends at D2. In this arterial, there are three signalized
intersections that are then each controlled by a learning agent, namely B2, C2, and D2. Recall that, due
to its geometry (no conflict between approach links), A2 does not require a traffic signal.</p>
        <p>Each of the agents learns using the method described in Section 4.</p>
        <p>We also stress that we have created this network with a particular feature in mind: the total number
of vehicles is kept constant (except for the first steps, when they are still being inserted at the link
B3-C3). This avoids many problems that are common in the literature, where it was not possible to
fully isolate the behavior of the RL approach from SUMO’s mechanism for routing vehicles.</p>
        <p>In the network depicted in Fig. 2, there is a re-routing device in the link located right before intersection
A2 that is responsible for re-routing each vehicle when it reaches that link. Essentially, vehicles keep
running in loops, never leaving the simulation. This allows us to control how the trips use the network,
which is an important question we deal with here, as discussed ahead (non-stationarity). Specifically,
200 trips are generated (which, for this network, represents almost 50% of overall density, given that its
maximum capacity is around 500 vehicles).</p>
        <p>Another feature of our scenario is non-stationarity due to changes in the traffic volume that traverses
each intersection. In previous works such as [21, 16] it was shown that the respective methods were
able to adapt to changes in traffic contexts, or simply contexts. By context we mean that, from time to time,
the way trips are distributed over the routes is changed. Note that this does not mean that the number
of trips (vehicles) using the network changes, but rather that each trip may use a different route, thus
the use of the links, and, hence, of the intersections, differs along time.</p>
        <p>These changes in context occur every 5,000 simulation steps, as depicted in Fig. 3, where the time
steps are represented on the x axis. Note that the figure just shows one change in context (at step 5,000),
while in fact there is a change every 5,000 steps, in a repeated way.</p>
        <p>As seen, in the first context (see bottom part of Fig. 3), both junctions B2 and C2 have similar situations
– receiving a larger volume of vehicles in the horizontal direction – while D2 receives nearly equal volumes
in both directions. At time step 5,000, there is a change in this pattern (see top part of the figure). At
time step 10,000 the former context occurs again, and so on.</p>
      </sec>
      <sec id="sec-4-2">
        <title>5.2. Values of the Parameters</title>
        <p>Each simulation runs for 15,000 seconds. Our plots show the average and deviations over 15 repetitions
of each case.</p>
        <p>The values used for minGreenTime and maxGreenTime were 10 and 50 seconds, respectively.
Also, we recall that one action time step corresponds to five seconds of real-life clock time.</p>
        <p>As for the learning parameters, after experimenting with other values, we have set α = 0.05, γ = 0.95,
and ε starting at 1.0 and decaying by a rate of 0.995 at each action time step, down to a minimum value
of ε = 0.05, which is commonly used in the literature.</p>
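        <p>With the values above, the exploration rate follows a simple multiplicative decay with a floor, which can be sketched as:</p>

```python
def decayed_epsilon(step, eps0=1.0, decay=0.995, eps_min=0.05):
    """Epsilon after `step` action time steps: eps0 * decay**step, floored at eps_min."""
    return max(eps_min, eps0 * decay ** step)

# epsilon starts at 1.0 and bottoms out at the 0.05 floor for large step counts
e_start = decayed_epsilon(0)     # 1.0
e_late = decayed_epsilon(1000)   # 0.05 (0.995**1000 is far below the floor)
```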
        <p>Finally, regarding the VG, we tried several values for the tolerance τ; here we show results for τ = 0.002.</p>
      </sec>
      <sec id="sec-4-3">
        <title>5.3. Results and Discussion</title>
        <p>In order to assess the effectiveness and efficiency of our approach, we compare it to the case where
Q-learning is used without the VG, as well as with the case in which no learning mechanism is used.
Henceforth the latter is referred to as fixed time, since there is a fixed control rule, i.e., the timings
of the green signals are optimized once (by SUMO) and remain fixed. We remark that other types of
controllers – not necessarily based on RL – could be employed for the sake of comparison; however, due
to space limitations, we cannot discuss such results here.</p>
        <p>In all cases, as mentioned, the network used is the one depicted in Fig. 2. The assessment is based on
number of stopped vehicles, as commonly used in the literature.</p>
        <p>Fig. 4 shows the number of stopped vehicles in the whole network (average over 15 repetitions),
and it functions as a baseline, given that there is no learning mechanism of any kind. Rather, as
aforementioned, the signal timings remain fixed. Note the oscillations found in the average curve (the
deviations are given as a light color shadow), caused by the fact that vehicles are randomly routed, but
the timing of the signals is fixed. Further, this is the case for all contexts.</p>
        <p>We now show and discuss the cases in which the agents learn using QL or QL plus the VG.</p>
        <p>(Figure: system_total_stopped over the simulation steps.)</p>
        <p>slightly better. This is due to the fact that the agent at B2 receives information from the other two agents
but, due to the initial exploration, such information may not be adequate, thus leading to the agent
performing actions that are not efficient for its particular state. With time, this changes; here the best
performance is seen within context number two, when B2 receives valuable information from the other
agents.</p>
        <p>For C2 and D2, there was no significant improvement. We are investigating two causes for this
behavior. A first hypothesis is that C2 actually receives less traffic than the other intersections, in all
contexts. This means that, even if the pattern is similar to B2, the magnitude of the flow is not the same.
A second explanation may be related to the way the state space is coded, which could influence the
way the information is communicated to other agents. Note that C2 has a vertical traffic direction that
is opposite to B2, thus technically the states may be different.</p>
        <p>As for D2, we only observe improvements in the third context, which means that this intersection is
receiving noisy information from the other agents. In a future work, we plan to filter out this noise by
means of a smarter mechanism, able to put less weight on the received information, especially when it
deviates too much from the already acquired one, i.e., putting more weight on the local information.</p>
        <p>In short, Fig. 5 shows that there is an advantage in letting virtual neighbors share their experiences. The
performance improves to a different extent among the agents; some profit more than others.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Conclusion and future work</title>
      <p>In this paper, we discuss how multiagent RL can be combined with transfer of experiences. The novelty
of our approach lies in the fact that such transfer does not happen only among neighboring agents, but
rather among agents that have similar traffic patterns at a given time step. This is particularly useful
when dealing with issues that arise when the state space is large. In such cases, a poor discretization
may lead to poor performance.</p>
      <p>We show the use of our approach using a traffic signal control scenario, where we also deal with
changes in context, i.e., in the traffic flow patterns.</p>
      <p>Our results show that, compared to a baseline stemming from standard RL, i.e., Q-learning, the
proposed approach performs better in terms of stopped vehicles and also presents fewer oscillations.</p>
      <p>We recall that we have considered emissions when constructing the virtual graph; however, we have
used only travel time as reward. Thus, as a first next step, we plan to extend the approach to deal with
more than one objective. Secondly, regarding the experiments, we intend to use scenarios in which
agents are heterogeneous (e.g., they have a different set of actions that arises due to a different set of
phases), and investigate the issue of how states are coded and information transferred to other agents.
Finally, we plan to extend the experiments in order to deal with situations where intersections require
more than two phases, and also to perform experiments comparing our results to other kinds of traffic
signal controllers.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>This work is partially supported by CAPES (Coordenação de Aperfeiçoamento de Pessoal de Nível
Superior - Brazil, Finance Code 001). Ana Bazzan is partially supported by CNPq under grant number
304932/2021-3.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Watkins</surname>
          </string-name>
          , Learning from Delayed Rewards,
          <source>Ph.D. thesis</source>
          , University of Cambridge,
          <year>1989</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R. P.</given-names>
            <surname>Roess</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. S.</given-names>
            <surname>Prassas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. R.</given-names>
            <surname>McShane</surname>
          </string-name>
          , Traffic Engineering, 3rd ed.,
          <source>Prentice Hall</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A. L. C.</given-names>
            <surname>Bazzan</surname>
          </string-name>
          ,
          <article-title>Opportunities for multiagent systems and multiagent reinforcement learning in traffic control</article-title>
          ,
          <source>Autonomous Agents and Multiagent Systems</source>
          <volume>18</volume>
          (
          <year>2009</year>
          )
          <fpage>342</fpage>
          -
          <lpage>375</lpage>
          . doi:10.1007/s10458-008-9062-9.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>