Workshop "From Objects to Agents" (WOA 2019) A Deep Reinforcement Learning Approach to Adaptive Traffic Lights Management Andrea Vidali, Luca Crociani, Giuseppe Vizzari, Stefania Bandini CSAI - Complex Systems & Artificial Intelligence Research Center, University of Milano-Bicocca, Milano, Italy name.surname@unimib.it Abstract—Traffic monitoring and control, as well as traffic with advances in machine learning, represents an opportunity simulation, are still significant and open challenges despite the for a scientific investigation of the possibility to employ significant researches that have been carried out, especially on these virtual environments as tools to explore the outcome of artificial intelligence approaches to tackle these problems. This paper presents a Reinforcement Learning approach to traffic potential regulation actions within specific situations, within a lights control, coupled with a microscopic agent-based simulator Reinforcement Learning [1] framework. (Simulation of Urban MObility - SUMO) providing a synthetic This paper represents a contribution within this line of but realistic environment in which the exploration of the outcome research and, in particular, we focus on a simple yet still of potential regulation actions can be carried out. The paper studied situation: a single four way intersection regulated by presents the approach, within the current research landscape, then the specific experimental setting and achieved results are traffic lights, that we want to manage through an autonomous described. agent perceiving the current traffic conditions, and exploiting the experience carried out in simulated situations, possibly Index Terms—reinforcement learning, traffic lights control, traffic management, agent-based simulation representing plausibile traffic conditions. The simulations are actually also agent-based, and in particular, for this study, they have been carried out in a tool for Simulation of Urban I. I NTRODUCTION MObility (SUMO) [2] providing a synthetic but realistic envi- Traffic monitoring and control and, in general, approaches ronment in which the exploration of the outcome of potential supporting the reduction of congestion still represent hot topics regulation actions can be carried out. An important aspect is for research of different disciplines, despite the substantial the fact that SUMO provides an Application Programming researches that have been devoted to these topics. The global Interface for interfacing with external programs, therefore we phenomenon of urbanization (half of the world’s population were able to define a plausible set of observable aspects of was living in cities at the end of 2008 and it is predicted the environment, control the traffic lights according to the that by 2050 about 64% of the developing world and 86% of decisions of the learning agent, as well as also to exploit some the developed world will be urbanized1 ) is in fact constantly stastics gathered by SUMO to describe the overall traffic flow changing the situation and making it actually harder to manage and therefore to define the reward to the actions carried out such a concentration of population and transportation de- by the traffic lights control agent. mand. 
Technological developments, among which autonomous driving represents just the most futuristic one (at least from a popular culture perspective), represent at the same time attempts to tackle these issues and further challenges, in terms of potential developments whose introduction requires further study and analysis of the potential impact and implications. Artificial Intelligence plays an important role within this framework; even without considering the obvious relevance to the autonomous driving initiative, we focus here on two aspects: (i) the regulation of traffic patterns, especially based on (ii) the analysis of situations by means of agent-based simulations, in which the behaviour of drivers and other relevant entities is modeled and computed within a synthetic environment. The latter, in particular, have reached a level of sufficient complexity and flexibility, and they have proven their capability to support decision makers in the exploration of alternative ways to manage traffic within urban settings. On the side of the regulation of traffic patterns, the availability of these simulators, coupled with advances in machine learning, represents an opportunity for a scientific investigation of the possibility to employ these virtual environments as tools to explore the outcome of potential regulation actions within specific situations, within a Reinforcement Learning [1] framework.

This paper represents a contribution within this line of research and, in particular, we focus on a simple yet still studied situation: a single four-way intersection regulated by traffic lights, which we want to manage through an autonomous agent perceiving the current traffic conditions and exploiting the experience gathered in simulated situations, possibly representing plausible traffic conditions. The simulations are also agent-based and, for this study, they have been carried out in a tool for Simulation of Urban MObility (SUMO) [2], providing a synthetic but realistic environment in which the exploration of the outcome of potential regulation actions can be carried out. An important aspect is the fact that SUMO provides an Application Programming Interface for interfacing with external programs; therefore we were able to define a plausible set of observable aspects of the environment, control the traffic lights according to the decisions of the learning agent, and also exploit some statistics gathered by SUMO to describe the overall traffic flow and therefore to define the reward for the actions carried out by the traffic lights control agent.

The paper is organized as follows: we first provide a compact description of the relevant portion of the state of the art in traffic lights management with RL approaches, then we introduce the experimental setting we adopted for this study. The RL approach we defined and adopted is given in Section IV, then the achieved results are described. Conclusions and future developments end the paper.

II. RELATED WORKS

A. Reinforcement Learning

One interpretation of the goals of AI is to develop machines that resemble the intelligent behavior of a human being. In order to achieve this goal, an AI system should be able to interact with the environment and learn how to correctly act inside it. An established area of AI that has proven capable of experience-driven autonomous learning is reinforcement learning [1]. Several complex tasks have been successfully completed using reinforcement learning in multiple fields, such as games [3], robotics [4], and traffic signal control.

In a Reinforcement Learning (RL) problem, an autonomous agent observes the environment and perceives a state s_t, which is the state of the environment at time t. Then the agent chooses an action a_t, which leads to a transition of the environment to the state s_{t+1}. After the environment transition, the agent obtains a reward r_{t+1}, which tells the agent how good a_t was with respect to a performance measure. The goal of the agent is to learn the policy π* that maximizes the cumulative expected reward obtained as a result of actions taken while following π*. The standard cycle of reinforcement learning is shown in Figure 1.

Fig. 1. The reinforcement learning cycle.
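The interaction cycle of Figure 1 can be summarized by a short sketch; this is a generic illustration rather than the code used in this paper, and the environment and agent interfaces (reset, step, choose_action, observe) are hypothetical placeholders.

```python
# Minimal sketch of the RL cycle of Figure 1 (illustrative only; env.reset/env.step
# and agent.choose_action/observe are hypothetical placeholder interfaces).
def run_episode(env, agent, max_steps=5400):
    state = env.reset()                        # initial perception s_t
    total_reward = 0.0
    for t in range(max_steps):
        action = agent.choose_action(state)    # a_t selected by the current policy
        next_state, reward = env.step(action)  # transition to s_{t+1} and reward r_{t+1}
        agent.observe(state, action, reward, next_state)  # learning update
        state = next_state
        total_reward += reward
    return total_reward
```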
B. Learning in Traffic Signal Control

Traffic signal control is a well-suited application context for RL techniques: in this framework, one or more autonomous agents have the goal of maximizing the efficiency of the traffic flow that drives through one or more intersections controlled by traffic lights. The use of RL for traffic signal control is motivated by several reasons [5]: (i) if trained properly, RL agents can adapt to different situations (e.g. road accidents, bad weather conditions); (ii) RL agents can self-learn without supervision or prior knowledge of the environment; (iii) the agent only needs a simplified model of the environment (essentially related to the state representation), since the agent learns using the system performance metric (i.e. the reward). RL techniques applied to traffic signal control address the following challenges [5]:

• Inappropriate traffic light sequence. Traffic lights usually choose the phases according to a static, predefined policy. This method could cause the activation of an inappropriate traffic light phase in a situation that could cause an increase in travel times.
• Inappropriate traffic light durations. Every traffic light phase has a predefined duration which does not depend on the current traffic conditions. This behavior could cause unnecessary waiting for the green phase.

Although the above are potential advantages of the RL approach to traffic signal control, not all of them have already been achieved, and (as we will show in the remainder of the paper) the present approach only represents an initial step in this overall line of work.

In order to apply a RL algorithm, it is necessary to define the state representation, the available actions and the reward function; in the following, we describe the most widely adopted approaches for the design of these elements within the context of Traffic Signal Control.

1) State representation: The state is the agent's perception of the environment in an arbitrary step. In the literature, state space representations differ particularly in information density. In low information density representations, the intersection's lanes are usually discretized in cells along the length of the lane. Lane cells are then mapped to cells of a vector, which marks 1 if a vehicle is inside the lane cell, 0 otherwise [6]. Some approaches include additional information, adopting such a vector of car presence with the addition of a vector encoding the relative velocity of vehicles [7]. The current traffic light phase could also be added as a third vector [8]. Regarding state representations with high information density, the agent usually receives an image of the current situation of the whole intersection, i.e. a snapshot of the simulator being used; multiple successive snapshots can be stacked together to give the agent a sense of the vehicle motion [9].

2) Actions representation: In the context of traffic signal control, the agent's actions are implemented with different degrees of flexibility, described below. In the category of action sets with low flexibility, the agent can choose among a defined set of light combinations; when an action is selected, a fixed amount of time elapses before the agent can select a new configuration [7]. Some works gave the agent more flexibility by defining phase durations of variable length [10]. An agent with higher flexibility chooses an action at every step of the simulation from a fixed set of light combinations; however, the selected action is not activated if the minimum amount of time required to release at least one vehicle has not passed [8], [9]. A slightly different approach is to have a defined cycle of light combinations activated in the intersection: the agent's action is the choice of when it is time to switch to the next light combination, and the decision is made at every step [11].

3) Reward representation: The reward is used by the agent to understand the effects of the latest action taken in the latest state; it is usually defined as a function of some performance indicator of the intersection efficiency, such as vehicles' delays, queue lengths, waiting times or overall throughput. Most of the works compute the change in cumulative vehicle delay between actions, where the vehicle delay is defined as the number of seconds a vehicle is steady [8], [9]. Similarly, the cumulative vehicle staying time can be used, which is the number of seconds the vehicle has been steady since its entrance in the environment [7]. Moreover, some works combine multiple indicators in a weighted sum [11].
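As an illustration of the low information density encodings and delay-change rewards discussed above, the following sketch shows one possible realization (presence vector, relative velocity vector, one-hot phase, and a delay-difference reward); the helper names and cell granularity are hypothetical and not taken verbatim from the cited works.

```python
import numpy as np

# Illustrative sketch of a low-density state encoding and a delay-change reward.
def encode_state(cell_occupancy, cell_speeds, current_phase, num_phases):
    presence = np.array([1.0 if occupied else 0.0 for occupied in cell_occupancy])
    velocity = np.array(cell_speeds) / max(max(cell_speeds), 1e-6)  # relative speeds
    phase = np.zeros(num_phases)
    phase[current_phase] = 1.0                  # one-hot encoding of the active phase
    return np.concatenate([presence, velocity, phase])

def delay_change_reward(prev_cumulative_delay, cumulative_delay):
    # positive when the last action reduced the cumulative vehicle delay
    return prev_cumulative_delay - cumulative_delay
```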
C. Adopted models and learning algorithms

Recent reinforcement learning research has proposed multiple possible solutions to the traffic signal control problem, from which it emerges that different algorithms and neural network structures can be used, although some common techniques are necessary but not sufficient to ensure good performance.

The most widely used algorithm to address the problem is Q-learning. The optimal behavior of the agent is achieved with the use of neural networks to approximate the Q-values given a state. Often, this approach includes a Convolutional Neural Network (CNN) to process the environment state and learn features from an image [9] or from a spatial representation [8], [7].

Genders and Razavi [8] and Gao et al. [7] make use of a Convolutional Neural Network to learn features from their spatial representation of the environment. The output of this network, together with the current phase, is passed to two fully connected layers that connect to the outputs represented by Q-values. This method showed good results in [7] against different traffic lights policies, such as longest-queue-first and fixed times, while in [8] it is compared to a shallow neural network; although it shows good performance, an evaluation against real-world traffic lights would lead to more significant results.

Mousavi et al. [9] analyzed two approaches to the traffic signal control problem: the first is value-based, while the second is policy-based. In the first approach, action values are predicted by minimizing the mean-squared error of the Q-values with stochastic gradient descent. In the alternative approach, the policy is learned by updating the policy parameters in such a way that the probability of taking good actions increases. A CNN is used as a function approximator to extract features from the image of the intersection, where in the value-based approach the output is the value of the actions, and in the policy-based approach it is a probability distribution over actions. Results show that both approaches achieve good performance against a defined baseline and do not suffer from instability issues.

In [10], a deep stacked autoencoder (SAE) neural network is used to learn the Q-values. This approach uses autoencoders to minimize the error between the Q-value predicted by the encoder neural network and the target Q-value, using a specific loss function. It is shown that it achieves better performance than traditional RL methods.
III. EXPERIMENTAL SETTING

The traffic microsimulator used for this research is Simulation of Urban MObility (SUMO) [12]. SUMO provides a software package which includes an infrastructure editor, a simulator interface and an application programming interface (API). These elements enable the user to design and implement custom configurations and functionalities of a road infrastructure and to exchange data during the traffic simulation.

In this research, the possibility of improving the traffic flow that drives through an intersection controlled by traffic lights is investigated using artificial intelligence techniques. The agent is represented by the traffic light system that interacts with the environment in order to maximize a certain measure of traffic efficiency. Given this general premise, the problem tackled in this paper is defined as follows: given the state of the intersection, what is the traffic light phase that the agent should choose, selected from a fixed set of predefined actions, in order to maximize the reward and consequently optimize the traffic efficiency of the intersection.

The typical workflow of the agent is shown in Figure 2. It should be underlined that in this application with SUMO the passage of time is represented in simulation steps, but the agent only operates at certain steps, after the environment has evolved enough. Therefore, in this paper every step dedicated to the agent's workflow is called an agentstep, while the steps dedicated to the simulation are simply called "steps". Hence, after a certain number of simulation steps, the agent starts its sequence of operations by gathering the current state of the environment. The agent also calculates the reward of the previously selected action, using some measure of the current traffic situation. The sample of data containing the information about the latest simulation steps is saved to a memory and later extracted for a training session. The agent is then ready to select a new action based on the current state of the environment, which resumes the simulation until the next agent interaction.

Fig. 2. The agent's workflow.
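The agentstep workflow of Figure 2 can be sketched as follows; this is an illustration of the described loop under stated assumptions, not the authors' code. The helpers get_state, compute_reward and apply_action, the agent interface, and the configuration file name "intersection.sumocfg" are hypothetical placeholders, and the fixed number of steps advanced per action is a simplification.

```python
import traci  # SUMO's Python API (TraCI)

# Sketch of one training episode following the agentstep workflow of Figure 2.
def run_training_episode(agent, green_duration=10):
    traci.start(["sumo", "-c", "intersection.sumocfg"])  # assumed config name
    old_state, old_action = None, None
    step = 0
    while step < 5400:                      # 1 simulation step = 1 simulated second
        state = get_state()                 # current perception of the intersection
        if old_state is not None:
            reward = compute_reward()       # outcome of the previously selected action
            agent.memory.add((old_state, old_action, reward, state))
            agent.replay()                  # hypothetical: training instance on a batch
        action = agent.choose_action(state)
        apply_action(action, old_action)    # sets the green (and, if needed, yellow) phase
        for _ in range(green_duration):     # let the simulation evolve until the next agentstep
            traci.simulationStep()
            step += 1
        old_state, old_action = state, action
    traci.close()
```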
The environment where the agent acts is represented in Figure 3. It is a 4-way intersection where 4 lanes per arm approach the intersection from the compass directions, leading to 4 lanes per arm leaving the intersection. Each arm is 750 meters long. On every arm, each lane defines the possible directions that a vehicle can follow: the right-most lane enables vehicles to turn right or go straight, the two central lanes bind the driver to go straight, while on the left-most lane the left turn is the only direction allowed. In the center of the intersection, a traffic light system, controlled by the agent, manages the approaching traffic. In particular, on every arm the left-most lane has a dedicated traffic light, while the other three lanes share a traffic light. Every traffic light in the environment operates according to the common European regulations, with the only exception being the absence of time between the end of a yellow phase and the start of the next green phase. In this environment pedestrians, sidewalks and pedestrian crossings are not included.

Fig. 3. The environment.

A. Training setup and traffic generation

The entire training is divided into multiple episodes; the total number of episodes is 300. By default, SUMO uses a time resolution of 1 second per step, and the duration of each episode is set at 1 hour and 30 minutes, therefore the total number of steps per episode is equal to 5400. 300 episodes of 1.5 hours each are equivalent to almost 19 days of continuous traffic, and the entire training takes about 6 hours on a high-end laptop.

In a simulated intersection, traffic generation is a crucial part that can have a big impact on the agent's performance. In order to maintain a high degree of realism, in each episode the traffic is generated according to a Weibull distribution with a shape parameter equal to 2. An example is shown in Figure 4. The distribution is presented in the form of a histogram, where the steps of one simulation episode are on the x-axis and the number of vehicles generated in that step window is on the y-axis. The Weibull distribution approximates specific traffic situations, where during the early stage the number of cars rises, representing a peak hour; then the number of incoming cars slowly decreases, describing the gradual mitigation of the traffic congestion. Also, every vehicle generated has the same physical dimensions and performance. A code sketch of the generation process follows the scenario list below.

Fig. 4. Traffic generation distribution over a single episode.

The traffic distribution described provides the exact step of the episode at which a vehicle will be generated. For every vehicle scheduled, its source arm and destination arm are determined using a random number generator which has a different seed in every episode, so it is not possible to have two equivalent episodes. In order to obtain a truly adaptive agent, the simulation should include a significant variety of traffic flows and patterns [13]. Therefore, four different scenarios are defined, as follows.

• High-traffic scenario. 4000 cars approach the intersection from every arm, evenly distributed. Then, 75% of the generated cars go straight and 25% turn left or right at the intersection.
• Low-traffic scenario. 600 cars approach the intersection from every arm, evenly distributed. Then, 75% of the generated cars go straight and 25% turn left or right at the intersection.
• NS-traffic scenario. 2000 cars approach the intersection, with 90% of them coming from the North or South arm. Then, 75% of the generated cars go straight and 25% turn left or right at the intersection.
• EW-traffic scenario. 2000 cars approach the intersection, with 90% of them coming from the East or West arm. Then, 75% of the generated cars go straight and 25% turn left or right at the intersection.

Each scenario corresponds to one single episode, and the scenarios cycle during the training always in the same order.
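The following sketch shows one way the Weibull-shaped departure schedule and the routing split could be produced; the scaling of the sampled values onto the 5400-step episode and the routing helper are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

# Sketch of Weibull-based departure timings (shape = 2), as described above.
def generate_departure_steps(n_cars=1000, max_steps=5400, seed=0):
    rng = np.random.default_rng(seed)           # a different seed is used in every episode
    timings = np.sort(rng.weibull(2, n_cars))   # shape parameter = 2
    # rescale the sampled values onto the [0, max_steps] interval
    steps = np.rint(timings / timings.max() * max_steps).astype(int)
    return steps, rng

def assign_route(rng, straight_prob=0.75):
    # 75% of the generated vehicles go straight, 25% turn left or right
    return "straight" if rng.random() < straight_prob else rng.choice(["left", "right"])
```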
IV. DESCRIPTION OF THE REINFORCEMENT LEARNING APPROACH

In order to design a system based on the reinforcement learning framework, it is necessary to define the state representation, the action set, the reward function and the agent learning techniques involved. It should be noted that the agent's elements in this paper could be easily replaced by a traffic monitoring system in a real-world deployment, compared to other relevant studies on this topic, which have higher requirements in terms of technical feasibility.

A. State representation

The state of the agent describes a representation of the situation of the environment at a given agentstep t, and it is usually denoted with s_t. To allow the agent to effectively learn to optimize the traffic, the state should provide sufficient information about the distribution of cars on each road. The objective of the chosen representation is to let the agent know the position of the vehicles inside the environment at agentstep t. For this purpose, the approach proposed in this paper is inspired by the DTSE [8], with the difference that less information is encoded in the state. In particular, this state design includes only spatial information about the vehicles inside the environment, and the cells used to discretize the continuous environment are not regular. The chosen design for the state representation is focused on realism: recent works on traffic signal control proposed information-rich states, but in reality they are hard to implement, since the information used in that kind of representation is difficult to gather. Therefore, this paper investigates the possibility of obtaining good results with a simple and easy-to-apply state representation.

Technically, in each arm of the intersection the incoming lanes are discretized in cells that identify the presence or absence of a vehicle inside them. Figure 5 shows the state representation for the west arm of the intersection. Between the beginning of the road and the intersection's stop line there are 20 cells: 10 of them are located along the left-only lane, while the other 10 cover the other three lanes. Therefore, in the whole intersection there are 80 cells. Not every cell has the same size: the further the cell is from the stop line, the longer it is, so more lane length is covered. The choice of the length of every cell is not trivial: if the cells were too long, some cars approaching the stop line might not be detected; if the cells were too short, the number of cells required to cover the length of the lane would increase, leading to higher computational complexity. In this paper, the length of the shortest cells, which are also the closest to the stop line, is exactly 2 meters longer than the length of a car. In summary, whenever the agent observes the state of the environment, it obtains the set of cells that describes the presence or absence of vehicles in every area of the incoming roads.

Fig. 5. Design of the state representation in the west arm of the intersection, with cell lengths.
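A possible way to build the 80-cell boolean state through SUMO's TraCI API is sketched below. The lane identifiers, the cell boundary distances and the mapping of lanes to cell indices are hypothetical placeholders: the actual values depend on the network file and on the irregular cell lengths shown in Figure 5.

```python
import numpy as np
import traci

# Assumed distance thresholds (meters from the stop line) delimiting the 10 cells
# of a lane group; these are illustrative values, not the paper's exact ones.
CELL_BOUNDARIES = [7, 14, 21, 28, 40, 60, 100, 160, 400, 750]

def get_state(incoming_lanes, lane_lengths, cell_offsets):
    """cell_offsets maps each incoming lane (or lane group) to the first of its 10 cells."""
    state = np.zeros(80, dtype=np.float32)
    for veh_id in traci.vehicle.getIDList():
        lane_id = traci.vehicle.getLaneID(veh_id)
        if lane_id not in incoming_lanes:
            continue
        # distance from the stop line = lane length - position along the lane
        dist = lane_lengths[lane_id] - traci.vehicle.getLanePosition(veh_id)
        for i, boundary in enumerate(CELL_BOUNDARIES):
            if dist < boundary:
                state[cell_offsets[lane_id] + i] = 1.0  # mark the cell as occupied
                break
    return state
```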
B. Action set

The action set identifies the possible actions that the agent can take. The agent is the traffic light system, so taking an action translates to activating a green phase for a set of lanes for a fixed amount of time, chosen from a predefined set of green phases. In this paper, the green time is set at 10 seconds and the yellow time is set at 4 seconds. Formally, the action space is defined in the set (1), which includes every possible action that the agent can take.

A = {NSA, NSLA, EWA, EWLA}    (1)

Every action of set (1) is described below, and Figure 6 shows a graphical representation of the four possible actions.

• North-South Advance (NSA): the green phase is active for vehicles that are in the north and south arms and want to proceed straight or turn right.
• North-South Left Advance (NSLA): the green phase is active for vehicles that are in the north and south arms and want to turn left.
• East-West Advance (EWA): the green phase is active for vehicles that are in the east and west arms and want to proceed straight or turn right.
• East-West Left Advance (EWLA): the green phase is active for vehicles that are in the east and west arms and want to turn left.

Fig. 6. Graphical representation of the four possible actions.

If the action chosen in agentstep t is the same as the action taken in the last agentstep t − 1 (i.e. the traffic light combination is the same), there is no yellow phase and the current green phase persists. On the contrary, if the action chosen in agentstep t is not equal to the previous action, a 4-second yellow phase is initiated between the two actions. This means that the number of simulation steps between two identical consecutive actions is 10, since 1 simulation step is equal to 1 second in SUMO. When the two consecutive actions are different, the yellow phase counts as 4 extra simulation steps, and therefore the total number of simulation steps between actions is 14. Figure 7 shows a brief scheme of this process.

Fig. 7. Possible differences of simulation steps between actions.
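The phase-switching logic described above could be applied as in the following sketch; the traffic light ID "TL" and the green/yellow phase index mappings are assumed placeholders that depend on the SUMO network definition.

```python
import traci

GREEN_TIME, YELLOW_TIME = 10, 4  # seconds, as defined above

# Sketch of applying the chosen action with the intermediate yellow phase.
def apply_action(action, previous_action, green_phase_index, yellow_phase_index):
    steps = 0
    if previous_action is not None and action != previous_action:
        # a different action was chosen: insert the 4 s yellow phase first
        traci.trafficlight.setPhase("TL", yellow_phase_index[previous_action])
        for _ in range(YELLOW_TIME):
            traci.simulationStep()
            steps += 1
    traci.trafficlight.setPhase("TL", green_phase_index[action])
    for _ in range(GREEN_TIME):
        traci.simulationStep()
        steps += 1
    return steps  # 10 steps for a repeated action, 14 when a yellow phase is added
```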
C. Reward function

In reinforcement learning, the reward represents the feedback from the environment after the agent has chosen an action. The agent uses the reward to understand the result of the taken action and to improve the model for future action choices; therefore, the reward is a crucial aspect of the learning process. The reward usually has two possible signs: positive or negative. A positive reward is generated as a consequence of good actions, a negative reward from bad actions. In this application, the objective is to maximize the traffic flow through the intersection over time. In order to achieve this goal, the reward should be derived from some performance measure of traffic efficiency, so that the agent is able to understand whether the taken action reduces or increases the intersection efficiency. In traffic analysis, several measures are used [14], such as throughput, mean delay and travel time. In this paper, two reward functions are presented, which use two slightly different traffic measures; they are described below.

1) Literature reward function: The first reward function is called "literature" because it is inspired by similar studies on this topic. The literature reward function uses as a metric the total waiting time, defined as in equation (2).

twt_t = Σ_{veh=1..n} wt(veh,t)    (2)

where wt(veh,t) is the amount of time, in seconds, a vehicle veh has had a speed of less than 0.1 m/s at agentstep t, and n represents the total number of vehicles in the environment at agentstep t. Therefore, twt_t is the total waiting time at agentstep t. From this metric, the literature reward function can be defined as a function of twt_t, as shown in (3).

r_t = 0.9 · twt_{t−1} − twt_t    (3)

where r_t represents the reward at agentstep t, and twt_t and twt_{t−1} represent the total waiting time of all the cars in the intersection captured respectively at agentsteps t and t − 1. The 0.9 coefficient helps with the stability of the training process. In a reinforcement learning application the reward can usually be positive or negative, and this implementation is no exception: equation (3) is designed in such a way that when the agent chooses a bad action it returns a negative value, and when it chooses a good action it returns a positive value. A bad action can be described as an action that, in the current agentstep t, adds more vehicles to the queues compared to the situation in the previous agentstep t − 1, resulting in higher waiting times compared to the previous agentstep. This behavior increases the twt for the current agentstep t, and consequently equation (3) assumes a negative value. The more vehicles are added to the queues at agentstep t, the more negative r_t will be, and therefore the worse the action will be evaluated by the agent. The same concept applies to good actions.

The problem with this reward function lies in the choice of the metric, and it emerges when the following situation arises. During the High-traffic scenario, very long queues appear. When the agent activates the green phase for a long queue, the departure of cars creates a wave of movement that traverses the entire queue. The reward associated with this phase activation is received not only in the next agentstep, as it should be, but also in the very next ones. That is because the movement wave persists longer than the interval between agentsteps, and the wave resets the waiting times of the cars in the queue as they briefly move, misleading the agent about the received reward.

2) Alternative reward function: The alternative reward function uses a metric that is slightly different from the former one, namely the accumulated total waiting time, defined in equation (4).

atwt_t = Σ_{veh=1..n} awt(veh,t)    (4)

where awt(veh,t) is the amount of time, in seconds, a vehicle veh has had a speed of less than 0.1 m/s since its spawn into the environment, measured at agentstep t, and n represents the total number of vehicles in the environment at agentstep t. Therefore, atwt_t is the accumulated total waiting time at agentstep t. With this metric, when a vehicle departs but does not manage to cross the intersection, the value of atwt_t does not reset (unlike the value of twt_t), avoiding the misleading reward associated with the literature reward function when a long queue builds up at the intersection. Once the metric is set, the alternative reward function is defined as in equation (5).

r_t = atwt_{t−1} − atwt_t    (5)

where r_t represents the reward at agentstep t, and atwt_t and atwt_{t−1} represent the accumulated total waiting time of all the cars in the intersection captured respectively at agentsteps t and t − 1.
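The two metrics and rewards could be gathered through TraCI as sketched below. Note that getWaitingTime() counts the seconds below 0.1 m/s since the last stop, matching twt, while getAccumulatedWaitingTime() accumulates waiting time only over SUMO's configurable interval (the --waiting-time-memory option), which would have to cover the whole episode for atwt; this is an illustrative assumption, not necessarily the authors' exact implementation.

```python
import traci

def total_waiting_time():
    # twt_t: waiting time since the last stop, summed over all vehicles
    return sum(traci.vehicle.getWaitingTime(v) for v in traci.vehicle.getIDList())

def accumulated_total_waiting_time():
    # atwt_t: accumulated waiting time over the configured interval, summed over all vehicles
    return sum(traci.vehicle.getAccumulatedWaitingTime(v) for v in traci.vehicle.getIDList())

def literature_reward(prev_twt, twt):
    return 0.9 * prev_twt - twt          # equation (3)

def alternative_reward(prev_atwt, atwt):
    return prev_atwt - atwt              # equation (5)
```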
D. Deep Q-Learning

The learning mechanism adopted in this paper is called Deep Q-Learning, which is a combination of two aspects widely adopted in the field of reinforcement learning: deep neural networks and Q-Learning. Q-Learning [15] is a form of model-free reinforcement learning [16]. It consists of assigning a value, called the Q-value, to an action taken from a given state of the environment. Formally, in the literature, the Q-value update is defined as in equation (6).

Q(s_t, a_t) = Q(s_t, a_t) + α · (r_{t+1} + γ · max_A Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t))    (6)

where Q(s_t, a_t) is the value of the action a_t taken from state s_t. The equation consists of updating the current Q-value with a quantity discounted by the learning rate α. Inside the parentheses, the term r_{t+1} represents the reward associated with taking action a_t from state s_t; the subscript t + 1 is used to emphasize the temporal relationship between taking the action a_t and receiving the consequent reward. The term Q(s_{t+1}, a_{t+1}) represents the immediate future's Q-value, where s_{t+1} is the next state into which the environment has evolved after taking action a_t in state s_t. The expression max_A means that, among the possible actions in state s_{t+1}, the most valuable one is selected. The term γ is the discount factor, which assumes a value between 0 and 1, lowering the importance of the future reward compared to the immediate reward.

In this paper, a slightly different version of equation (6) is used, presented in equation (7); from this point on it will be called the Q-learning function.

Q(s_t, a_t) = r_{t+1} + γ · max_A Q′(s_{t+1}, a_{t+1})    (7)

where r_{t+1} is the reward received after taking action a_t in state s_t. The term Q′(s_{t+1}, a_{t+1}) is the Q-value associated with taking action a_{t+1} in state s_{t+1}, i.e. the next state after taking action a_t in state s_t. As seen in equation (6), the discount factor γ denotes a small penalization of the future reward compared to the immediate reward. Once the agent is trained, the best action a_t taken from state s_t will be the one that maximizes the function Q(s_t, a_t); in other words, maximizing the Q-learning function means following the best strategy that the agent has learned.

In a reinforcement learning application, the state space is often so large that it is impractical to discover and save every state-action pair. Therefore, the Q-learning function is approximated using a neural network. In this paper, a fully connected deep neural network is used, composed of an input layer of 80 neurons, 5 hidden layers of 400 neurons each with rectified linear unit (ReLU) activations [17], and an output layer of 4 neurons with a linear activation function, each one representing the value of an action given a state. A graphical representation of the deep neural network is shown in Figure 8.

Fig. 8. Scheme of the deep neural network.
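A sketch of the described 80-400x5-4 fully connected network follows. The paper does not state which deep learning framework was used, so Keras is assumed here purely for illustration; the optimizer and learning rate are also assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_q_network(input_dim=80, hidden_units=400, num_actions=4):
    model = keras.Sequential()
    model.add(layers.Input(shape=(input_dim,)))           # 80 state cells
    for _ in range(5):                                     # 5 hidden layers with ReLU
        model.add(layers.Dense(hidden_units, activation="relu"))
    model.add(layers.Dense(num_actions, activation="linear"))  # one Q-value per action
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3),  # assumed settings
                  loss="mse")                              # regression on Q-value targets
    return model
```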
E. The training process

Experience replay [18] is a technique adopted during the training phase in order to improve the performance of the agent and the learning efficiency. It consists of submitting to the agent the information needed for learning in the form of a randomized group of samples called a batch, instead of immediately submitting the information that the agent gathers during the simulation (commonly called online learning). The batch is taken from a data structure intuitively called memory, which stores every sample collected during the training phase. A sample m is formally defined as the quadruple (8).

m = {s_t, a_t, r_{t+1}, s_{t+1}}    (8)

where r_{t+1} is the reward received after taking action a_t from state s_t, which evolves the environment into the next state s_{t+1}. This technique is adopted to remove correlations in the observation sequence: since the state s_{t+1} is a direct evolution of the state s_t, this correlation can decrease the training capability of the agent. Figure 9 shows a representation of the data collection task.

Fig. 9. Scheme of the data collection.

As stated earlier, the experience replay technique needs a memory, which is characterized by a memory size and a batch size. The memory size represents how many samples the memory can store and is set at 50000 samples. The batch size is defined as the number of samples that are retrieved from the memory in one training instance and is set at 100. If at a certain agentstep the memory is full, the oldest sample is removed to make space for the new one.

A training instance consists of learning the Q-value function iteratively, using the information contained in the batch of extracted samples. Every sample in the batch is used for training. From the standpoint of a single sample, which contains the elements {s_t, a_t, r_{t+1}, s_{t+1}}, the following operations are executed:

1) Prediction of the Q-values Q(s_t), which represent the current knowledge that the agent has about the action values from s_t.
2) Prediction of the Q-values Q′(s_{t+1}), which represent the knowledge of the agent about the action values starting from the state s_{t+1}.
3) Update of Q(s_t, a_t), which represents the value of the particular action a_t selected by the agent during the simulation. This value is overwritten using the Q-learning function described in equation (7): the element r_{t+1} is the reward associated with the action a_t, while max_A Q′(s_{t+1}, a_{t+1}) is obtained using the prediction of Q′(s_{t+1}) and represents the maximum expected future reward, i.e. the highest action value expected by the agent starting from state s_{t+1}; it is discounted by the factor γ, which gives more importance to the immediate reward.
4) Training of the neural network. The input is the state s_t, while the desired output is the updated set of Q-values Q(s_t, a_t), which now includes the maximum expected future reward thanks to the Q-value update.

A code sketch of this procedure, together with the exploration policy described below, is given at the end of this subsection.

Once the deep neural network has sufficiently approximated the Q-learning function, the best traffic efficiency is achieved by selecting the action with the highest value given the current state. A major problem in any reinforcement learning task is the action-selection policy while learning: whether to take an exploratory action and potentially learn more, or to take an exploitative action and attempt to optimize the current knowledge about the environment evolution. In this paper the ε-greedy exploration policy is chosen, represented by equation (9). It defines a probability ε for the current episode h of choosing an explorative action, and consequently a probability 1 − ε of choosing an exploitative action.

ε_h = 1 − h/H    (9)

where h is the current training episode and H is the total number of episodes. Initially, ε = 1, meaning that the agent exclusively explores. However, as training progresses, the agent increasingly exploits what it has learned, until it exclusively exploits.
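The replay memory, the batch update of steps 1)–4) following equation (7), and the ε-greedy selection of equation (9) could be organized as in the following sketch. The structure is illustrative: the "model" is assumed to expose Keras-style predict/fit methods as in the earlier network sketch, and the value of the discount factor γ is an assumption, since the paper does not report it.

```python
import random
import numpy as np

class Memory:
    def __init__(self, max_size=50000):
        self.samples, self.max_size = [], max_size

    def add(self, sample):                       # sample = (s_t, a_t, r_{t+1}, s_{t+1})
        self.samples.append(sample)
        if len(self.samples) > self.max_size:
            self.samples.pop(0)                  # drop the oldest sample

    def get_batch(self, batch_size=100):
        return random.sample(self.samples, min(batch_size, len(self.samples)))

def train_batch(model, batch, gamma=0.75):       # gamma is an assumed value
    states = np.array([s for s, _, _, _ in batch])
    next_states = np.array([s2 for _, _, _, s2 in batch])
    q = model.predict(states, verbose=0)             # step 1: Q(s_t)
    q_next = model.predict(next_states, verbose=0)   # step 2: Q'(s_{t+1})
    for i, (_, action, reward, _) in enumerate(batch):
        q[i][action] = reward + gamma * np.max(q_next[i])  # step 3: equation (7)
    model.fit(states, q, epochs=1, verbose=0)        # step 4: train on the updated targets

def choose_action(model, state, h, H, num_actions=4):
    epsilon = 1.0 - h / H                            # equation (9)
    if random.random() < epsilon:
        return random.randrange(num_actions)         # explore
    return int(np.argmax(model.predict(state[np.newaxis, :], verbose=0)[0]))  # exploit
```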
V. SIMULATION RESULTS

The performance of the agents is assessed in two parts: initially, the reward trend during the training is analyzed; then, a comparison between the agents and a static traffic light is discussed, with respect to common traffic metrics such as the cumulative wait time and the average wait time per vehicle. One agent is trained using the literature reward function, while the other one adopts the alternative reward function.

Figure 10 shows the learning improvement during the training in the Low-traffic scenario for both agents, in terms of cumulative negative reward, i.e. the magnitude of the actions' negative outcomes during each episode. As can be seen, each agent has learned a sufficiently correct policy in the Low-traffic scenario. As the training proceeds, both agents efficiently explore the environment and learn an adequate approximation of the Q-values; then, towards the end of the training, they try to optimize the Q-values by exploiting the knowledge learned so far. The fact that the agent with the alternative reward function has a better reward curve overall is not strong evidence of a better performance, since having two different reward functions means that different reward values are produced. The performance difference will be discussed later in the static traffic light benchmark.

Fig. 10. Cumulative negative reward of both agents per episode during the training in the Low-traffic scenario.

Figure 11 shows the same training data as Figure 10, but referred to the High-traffic scenario. In this scenario, the agent with the literature reward shows a significantly unstable reward curve, while the other agent's trend is stable. This behavior is caused by the choice of using the waiting time of vehicles as the metric for the reward function, which in situations with long queues causes the acquisition of misleading rewards. In fact, by using the accumulated waiting time as in the alternative reward function, vehicles do not reset their waiting times by simply advancing through the queue. As Figure 11 shows, the alternative reward function produces a more stable policy. In the NS-traffic and EW-traffic scenarios both agents perform well, since these are simpler situations to exploit.

Fig. 11. Cumulative negative reward of both agents per episode during the training in the High-traffic scenario.

In order to properly analyze which agent achieves better performance, a comparison between the agents and a Static Traffic Light (STL) is presented. The STL has the same layout as the agents and it cycles through the 4 phases always in the following order: [NSA − NSLA − EWA − EWLA]. Moreover, every phase has a fixed duration, inspired by real-world static traffic lights [19]: the phases NSA and EWA last 30 seconds, the phases NSLA and EWLA last 15 seconds, and the yellow phase is the same as the agent's, which is 4 seconds.

Table I shows the performance of the two agents compared to the STL. The metrics used to measure the performance difference are the cumulative wait time (cwt) and the average wait time per vehicle (awt/v). The cumulative wait time is defined as the sum of the waiting times of every car during the episode, while the average wait time per vehicle is defined as the average number of seconds spent by a vehicle in a steady position during the episode. These measures are gathered across 5 episodes and then averaged.

TABLE I
AGENTS PERFORMANCE OVERVIEW, PERCENTAGE VARIATIONS COMPARED TO STL (LOWER IS BETTER)

                               Literature reward agent   Alternative reward agent
Low-traffic scenario    cwt            -30%                       -47%
                        awt/v          -29%                       -45%
High-traffic scenario   cwt           +145%                       +26%
                        awt/v         +136%                       +25%
NS-traffic scenario     cwt            -50%                       -62%
                        awt/v          -47%                       -56%
EW-traffic scenario     cwt            -65%                       -65%
                        awt/v          -59%                       -58%

In general, the alternative reward agent achieves a better traffic efficiency compared to the literature agent: this is a consequence of the adoption of a reward function (based on accumulated waiting time) that more properly accounts for waiting times exceeding a single traffic light cycle. Considering just the waiting time starting from the last stop of the vehicles does not sufficiently emphasize the usefulness of keeping longer light cycles, and introduces too many yellow-light situations and phase changes, which are effective only in low or medium traffic situations. The fact that the agent is more effective in low to medium traffic situations suggests that an easy and almost immediate opportunity would be to separately develop agents devoted to different traffic situations, with a sort of controller that monitors the traffic flow and selects the most appropriate agent configuration. This experimentation also suggests that additional improvements would be possible by (i) improving the learning approach to achieve a more stable and faster convergence, and (ii) further improving the reward function to better describe the desired behaviour and to influence the average cycle length, which is more fruitfully short in low-traffic situations and long whenever the traffic conditions worsen.
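The two evaluation metrics defined above, and the way Table I relates them to the STL baseline, can be summarized by the following small sketch; the variable names are illustrative and the computation is an assumption consistent with the definitions given in the text.

```python
def cumulative_wait_time(per_vehicle_wait_times):
    # cwt: sum of the waiting time accumulated by every car during the episode
    return sum(per_vehicle_wait_times)

def average_wait_time_per_vehicle(per_vehicle_wait_times):
    # awt/v: average seconds spent in a steady position per vehicle
    return sum(per_vehicle_wait_times) / max(len(per_vehicle_wait_times), 1)

def percentage_variation(agent_value, stl_value):
    # negative values mean the agent makes vehicles wait less than the STL
    return 100.0 * (agent_value - stl_value) / stl_value
```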
VI. CONCLUSIONS AND FUTURE DEVELOPMENTS

This work has presented an exploration of the plausibility of a RL approach to the problem of traffic lights adaptation and management. The work has employed a realistic and validated traffic simulator to provide an environment in which to train and evaluate a RL agent. Two metrics for the reward of the agent's actions have been investigated, clarifying that a proper description of the application context is just as important as the competence in the proper application of machine learning approaches for achieving good results.

Future works are aimed at further improving the achieved results but also, in the longer term, at investigating the implications of introducing multiple RL agents within a road network, the possibility to coordinate their efforts for achieving global improvements over local ones, and also the implications on the vehicle population, which could perceive the change in the infrastructure and adapt in turn to exploit additional opportunities, potentially negating the achieved improvements due to an additional traffic demand on the improved intersections. It is important to perform analyses along this line of work to understand the plausibility, potential advantages or even unintended negative implications of the introduction in the real world of this form of self-adaptive system.

REFERENCES

[1] R. S. Sutton, A. G. Barto et al., Introduction to Reinforcement Learning. MIT Press, Cambridge, 1998, vol. 135.
[2] M. Behrisch, L. Bieker, J. Erdmann, and D. Krajzewicz, "SUMO – Simulation of Urban MObility: An overview," in SIMUL 2011. ThinkMind, October 2011. [Online]. Available: https://elib.dlr.de/71460/
[3] O. Vinyals, I. Babuschkin, J. Chung, M. Mathieu, M. Jaderberg, W. M. Czarnecki, A. Dudzik, A. Huang, P. Georgiev, R. Powell, T. Ewalds, D. Horgan, M. Kroiss, I. Danihelka, J. Agapiou, J. Oh, V. Dalibard, D. Choi, L. Sifre, Y. Sulsky, S. Vezhnevets, J. Molloy, T. Cai, D. Budden, T. Paine, C. Gulcehre, Z. Wang, T. Pfaff, T. Pohlen, Y. Wu, D. Yogatama, J. Cohen, K. McKinney, O. Smith, T. Schaul, T. Lillicrap, C. Apps, K. Kavukcuoglu, D. Hassabis, and D. Silver, "AlphaStar: Mastering the Real-Time Strategy Game StarCraft II," https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/, 2019.
[4] D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke et al., "QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation," arXiv preprint arXiv:1806.10293, 2018.
[5] K.-L. A. Yau, J. Qadir, H. L. Khoo, M. H. Ling, and P. Komisarczuk, "A survey on reinforcement learning models and algorithms for traffic signal control," ACM Computing Surveys (CSUR), vol. 50, no. 3, p. 34, 2017.
[6] W. Genders and S. Razavi, "Evaluating reinforcement learning state representations for adaptive traffic signal control," Procedia Computer Science, vol. 130, pp. 26–33, 2018.
[7] J. Gao, Y. Shen, J. Liu, M. Ito, and N. Shiratori, "Adaptive traffic signal control: Deep reinforcement learning algorithm with experience replay and target network," arXiv preprint arXiv:1705.02755, 2017.
[8] W. Genders and S. Razavi, "Using a deep reinforcement learning agent for traffic signal control," arXiv preprint arXiv:1611.01142, 2016.
[9] S. S. Mousavi, M. Schukat, and E. Howley, "Traffic light control using deep policy-gradient and value-function-based reinforcement learning," IET Intelligent Transport Systems, vol. 11, no. 7, pp. 417–423, 2017.
[10] L. Li, Y. Lv, and F.-Y. Wang, "Traffic signal timing via deep reinforcement learning," IEEE/CAA Journal of Automatica Sinica, vol. 3, no. 3, pp. 247–254, 2016.
[11] H. Wei, G. Zheng, H. Yao, and Z. Li, "IntelliLight: A reinforcement learning approach for intelligent traffic light control," in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2018, pp. 2496–2505.
[12] D. Krajzewicz, G. Hertkorn, C. Rössel, and P. Wagner, "SUMO (Simulation of Urban MObility) – an open-source traffic simulation," in Proceedings of the 4th Middle East Symposium on Simulation and Modelling (MESM2002), 2002, pp. 183–187.
[13] L. A. Rodegerdts, B. Nevers, B. Robinson, J. Ringert, P. Koonce, J. Bansen, T. Nguyen, J. McGill, D. Stewart, J. Suggett et al., "Signalized intersections: informational guide," Tech. Rep., 2004.
[14] R. Dowling, "Traffic analysis toolbox volume VI: Definition, interpretation, and calculation of traffic analysis tools measures of effectiveness," Tech. Rep., 2007.
[15] C. J. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.
[16] C. J. C. H. Watkins, "Learning from delayed rewards," Ph.D. dissertation, King's College, Cambridge, 1989.
[17] J. N. Tsitsiklis and B. Van Roy, "Analysis of temporal-difference learning with function approximation," in Advances in Neural Information Processing Systems, 1997, pp. 1075–1081.
[18] L.-J. Lin, "Reinforcement learning for robots using neural networks," Ph.D. thesis, Carnegie Mellon University, 1993.
[19] P. Koonce and L. Rodegerdts, "Traffic signal timing manual," United States Federal Highway Administration, Tech. Rep., 2008.