
Learning Cooperative Policy among Self-Driving Vehicles for Relieving Traffic Jams

Shota Ishikawa

Sachiyo Arai

arai@tu.chiba-u.ac.jp
Chiba University, 1-33 Yayoi-cho, Inage-ku, Chiba, Japan 263-8522

We propose a novel driving policy, a form of velocity control, for self-driving vehicles to relieve traffic jams. In previous research, the driving policy was designed empirically for a given traffic situation, so it had to be reconfigured for every traffic situation and every change in traffic. In contrast, our driving policy is learned by a learning agent via reinforcement learning using data collected from the self-driving vehicles during simulation. The learned driving policy is relayed to the self-driving vehicles, which are then guided by it. To test and evaluate the proposed driving policy, we conducted traffic flow simulations with manually driven and self-driving vehicles in several scenarios in which two key parameters, vehicle density and self-driving vehicle penetration rate, are assigned different values. Our findings show that a driving policy for self-driving vehicles does relieve traffic jams in conditions such as (1) a vehicle density of 42 vehicles/km with self-driving vehicles making up 10% of the total traffic, and (2) a vehicle density of 50 vehicles/km with self-driving vehicles making up 70% of the total traffic (at which point traffic flow is nearly optimized). In addition, we found that inter-vehicle communication among self-driving vehicles provides real-time traffic information that relieves traffic jams even more effectively.

The self-driving vehicle is equipped with smart functions, such as adaptive cruise control (ACC) or cooperative adaptive cruise control (CACC), and its penetration into traffic can potentially influence traffic flow. An ACC-equipped vehicle can automatically detect the leading vehicle and control its velocity using sensors and radar. A CACC-equipped vehicle can additionally receive driving information from the preceding vehicle via vehicle-to-vehicle (V2V) communication. Several papers have proposed driving policies for ACC and CACC to relieve traffic jams. For example, Kesting et al. [Kesting et al., 2008] proposed a driving policy for ACC, and Forster et al. [Forster et al., 2014] proposed a driving policy for CACC. By detecting traffic conditions, these vehicles drive flexibly and improve traffic flow stability.

However, designing a driving policy is challenging because the policy must account for any number of traffic situations (road structures, traffic regulations, etc.), consider perturbations induced by manually driven vehicles, and direct and coordinate self-driving vehicles. Designing driving policies requires trial-and-error in simulation, which is labor intensive and time consuming.

We propose a driving policy that is learned by a learner agent via reinforcement learning using data collected from the self-driving vehicles. In the proposed approach, the learner agent interacts simultaneously with all self-driving vehicles in a traffic simulation. The learner agent collects driving data from the self-driving vehicles that obey the driving policy and learns the driving policy from these data. By repeating this interaction, the learner agent acquires the driving policy. To validate the proposed approach, we introduce self-driving vehicles equipped with the driving policy into simulations of traffic jams induced by the perturbation of a manually driven vehicle. Several traffic situations with different vehicle densities and self-driving vehicle penetration rates were used in the simulation. The effectiveness of the driving policy in relieving traffic jams was measured by the increase in traffic flow.

The rest of this paper is organized as follows. In Section 2, we discuss our approach to relieving traffic jams by means of a learner agent that learns and updates the driving policy through data collected from self-driving vehicles. In Section 3, we describe a traffic problem scenario. In Section 4, we propose a framework for learning the driving policy by a learner agent. In Section 5, we describe a Generalized Nagel–Schreckenberg (GNS) model of traffic flow for manually driven and self-driving vehicles. In Section 6, we describe the traffic simulation experiments conducted based on our proposed approach. In Section 7, we conclude this paper with remarks on future work.

2 Related Work

The proposed approach aims at generating a driving policy with data collected from self-driving vehicles and reinforcement learning of the driving policy by a learner agent. This approach is based on works related to traffic flow control in terms of driving policy and reinforcement learning.

To prevent traffic jams caused by the perturbation of a manually driven vehicle, a vehicle must maintain an appropriate gap between itself and the preceding vehicles so that the perturbation does not propagate downstream and eventually be reflected upstream. Research has examined the effect of maintaining an appropriate gap between vehicles on relieving traffic jams when one vehicle, some vehicles, or all vehicles are regulated by a driving policy [Kamal et al., 2014; Forster et al., 2012; Papacharalampous et al., 2015].

The driving policy for a manually driven vehicle may include predicting the traffic situation using inter-vehicle communication and recommending that the driver keep an appropriate distance [Knorr et al., 2012]. Kesting et al. and Forster et al. proposed driving policies that adapt to the traffic situation for ACC and CACC self-driving vehicles, respectively [Kesting et al., 2008; Forster et al., 2014]. Won et al. proposed fuzzy inference systems that effectively capture the dynamics of traffic jams [Won et al., 2017]. Although these approaches are effective ways of relieving traffic jams, designing a driving policy that anticipates various traffic scenarios is difficult. We propose an approach in which a learner agent learns the driving policy, reducing the effort needed to design the policy.

Research on reinforcement learning for traffic flow optimization includes finding policies that dictate how speed limits should be assigned to highway sections [Walraven et al., 2016] and controlling ramp metering devices with Q-learning [Rezaee et al., 2012]. Among more advanced reinforcement learning approaches, multi-objective reinforcement learning has been used to learn a traffic signal policy [Khamis and Gomaa, 2014], and multi-agent reinforcement learning has been used for route planning [Zolfpour-Arokhlo et al., 2014]. In contrast, our approach acquires the driving policy of the self-driving vehicles.

3 Traffic Problem Scenario

Figure 1 shows a traffic scenario involving two road-to-vehicle communication infrastructures (R2Vs), N self-driving vehicles, and M manually driven vehicles. The R2Vs, which share information on the number of self-driving and manually driven vehicles that have passed them, are installed at the edges of a road section of length L. The R2Vs can therefore calculate the traffic density and the self-driving vehicle penetration rate of the road section. The upstream R2V sends the driving policy $\pi$ corresponding to this traffic density and penetration rate to the self-driving vehicles that pass it.

[Figure 1: traffic scenario. Labels: self-driving vehicles, manually driven vehicles, road-to-vehicle communication.]

We propose to relieve traffic jams on the road by instituting a driving policy $\pi$ with which the self-driving vehicle complies; that is, the self-driving vehicle observes a state $s$ and performs an action $a$, expressed as $\pi(s) = a$. The state $s$ is a six-dimensional vector $s = (\phi_{vel}, \phi_{gap}, \phi_{rel}, \phi_{c\_d}, \phi_{c\_v}, \phi_{c\_g})$, where $\phi_{vel}$, $\phi_{gap}$, $\phi_{rel}$, $\phi_{c\_d}$, $\phi_{c\_v}$, and $\phi_{c\_g}$ indicate velocity, gap, relative velocity, communication distance between communication partners, communication partner's velocity, and communication partner's gap, respectively. The action $a$ is velocity control. The state $s$ contains information about the preceding vehicle, and the driving policy is a cooperative policy for relieving traffic jams.

4 Framework for the Reinforcement Learning of Driving Policy

The reinforcement learning framework shown in Figure 2 comprises the traffic environment and the learner agent.

Environment

The traffic environment comprises self-driving and manually driven vehicles on a road characterized by periodic-boundary conditions. Because the number of vehicles is constant, vehicle density and penetration rate are also constant. The learner agent therefore learns the driving policy $\pi$ by interacting with a traffic environment in which vehicle density and penetration rate are constant.

Learner agent

We explain the procedure by which the learner agent updates the driving policy whenever time advances from $t$ to $t + 1$. At time $t$, the learner agent delivers the driving policy $\pi$ to all self-driving vehicles. Following equation (1), the driving policy outputs a randomly selected action with probability $\epsilon$, or the action selected by $\arg\max_{a'} Q(s, a')$ with probability $1 - \epsilon$. Here, the probability $\epsilon$ ($0 \le \epsilon \le 1$) is a parameter used to explore new states, and $Q(s, a)$ is the action value function when the vehicle state and action are $s$ and $a$, respectively. After all vehicles drive, at time $t + 1$, the self-driving vehicles observe the next state $s_{t+1}$ and receive a reward $r_{t+1}$. The learner agent then collects the driving data $\{s_t, a_t, s_{t+1}, r_{t+1}\}$ from each self-driving vehicle.

$$\pi(s) = \begin{cases} \text{randomly selected } a & \text{with probability } \epsilon \\ \arg\max_{a'} Q(s, a') & \text{with probability } 1 - \epsilon \end{cases} \tag{1}$$

[Figure 2: reinforcement learning framework. Labels: manually driven vehicles, self-driving vehicles, learner agent, update.]
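The action selection in equation (1) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the dictionary-based Q-table, the function name `select_action`, the default value 0.0 for unseen entries, and the two-valued action set are assumptions made for the example.

```python
import random

ACTIONS = (0, 1)  # assumed encoding of the two velocity-control actions


def select_action(q_table, state, epsilon, rng=random):
    """Epsilon-greedy choice over a tabular action-value function.

    q_table maps (state, action) pairs to values; missing entries count as 0.0.
    """
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)  # explore: uniformly random action
    # exploit: action with the largest estimated value (ties go to the first)
    return max(ACTIONS, key=lambda a: q_table.get((state, a), 0.0))
```

With `epsilon = 0` the function is purely greedy, which corresponds to a policy that no longer explores.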

Following Algorithm 1, the learner agent updates the driving policy using the dataset $D$, which contains the driving data of the self-driving vehicles $n = 1, \dots, N$. First, the learner agent copies the action value $Q$ into $Q^{new}$. Second, the learner agent updates $Q^{new}$ $N$ times. The index $n$ of the most upstream self-driving vehicle is 1, and this index is incremented by 1 from upstream to downstream. $Q^{new}$ is updated by the equation at line 4 in Algorithm 1. Finally, the learner agent copies $Q^{new}$ back into the action value $Q$.
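The update procedure can be sketched in Python as below. Algorithm 1 is not reproduced in this text, so the update line is assumed to be the standard tabular Q-learning rule with a learning rate `alpha` and a discount factor `gamma`; bootstrapping from the working copy rather than from the old table is also an assumption. The names `update_policy`, `q_table`, and `dataset` are hypothetical.

```python
ACTIONS = (0, 1)  # assumed action encoding


def update_policy(q_table, dataset, alpha=0.01, gamma=0.9):
    """Apply one Q-learning update per collected transition.

    dataset is a list of (s_t, a_t, s_next, r_next) tuples ordered from the
    most upstream self-driving vehicle (n = 1) to the most downstream one.
    """
    q_new = dict(q_table)  # copy Q into Q_new
    for s, a, s_next, r in dataset:
        best_next = max(q_new.get((s_next, b), 0.0) for b in ACTIONS)
        old = q_new.get((s, a), 0.0)
        # standard Q-learning rule; assumed to match line 4 of Algorithm 1
        q_new[(s, a)] = old + alpha * (r + gamma * best_next - old)
    return q_new  # the caller stores Q_new back as Q
```

Returning a new table instead of mutating the input keeps the two copies, $Q$ and $Q^{new}$, distinct during the pass over the dataset.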

The equation at line 4 in Algorithm 1 is based on Q-learning [Sutton and Barto, 1998]. Here, $\alpha$ is the learning rate, and $\gamma$ is the discount factor. The learning rate is a parameter that determines the degree to which the action value is updated, and the discount factor is a parameter that determines the current value of a reward expected to be obtained in the future. The self-driving vehicle receives the reward according to its own state. The learner agent determines the driving policy that maximizes the action value, which is the sum of rewards $r$ discounted by $\gamma$ at each time $t$.

5 Simulation Modeling

In this study, we used a modified Generalized Nagel–Schreckenberg (GNS) model [Ishikawa and Arai, 2015] (the modifications and the driving rules are provided in the appendix). The NaSch model [Nagel and Schreckenberg, 1992], which is the basic cellular automaton for describing traffic flow, can model the perturbation of each vehicle. The GNS is used to model the number of communication partners $n_i^{com}$ and the maximum communication distance $d_i^{com}$.

5.1 Terminology

Figure 3 shows the notation of the GNS. The cellular automaton model reproduces the traffic flow, which is characterized by a series of cells, each of which is either occupied or not occupied by a vehicle. The vehicle index is incremented by one in the driving direction, so vehicle $i + 1$ is ahead of vehicle $i$. $x_i$, $g_i$, $v_i(t)$, and $v_i^{rel}(t)$ indicate the coordinate, gap, velocity, and relative velocity, respectively. The self-driving vehicle $i$ (white car) is able to communicate with the preceding self-driving vehicle $i + 2$ (white car) within the given maximum communication distance $d_i^{com}$. $i^{com}$, $d_i$, $g_{i^{com}}$, and $v_{i^{com}}(t)$ indicate the index of the communication partner, the communication distance, the gap of the communication partner, and the velocity of the communication partner, respectively.

Road model

The GNS reproduces a road section of length $L$. The road section contains a perturbation section of length $l$ ($0 \le l \le L$) in which the manually driven vehicle decelerates with probability $p$. A traffic jam occurs because of the deceleration of the manually driven vehicle within the perturbation section [Sugiyama et al., 2008], which corresponds to a sag or tunnel in the real world.

Vehicle model

The GNS parameters for the manually driven vehicle, ACC self-driving vehicle, and CACC self-driving vehicle are shown in Table 1. The GNS parameters are the probability of perturbation $p$, the probability of the driving policy $p^{pol}$, the number of communication partners $n_i^{com}$, and the maximum communication distance $d_i^{com}$. The manually driven vehicle decelerates with probability $p$ in the perturbation section, but the self-driving vehicle does not. The policy-driven self-driving vehicle decelerates with probability $p^{pol}$, as its velocity control, on any section of the road. The CACC self-driving vehicle has $n_i^{com} \ge 1$ communication partners, and the ACC or CACC self-driving vehicle has a maximum communication distance $d_i^{com} \ge 1$.

Generally, traffic flow analysis focuses on the relationship between traffic flow and vehicle density. Figure 4 shows this relationship as a GNS model diagram comprising traffic flow plots of C+M (CACC self-driving and manually driven vehicles), A+M (ACC self-driving and manually driven vehicles), and M (manually driven vehicles). Traffic flow, represented by the number of vehicles passing through a measurement point per 5 min, is a function of vehicle density, represented by the number of vehicles per km.

[Figure 4: GNS model diagram of traffic flow versus vehicle density. Annotations: free-flow phase, meta-stable phase, jam phase, critical density, experiment in these densities.]

The penetration rate of the self-driving vehicle is 30%. In the free-flow phase of the diagram, there is a positive linear relationship between traffic flow and vehicle density. In the jam phase, there is a negative linear relationship between traffic flow and vehicle density. The intersection of the free-flow and jam phases is called the "critical density." In the meta-stable phase, traffic flow is as high as in the free-flow phase even when vehicle density is greater than the critical density.

For this study, we assume that the effect of relieving a traffic jam is greater the more the traffic flow exceeds the traffic flow of the jam phase. The plots show that the free-flow phase transitions to the jam phase at 40 vehicles/km. We evaluated the effectiveness of the driving policy at vehicle densities ranging from 40 to 60 vehicles/km (red dashed line).

Experimental procedure

A trial of the experiment executes two steps, each consisting of several episodes. Before each episode of the simulation starts, we initialize the road by placing the vehicles randomly and letting them move for 1000 simulation time steps. We then execute a learning step, in which the vehicles move for a total of 1000 episodes (10,000 simulation time steps per episode), followed by an evaluation step in which the vehicles move for 100 episodes. We repeated this experiment 10 times and averaged the results.

Road and vehicle setting

We evaluate the proposed driving policy using a road model under the periodic-boundary condition, which is the same condition as in the learning step. Unlike the open-boundary condition, in which vehicle density may change because of the inflow rate, the periodic-boundary condition keeps vehicle density constant, which allows us to evaluate the effect of the driving policy on velocity without the confounding factor of inflow rate. The experimental conditions for the road and vehicles are as follows:

- one time step $t$ = 1 s
- 1 cell = 5 m
- single-lane road under the periodic-boundary condition
- limited velocity: 5 cell/time = 90 km/h
- road length $L$ = 100 cells
- length of the road where perturbation occurs $l$ = 5 cells
- perturbation probability $p$ = 0.2
- maximum communication distance $d_i^{com}$ = 20
- number of communication partners $n_i^{com}$ = 1
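As a quick check of these units, the conversion from cells and time steps to physical quantities can be worked out in a few lines. The helper names below are hypothetical, and rounding to whole vehicles is an assumption of this sketch; the text does not state how fractional vehicle counts are handled.

```python
CELL_LENGTH_M = 5        # 1 cell = 5 m
TIME_STEP_S = 1          # one time step = 1 s
ROAD_LENGTH_CELLS = 100  # road length L


def cells_per_step_to_kmh(v_cells):
    """Convert a velocity in cells per time step to km/h."""
    return v_cells * CELL_LENGTH_M / TIME_STEP_S * 3.6


def vehicle_counts(density_per_km, penetration_rate):
    """Return (total, self_driving, manual) vehicle counts on the road.

    Rounding to whole vehicles is an assumption of this sketch.
    """
    road_km = ROAD_LENGTH_CELLS * CELL_LENGTH_M / 1000
    total = round(density_per_km * road_km)
    self_driving = round(total * penetration_rate)
    return total, self_driving, total - self_driving
```

Under these assumptions the 100-cell road is 0.5 km long, so a density of 40 vehicles/km with a 30% penetration rate corresponds to 20 vehicles, 6 of them self-driving, and the limited velocity of 5 cells per time step is 25 m/s, or 90 km/h.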

Learning setting

The probability of exploration is $\epsilon = 0.01$ from episodes 1 to 500 and $\epsilon = 0$ from episodes 501 to 1100; the learning rate is $\alpha = 0.01$ from episodes 1 to 1000 and $\alpha = 0$ from episodes 1001 to 1100; and the discount factor is $\gamma = 0.9$.
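The schedule above amounts to two step functions of the episode index. A minimal sketch, with hypothetical function names and a 1-based episode counter assumed:

```python
def exploration_rate(episode):
    """Exploration probability for a 1-based episode index."""
    return 0.01 if episode <= 500 else 0.0


def learning_rate(episode):
    """Learning rate for a 1-based episode index (zero during evaluation)."""
    return 0.01 if episode <= 1000 else 0.0


DISCOUNT_FACTOR = 0.9
```

Setting the learning rate to zero for episodes 1001 to 1100 freezes the action values, so those 100 episodes act as the evaluation step.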

The elements of the six-dimensional state vector $s = (\phi_{vel}, \phi_{gap}, \phi_{rel}, \phi_{c\_d}, \phi_{c\_v}, \phi_{c\_g})$ are as follows:

- $\phi_{vel}$ = {slow, middle, fast}
- $\phi_{gap}$ = {next, short, long, not in}
- $\phi_{rel}$ = {depart, track, approach, not in}
- $\phi_{c\_d}$ = {near, far, disconnected}
- $\phi_{c\_v}$ = {slow, middle, fast, disconnected}
- $\phi_{c\_g}$ = {next, short, long, not in, disconnected}

Table 2 lists the details of the elements.

[Figure: x-axis traffic density [volume / km], 40 to 60. Legend: Meta stable, CwP+M, AwP+M, C+M, A+M, M.]

The action $a$ is $p^{pol} = 0$ or $p^{pol} = 1$.
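To give a sense of the size of the resulting tabular problem, the discrete state features and the two actions can be enumerated directly. This is an illustrative sketch: the dictionary keys, the zero initialization of the table, and the use of the full Cartesian product (some combinations may never occur in simulation) are assumptions.

```python
from itertools import product

STATE_FEATURES = {
    "vel": ("slow", "middle", "fast"),
    "gap": ("next", "short", "long", "not in"),
    "rel": ("depart", "track", "approach", "not in"),
    "c_d": ("near", "far", "disconnected"),
    "c_v": ("slow", "middle", "fast", "disconnected"),
    "c_g": ("next", "short", "long", "not in", "disconnected"),
}
ACTIONS = (0, 1)  # p_pol = 0 or p_pol = 1

# every combination of feature values, as a tuple in the order listed above
STATES = list(product(*STATE_FEATURES.values()))

# tabular action-value function, initialized to zero (an assumption)
Q_TABLE = {(s, a): 0.0 for s in STATES for a in ACTIONS}
```

The product has 3 x 4 x 4 x 3 x 4 x 5 = 2880 states, so the table holds 5760 action values, small enough for plain tabular Q-learning.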

Equation (2) defines the penalty $r$. The self-driving vehicle receives a penalty when any of the following three conditions is satisfied: first, the self-driving vehicle stops; second, the self-driving vehicle has a gap larger than 7 cells; third, the absolute value of the self-driving vehicle's relative velocity is more than 1 cell/time.
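The three penalty conditions translate directly into a small function. In this sketch the penalty value is taken to be -1 (an assumption about the sign, since the text only calls it a penalty), and the function and argument names are hypothetical.

```python
def reward(velocity, gap, relative_velocity):
    """Return -1 when any penalty condition holds, else 0.

    velocity and relative_velocity are in cells per time step, gap in cells.
    """
    stopped = velocity == 0
    gap_too_large = gap > 7
    speed_mismatch = abs(relative_velocity) > 1
    return -1 if (stopped or gap_too_large or speed_mismatch) else 0
```

A vehicle that keeps moving, stays within 7 cells of its leader, and roughly matches the leader's speed therefore receives no penalty.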

$$r_t = \begin{cases} -1 & v_i(t) = 0 \text{ or } g_i > 7 \text{ or } |v_i^{rel}(t)| > 1 \\ 0 & \text{otherwise} \end{cases} \tag{2}$$

6.2 Experimental results

Figure 5 shows a fundamental diagram of the GNS model with a penetration rate of 30%. Plots of the traffic flow for CwP+M (CACC self-driving vehicles with policy and manually driven vehicles) and AwP+M (ACC self-driving vehicles with policy and manually driven vehicles) indicate that both CwP+M and AwP+M relieve the traffic jam up to a vehicle density of 44 vehicles/km. CwP+M traffic flow is greater than AwP+M traffic flow. Note that the meta-stable traffic flow (gray line) is the optimal traffic flow, obtained when all vehicles maintain the limited velocity.

Figure 6 shows a fundamental diagram of the GNS model with a penetration rate of 10%. CwP+M and AwP+M successfully relieve the traffic jam for a vehicle density of 42 vehicles/km.

Figure 7 shows a fundamental diagram of the GNS model with a penetration rate of 70%. CwP+M achieves not only the highest traffic flow among all of the experiments but also near-optimum traffic flow up to a vehicle density of 60 vehicles/km.

Figure 8 shows traffic flow as a function of the penetration rate of self-driving vehicles. The traffic flow of C+M and A+M increases as the penetration rate climbs, but the traffic flow of CwP+M and AwP+M does not, which is to say that increasing the number of self-driving vehicles with a driving policy does not necessarily increase traffic flow.


Measuring the effect of a driving policy for self-driving vehicles on relieving traffic jams

Table 3 shows the traffic volume and the average number of vehicles that stop per time unit in a traffic scenario with a vehicle density of 44 vehicles/km and a 30% penetration rate of self-driving vehicles. The number of stopped vehicles decreases as traffic flow increases, i.e., as traffic jams are relieved. There are two reasons for these results: first, a column of stopped vehicles is prevented from forming, and second, a column of stopped vehicles is dissolved quickly. When a column of stopped vehicles forms because of a traffic jam, vehicles stop and start frequently. When a self-driving vehicle enters the column, it receives the stop penalty expressed in equation (2) as it moves through the column. The learner agent then learns a driving policy that prevents the column from forming and dissolves the column quickly. Consequently, the time during which the column exists on the road decreases, and all vehicles can drive smoothly without stopping.

The effect of inter-vehicle communication among self-driving vehicles on vehicle behavior

The difference between AwP+M and CwP+M is the number of communication partners $n_i^{com}$ and the states $s$.

[Figure: legend Meta stable, CwP+M, AwP+M, C+M, A+M.]

The difference between A+M and C+M is the number of communication partners $n_i^{com}$. A+M and C+M traffic flow increases with the penetration rate of the self-driving vehicle. Owing to the characteristics of the GNS, the self-driving vehicle equipped with CACC has more opportunities to observe the leading vehicle as the penetration rate of the self-driving vehicle increases. If the self-driving vehicle observes the leading vehicle, it avoids needless deceleration.

However, the traffic flow difference between AwP+M and CwP+M is larger than that between A+M and C+M. We therefore consider that the state $s$ affects the relief of traffic jams. In the case of AwP+M, the features $\phi_{c\_d}$, $\phi_{c\_v}$, and $\phi_{c\_g}$ are always "disconnected." In contrast, in the case of CwP+M, these features carry the communication partner's information. Hence, observing the communication partner's information significantly increases the effectiveness of the driving policy for the purpose of relieving traffic jams.

7 Conclusion

We proposed a driving policy for self-driving vehicles to help relieve traffic jams. A learner agent learned the driving policy via reinforcement learning, using data collected from the self-driving vehicles to update the driving policy. This approach to developing a driving policy reduces the amount of time and labor that go toward designing driving policies for various traffic situations or changes in traffic situations. Our traffic flow simulation experiments under periodic-boundary conditions confirm that the use of the driving policy helps relieve traffic jams. Increased penetration rate of self-driving vehicles further reduces traffic jams and enhances traffic flow.

There are two issues that we intend to address in future studies. First, we intend to design a reward function and state features that increase traffic flow at a 100% penetration rate of self-driving vehicles. Second, we plan to evaluate traffic flow using a road under an open-boundary condition, which enables inflow and thereby changes vehicle density.

A Generalized Nagel–Schreckenberg Model

We used a modified GNS model [Ishikawa and Arai, 2015] for modeling traffic flow. In the unmodified version of the model, the number of communication partners $n_i^{com}$ is common to all vehicles. However, to more accurately model traffic flow in which manually driven and self-driving vehicles are both present, the GNS model was modified so that the number of communication partners $n_i^{com}$ and the maximum communication distance $d_i^{com}$ can be set for individual vehicles.

A.1 GNS for vehicle i

At time t, all vehicles determine the next velocity simultaneously using Algorithm A 1. We explain Algorithm A 1 below.

Determine velocity: Vehicle $i$ calculates the head vehicle $i_{head} \leftarrow i + n_i^{com}$, which is the farthest leading vehicle within the maximum number of communication partners, and the maximum communication coordinate $x_{max} \leftarrow x_i(t) + d_i^{com}$ (denoted $x_{head}$ in Algorithm A 1). Following MaxV (Algorithm A 2), vehicle $i$ determines the velocity for the next time increment, $v_i(t + 1)$.

Decelerate: In the case of the manually driven vehicle, for which $d_i^{com}$ is 0, the velocity of vehicle $i$ becomes $v_i(t + 1) \leftarrow \max(0, v_i(t + 1) - 1)$ with perturbation probability $p$ within the perturbation section of the road. For the self-driving vehicle, the velocity of vehicle $i$ becomes $v_i(t + 1) \leftarrow \max(0, v_i(t + 1) - 1)$ with driving policy probability $p^{pol}$.

Move: Vehicle $i$ determines its coordinate at the next time step, $x_i(t + 1) \leftarrow x_i(t) + v_i(t + 1)$.

A.2 MaxV

We explain MaxV, which is shown in Algorithm A 2.

Accelerate: Vehicle $i$ sets its own velocity $v_i(t + 1) \leftarrow \min(v_i(t) + 1, v_{limit})$. If vehicle $i$ has an adequate gap for the velocity $v_i(t + 1)$ after acceleration, that is, $v_i(t + 1) \le g_i$, vehicle $i$ completes MaxV.

Adjust the number of communications: Vehicle $i$ modifies $i_{head}$ in accordance with the number of communication partners $n_{i+1}^{com}$ of the front vehicle $i + 1$. If vehicle $i + 1$ has $n_{i+1}^{com} > 0$ and satisfies $i_{head} - (i + 1) > n_{i+1}^{com}$, then $i_{head}$ becomes $i + 1 + n_{i+1}^{com}$. If vehicle $i + 1$ has $n_{i+1}^{com} = 0$, that is, it has no communication ability, then $i_{head}$ becomes $i$.

Communicate: If the front vehicle $i + 1$ is not ahead of $i_{head}$ ($i + 1 \le i_{head}$) and is within $x_{max}$ ($x_{i+1} \le x_{max}$), then vehicle $i$ calculates the predicted velocity of the front vehicle, $v_{i+1}^{pred}$, by applying MaxV to vehicle $i + 1$. This is the case of communication with the front vehicle $i + 1$.

Maximize velocity: In the case of no communication, vehicle $i$ determines the predicted velocity of the front vehicle as $v_{i+1}^{pred} \leftarrow \max(0, \min(v_{i+1}(t), v_{limit} - 1, g_{i+1} - 1))$, which holds even if the perturbation probability $p = 1$ is taken into account.

Algorithm A 1 GNS for vehicle i
Determine velocity
    $i_{head} \leftarrow i + n_i^{com}$
    $x_{head} \leftarrow x_i(t) + d_i^{com}$
    $v_i(t + 1) \leftarrow MaxV(i, i_{head}, x_{head})$
Decelerate
    $v_i(t + 1) \leftarrow \max(0, v_i(t + 1) - 1)$ with probability $p$ or $p^{pol}$
Move
    $x_i(t + 1) \leftarrow x_i(t) + v_i(t + 1)$

Algorithm A 2 MaxV($i$, $i_{head}$, $x_{head}$)
Accelerate
    $v_i(t + 1) \leftarrow \min(v_i(t) + 1, v_{limit})$
    if $v_i(t + 1) \le g_i$
        return $v_i(t + 1)$
    end if
Adjust the number of communications
    if $n_{i+1}^{com} > 0$ and $i_{head} - (i + 1) > n_{i+1}^{com}$
        $i_{head} \leftarrow i + 1 + n_{i+1}^{com}$
    else if $n_{i+1}^{com} == 0$
        $i_{head} \leftarrow i$
    end if
Communicate
    if $i + 1 \le i_{head}$ and $x_{i+1} \le x_{head}$
        $v_{i+1}^{pred} \leftarrow \max(0, MaxV(i + 1, i_{head}, x_{head}) - 1)$
Maximize velocity
    else
        $v_{i+1}^{pred} \leftarrow \max(0, \min(v_{i+1}(t), v_{limit} - 1, g_{i+1} - 1))$
    end if
    return $\min(v_i(t + 1), v_{i+1}^{pred} + g_i)$

Finally, MaxV returns $\min(v_i(t + 1), v_{i+1}^{pred} + g_i)$ as the maximum velocity.
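The two procedures can be sketched in Python as follows. This is an illustration under simplifying assumptions that are not in the paper: an open road whose vehicles are listed from rear to front (the lead vehicle sees an unbounded gap) instead of the periodic road, a gap counted as the number of empty cells between vehicles, and a single deceleration probability `p_dec` per vehicle applied everywhere (the perturbation section is ignored). The names `Vehicle`, `gap`, `max_v`, and `step` are hypothetical, and the `<=` comparisons are assumed where the printed algorithm is ambiguous.

```python
import random
from dataclasses import dataclass

V_LIMIT = 5  # limited velocity in cells per time step


@dataclass
class Vehicle:
    x: int              # cell coordinate
    v: int              # velocity in cells per time step
    n_com: int = 0      # number of communication partners (0 = none)
    d_com: int = 0      # maximum communication distance in cells
    p_dec: float = 0.0  # deceleration probability (p or p_pol)


def gap(vehicles, i):
    """Empty cells between vehicle i and vehicle i + 1 (lead car: unbounded)."""
    if i + 1 >= len(vehicles):
        return 10**9
    return vehicles[i + 1].x - vehicles[i].x - 1


def max_v(vehicles, i, i_head, x_head):
    """Maximum velocity of vehicle i for the next step (sketch of MaxV)."""
    v_next = min(vehicles[i].v + 1, V_LIMIT)  # accelerate
    g_i = gap(vehicles, i)
    if v_next <= g_i:
        return v_next
    front = vehicles[i + 1]
    # adjust the number of communications to the front vehicle's ability
    if front.n_com > 0 and i_head - (i + 1) > front.n_com:
        i_head = i + 1 + front.n_com
    elif front.n_com == 0:
        i_head = i
    if i + 1 <= i_head and front.x <= x_head:  # communicate
        v_pred = max(0, max_v(vehicles, i + 1, i_head, x_head) - 1)
    else:  # no communication: conservative prediction
        v_pred = max(0, min(front.v, V_LIMIT - 1, gap(vehicles, i + 1) - 1))
    return min(v_next, v_pred + g_i)


def step(vehicles, rng=random):
    """Advance all vehicles by one time step with a simultaneous update."""
    new_v = []
    for i, car in enumerate(vehicles):
        v = max_v(vehicles, i, i + car.n_com, car.x + car.d_com)
        if rng.random() < car.p_dec:  # decelerate
            v = max(0, v - 1)
        new_v.append(v)
    for car, v in zip(vehicles, new_v):  # move
        car.v = v
        car.x += v
```

For example, with `cars = [Vehicle(0, 3), Vehicle(2, 0)]`, calling `step(cars)` slows the rear vehicle to one cell per step so that it does not run into the stopped vehicle ahead. All velocities are computed before any vehicle moves, which mirrors the simultaneous update described for Algorithm A 1.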

[Forster et al., 2012] Markus Forster, Raphaël Frank, Mario Gerla, and Thomas Engel. Improving highway traffic through partial velocity synchronization. In Global Communications Conference (GLOBECOM), 2012 IEEE, pages 5573–5578. IEEE, 2012.

[Forster et al., 2014] Markus Forster, Raphael Frank, Mario Gerla, and Thomas Engel. A cooperative advanced driver assistance system to mitigate vehicular traffic shock waves. In INFOCOM, 2014 Proceedings IEEE, pages 1968–1976. IEEE, 2014.

[Ishikawa and Arai, 2015] Shota Ishikawa and Sachiyo Arai. Evaluating advantage of sharing information among vehicles toward avoiding phantom traffic jam. In Winter Simulation Conference (WSC), 2015, pages 300–311. IEEE, 2015.

[Kamal et al., 2014] Abdus Samad Kamal, Jun-ichi Imura, Tomohisa Hayakawa, Akira Ohata, and Kazuyuki Aihara. Smart driving of a vehicle using model predictive control for improving traffic flow. IEEE Transactions on Intelligent Transportation Systems, 15(2):878–888, 2014.

[Kesting et al., 2008] Arne Kesting, Martin Treiber, Martin Schönhof, and Dirk Helbing. Adaptive cruise control design for active congestion avoidance. Transportation Research Part C: Emerging Technologies, 16(6):668–683, 2008.

[Khamis and Gomaa, 2014] Mohamed A Khamis and Walid Gomaa. Adaptive multi-objective reinforcement learning with hybrid exploration for traffic signal control based on cooperative multi-agent framework. Engineering Applications of Artificial Intelligence, 29:134–151, 2014.

[Knorr et al., 2012] Florian Knorr, Daniel Baselt, Michael Schreckenberg, and Martin Mauve. Reducing traffic jams via vanets. IEEE Transactions on Vehicular Technology, 61(8):3490–3498, 2012.

[Nagel and Schreckenberg, 1992] Kai Nagel and Michael Schreckenberg. A cellular automaton model for freeway traffic. Journal de physique I, 2(12):2221–2229, 1992.

[Papacharalampous et al., 2015] Alexandros E Papacharalampous, Meng Wang, Victor L Knoop, Bernat Goñi Ros, Toshimichi Takahashi, Ichiro Sakata, Bart van Arem, and Serge P Hoogendoorn. Mitigating congestion at sags with adaptive cruise control systems. In Intelligent Transportation Systems (ITSC), 2015 IEEE 18th International Conference on, pages 2451–2457. IEEE, 2015.

[Rezaee et al., 2012] Kasra Rezaee, Baher Abdulhai, and Hossam Abdelgawad. Application of reinforcement learning with continuous state space to ramp metering in real-world conditions. In Intelligent Transportation Systems (ITSC), 2012 15th International IEEE Conference on, pages 1590–1595. IEEE, 2012.

[Sugiyama et al., 2008] Yuki Sugiyama, Minoru Fukui, Macoto Kikuchi, Katsuya Hasebe, Akihiro Nakayama, Katsuhiro Nishinari, Shin-ichi Tadaki, and Satoshi Yukawa. Traffic jams without bottlenecks-experimental evidence for the physical mechanism of the formation of a jam. New journal of physics, 10(3):033001, 2008.

[Sutton and Barto, 1998] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.

[Walraven et al., 2016] Erwin Walraven, Matthijs TJ Spaan, and Bram Bakker. Traffic flow optimization: A reinforcement learning approach. Engineering Applications of Artificial Intelligence, 52:203–212, 2016.

[Won et al., 2017] Myounggyu Won, Taejoon Park, and Sang H Son. Toward mitigating phantom jam using vehicle-to-vehicle communication. IEEE Transactions on Intelligent Transportation Systems, 18(5):1313–1324, 2017.

[Zolfpour-Arokhlo et al., 2014] Mortaza Zolfpour-Arokhlo, Ali Selamat, Siti Zaiton Mohd Hashim, and Hossein Afkhami. Modeling of route planning system based on q value-based dynamic programming with multi-agent reinforcement learning algorithms. Engineering Applications of Artificial Intelligence, 29:163–177, 2014.