<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Learning Cooperative Policy among Self-Driving Vehicles for Relieving Traffic Jams</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shota Ishikawa</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sachiyo Arai</string-name>
          <email>arai@tu.chiba-u.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Chiba University 1-33 Yayoi-cho Inage-ku Chiba</institution>
          ,
          <addr-line>Japan 263-8522</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>We propose a novel driving policy, a velocity-control policy for self-driving vehicles, to relieve traffic jams. In previous research, the driving policy was designed empirically for a given traffic situation, which meant that it had to be reconfigured for every traffic situation and every change in traffic. In contrast, our driving policy is learned by a learner agent via reinforcement learning from data collected from the self-driving vehicles during simulation. The driving policy is relayed to the self-driving vehicles, which, in turn, are guided by it. To evaluate the proposed driving policy, we conducted traffic flow simulations with manually driven and self-driving vehicles in several scenarios in which two key parameters, vehicle density and self-driving vehicle penetration rate, take different values. Our findings show that the driving policy relieves traffic jams in conditions such as (1) a vehicle density of 42 vehicles/km with self-driving vehicles making up 10% of the total traffic and (2) a vehicle density of 50 vehicles/km with a self-driving vehicle penetration rate of 70% (at which point traffic flow is nearly optimal). In addition, we found that intervehicle communication among self-driving vehicles provides real-time traffic information that relieves traffic jams even more effectively.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Self-driving vehicles are equipped with smart functions, such as adaptive cruise control (ACC) or cooperative adaptive cruise control (CACC), and as they penetrate traffic they can potentially influence traffic flow. An ACC-equipped vehicle can automatically detect the leading vehicle and control its velocity using sensor and radar instruments. A CACC-equipped vehicle can additionally receive driving information from the vehicle preceding it via vehicle-to-vehicle (V2V) communication. Several papers have proposed driving policies for ACC and CACC to relieve traffic jams. For example, Kesting et al. [Kesting et al., 2008] proposed a driving policy for ACC, and Forster et al. [Forster et al., 2014] proposed one for CACC. By detecting traffic conditions, these vehicles drive flexibly and improve the stability of traffic flow.</p>
      <p>However, designing a driving policy in current practice is challenging because the policy must account for many traffic situations (road structures, traffic regulations, etc.), cope with perturbations induced by manually driven vehicles, and direct and coordinate self-driving vehicles. Designing driving policies requires trial and error in simulation and is labor intensive and time consuming.</p>
      <p>We propose a driving policy that is learned by a learner agent via reinforcement learning using data collected from the self-driving vehicles. In the proposed approach, a learner agent simultaneously interacts with all the self-driving vehicles in a traffic simulation. The learner agent collects driving data from the self-driving vehicles that obey the current driving policy and learns the driving policy from these data; by repeating this interaction, it acquires the driving policy. To validate the proposed approach, we introduce self-driving vehicles that follow the driving policy into simulations of traffic jams induced by the perturbation of manually driven vehicles. The simulations cover several traffic situations with different vehicle densities and self-driving vehicle penetration rates. The effectiveness of the driving policy in relieving traffic jams is measured by the resulting increase in traffic flow.</p>
      <p>The rest of this paper is organized as follows. In Section 2, we review work related to our approach, in which a learner agent learns and updates the driving policy through data collected from self-driving vehicles. In Section 3, we describe a traffic problem scenario. In Section 4, we propose a framework in which a learner agent learns the driving policy. In Section 5, we describe the Generalized Nagel–Schreckenberg (GNS) model of traffic flow for manually driven and self-driving vehicles. In Section 6, we describe the traffic simulation experiments conducted with the proposed approach. In Section 7, we conclude this paper with remarks on future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>The proposed approach generates a driving policy by reinforcement learning, performed by a learner agent on data collected from self-driving vehicles. It therefore builds on two lines of work on traffic flow control: driving policies and reinforcement learning.</p>
      <p>To prevent traffic jams caused by the perturbation of a manually driven vehicle, a vehicle must maintain an appropriate gap between itself and the preceding vehicle so that the perturbation does not propagate downstream and eventually get reflected upstream. Previous research has studied the effect of maintaining an appropriate gap between vehicles on relieving traffic jams when one vehicle, all vehicles, or some vehicles are regulated by a driving policy [Kamal et al., 2014; Forster et al., 2012; Papacharalampous et al., 2015].</p>
      <p>A driving policy for a manually driven vehicle may predict the traffic situation using inter-vehicle communication and recommend that the driver keep an appropriate distance [Knorr et al., 2012]. Kesting et al. and Forster et al. proposed driving policies that adapt to the traffic situation for ACC and CACC self-driving vehicles, respectively [Kesting et al., 2008; Forster et al., 2014]. Won et al. proposed fuzzy inference systems that effectively capture the dynamics of traffic jams [Won et al., 2017]. Although these approaches are effective in relieving traffic jams, designing a driving policy that anticipates various traffic scenarios is difficult. We propose an approach in which a learner agent learns the driving policy, reducing the effort of designing the policy by hand.</p>
      <p>Research on reinforcement learning for traffic flow optimization includes finding policies that dictate how speed limits should be assigned to highway sections [Walraven et al., 2016] and controlling ramp metering devices with Q-learning [Rezaee et al., 2012]. Among more advanced approaches, multi-objective reinforcement learning has been used to learn traffic signal policies [Khamis and Gomaa, 2014], and multi-agent reinforcement learning has been applied to route planning [Zolfpour-Arokhlo et al., 2014]. In contrast, our approach acquires the driving policy of the self-driving vehicles.</p>
    </sec>
    <sec id="sec-3">
      <title>Traffic Problem Scenario</title>
      <p>Figure 1 shows a traffic scenario involving two road-to-vehicle communication infrastructures (R2Vs), N self-driving vehicles, and M manually driven vehicles. The R2Vs, which share information on the numbers of self-driving and manually driven vehicles that have passed them, are installed at the two ends of a road section of length L. From this information, the R2Vs can calculate the traffic density of the road section and the penetration rate of self-driving vehicles in it. The upstream R2V sends the driving policy corresponding to the calculated density and penetration rate to the self-driving vehicles that pass it.</p>
      <p>[Figure 1: Traffic scenario. Legend: self-driving vehicles, manually driven vehicles, road-to-vehicle communication.]</p>
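      <p>The text does not state exactly how the R2Vs turn their passage counts into a density and a penetration rate. A minimal sketch, assuming that the number of vehicles currently in the section equals the number that passed the upstream R2V minus the number that passed the downstream R2V, counted separately for self-driving and manually driven vehicles, is given below. The function name section_stats and its arguments are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import Tuple


def section_stats(up_sd: int, up_md: int, down_sd: int, down_md: int,
                  length_m: float) -> Tuple[float, float]:
    """Hypothetical R2V computation of (density in vehicles/km, penetration rate).

    Assumes: vehicles in the section = passed upstream R2V - passed downstream R2V,
    counted separately for self-driving (sd) and manually driven (md) vehicles.
    """
    n_sd = up_sd - down_sd          # self-driving vehicles currently in the section
    n_md = up_md - down_md          # manually driven vehicles currently in the section
    total = n_sd + n_md
    density = total / (length_m / 1000.0)
    penetration = n_sd / total if total > 0 else 0.0
    return density, penetration


# 6 self-driving and 16 manually driven vehicles on a 500 m section
print(section_stats(120, 300, 114, 284, 500.0))
```
]]></preformat>
      <p>Under these assumptions, 22 vehicles on 500 m give 44 vehicles/km, one of the densities examined in the experiments.</p>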
      <p>We propose to relieve traffic jams on the road by instituting a driving policy π with which every self-driving vehicle complies; that is, a self-driving vehicle observes a state s and performs the action a = π(s). The state s is a six-dimensional vector s = (ϕ<sub>vel</sub>, ϕ<sub>gap</sub>, ϕ<sub>rel</sub>, ϕ<sub>cd</sub>, ϕ<sub>cv</sub>, ϕ<sub>cg</sub>), whose components indicate the velocity, the gap, the relative velocity, the communication distance to the communication partner, the communication partner's velocity, and the communication partner's gap, respectively. The action a is a velocity control. Because the state s contains information about the preceding vehicles, the driving policy is a cooperative policy for relieving traffic jams.</p>
    </sec>
    <sec id="sec-4">
      <title>Framework for the Reinforcement Learning of Driving Policy</title>
      <p>The reinforcement learning framework shown in Figure 2
comprises the traffic environment and the learner agent.</p>
      <sec id="sec-4-1">
        <title>Environment</title>
        <p>The traffic environment comprises self-driving and manually driven vehicles on a road with periodic-boundary conditions. Because the number of vehicles is constant, the vehicle density and the penetration rate are also constant. The learner agent therefore learns the driving policy π by interacting with a traffic environment in which the vehicle density and the penetration rate are constant.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Learner agent</title>
        <p>We now explain how the learner agent updates the driving policy each time t advances to t + 1. At time t, the learner agent delivers the current driving policy π to all self-driving vehicles. Following equation (1), the driving policy outputs a randomly selected action with probability ϵ, or the action that maximizes Q(s, a′) over a′ with probability 1 − ϵ. Here, ϵ (0 ≤ ϵ ≤ 1) is a parameter for exploring new states, and Q(s, a) is the action value function for vehicle state s and action a. After all vehicles have moved, at time t + 1, each self-driving vehicle observes its next state s<sub>t+1</sub> and receives a reward r<sub>t+1</sub>. The learner agent then collects the driving data (s<sub>t</sub>, a<sub>t</sub>, s<sub>t+1</sub>, r<sub>t+1</sub>) from each self-driving vehicle.
          <disp-formula>
            <label>(1)</label>
            <tex-math>\pi(s) = \begin{cases} a \text{ selected at random} &amp; \text{with probability } \epsilon, \\ \arg\max_{a'} Q(s, a') &amp; \text{with probability } 1 - \epsilon. \end{cases}</tex-math>
          </disp-formula>
        </p>
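        <p>[Figure 2: Reinforcement learning framework: manually driven and self-driving vehicles in the traffic environment, the learner agent, and the policy update.]</p>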
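        <p>The action selection of equation (1) is ordinary ϵ-greedy selection. A minimal sketch, assuming the action values are stored in a dictionary keyed by (state, action), unseen pairs count as 0, and the two actions are p<sub>pol</sub> = 0 and p<sub>pol</sub> = 1 as in the experiments. The function name select_action and the state label used in the example are hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import random
from typing import Dict, Hashable, Tuple

ACTIONS: Tuple[int, ...] = (0, 1)   # p_pol = 0 (keep velocity) or 1 (decelerate)


def select_action(Q: Dict[Tuple[Hashable, int], float], s: Hashable,
                  epsilon: float, rng: random.Random) -> int:
    """Epsilon-greedy driving policy pi(s) in the sense of equation (1)."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)                       # explore
    # exploit: argmax over a' of Q(s, a'); ties go to the first action in ACTIONS
    return max(ACTIONS, key=lambda a: Q.get((s, a), 0.0))


rng = random.Random(0)
Q = {("slow-next", 1): -0.5, ("slow-next", 0): -1.0}
print(select_action(Q, "slow-next", 0.01, rng))
```
]]></preformat>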
        <p>Following Algorithm 1, the learner agent updates the driving policy using the dataset D of driving data collected from the N self-driving vehicles. First, the learner agent copies the action value Q into a working copy Q<sub>new</sub>. Second, the learner agent updates Q<sub>new</sub> N times, once per self-driving vehicle. The index n of the most upstream self-driving vehicle is 1, and the index increases by 1 from upstream to downstream. Q<sub>new</sub> is updated by the equation in line 4 of Algorithm 1. Finally, the learner agent copies Q<sub>new</sub> back into the action value Q.</p>
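        <p>Algorithm 1 itself is not reproduced in this section, so the following is only a sketch of the update as the text describes it: copy Q into Q<sub>new</sub>, apply one Q-learning step per self-driving vehicle in upstream-to-downstream order (n = 1, ..., N), then copy Q<sub>new</sub> back. It assumes a tabular dictionary representation and the standard one-step Q-learning target; the name update_policy is hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import Dict, Hashable, List, Tuple

ACTIONS = (0, 1)                                       # p_pol values
Transition = Tuple[Hashable, int, Hashable, float]     # (s_t, a_t, s_{t+1}, r_{t+1})


def update_policy(Q: Dict[Tuple[Hashable, int], float],
                  dataset: List[Transition],
                  alpha: float, gamma: float) -> Dict[Tuple[Hashable, int], float]:
    """One learner-agent update from the driving data of N self-driving vehicles.

    dataset is ordered from the most upstream vehicle (n = 1) downstream, so each
    update already sees the effect of the updates for vehicles upstream of it.
    """
    q_new = dict(Q)                                    # Q_new <- Q (Q is left unchanged)
    for s, a, s_next, r in dataset:                    # N sequential updates
        best_next = max(q_new.get((s_next, b), 0.0) for b in ACTIONS)
        old = q_new.get((s, a), 0.0)
        q_new[(s, a)] = old + alpha * (r + gamma * best_next - old)
    return q_new                                       # caller sets Q <- Q_new


data = [("jam", 1, "jam", -1.0), ("jam", 1, "free", 0.0)]
print(update_policy({}, data, alpha=0.01, gamma=0.9))
```
]]></preformat>
        <p>Working on a copy means the vehicles keep acting with the previous Q during the time step, while the updates themselves are applied one vehicle at a time.</p>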
        <p>The equation in line 4 of Algorithm 1 is based on Q-learning [Sutton and Barto, 1998]. Here, α is the learning rate and γ is the discount factor. The learning rate determines how much each update changes the action value, and the discount factor determines the present value of rewards expected to be obtained in the future. Each self-driving vehicle receives a reward according to its own state. The learner agent determines the driving policy that maximizes the action value, that is, the expected sum of rewards r discounted by γ at each time t.</p>
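        <p>For reference, the standard one-step Q-learning update [Sutton and Barto, 1998], written for the working copy Q<sub>new</sub> and the transition of self-driving vehicle n, and the discounted return it estimates, are shown below. That line 4 of Algorithm 1 takes exactly this form (in particular, that it bootstraps from Q<sub>new</sub> rather than Q) is an assumption. With α = 0.01, γ = 0.9, an initial value of 0, and a penalty r = −1, a single update gives 0 + 0.01(−1 + 0.9 · 0 − 0) = −0.01.</p>
        <disp-formula>
          <tex-math>Q_{\mathrm{new}}(s^n_t, a^n_t) \leftarrow Q_{\mathrm{new}}(s^n_t, a^n_t) + \alpha \left[ r^n_{t+1} + \gamma \max_{a'} Q_{\mathrm{new}}(s^n_{t+1}, a') - Q_{\mathrm{new}}(s^n_t, a^n_t) \right]</tex-math>
        </disp-formula>
        <disp-formula>
          <tex-math>G_t = \sum_{k=0}^{\infty} \gamma^k \, r_{t+k+1}</tex-math>
        </disp-formula>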
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Simulation Modeling</title>
      <p>In this study, we used a modified Generalized Nagel–Schreckenberg (GNS) model [Ishikawa and Arai, 2015]; the modifications and the driving rules are given in the Appendix. The NaSch model [Nagel and Schreckenberg, 1992], the basic cellular automaton for describing traffic flow, can model the perturbation of each vehicle. The GNS additionally models the number of communication partners n<sub>i</sub><sup>com</sup> and the maximum communication distance d<sub>i</sub><sup>com</sup> of each vehicle i.</p>
      <sec id="sec-5-1">
        <title>Terminology</title>
        <p>Figure 3 shows the notation of the GNS. The cellular automaton model reproduces traffic flow on a series of cells, each of which is either occupied or not occupied by a vehicle. Vehicle i + 1 is ahead of vehicle i, as the vehicle index increases by one in the direction of travel. x<sub>i</sub>, g<sub>i</sub>, v<sub>i</sub>(t), and v<sub>i</sub><sup>rel</sup>(t) indicate the coordinate, gap, velocity, and relative velocity of vehicle i, respectively. The self-driving vehicle i (white car) can communicate with the preceding self-driving vehicle i + 2 (also a white car) within the maximum communication distance d<sub>i</sub><sup>com</sup>. i<sup>com</sup>, d<sub>i</sub>, g<sub>i<sup>com</sup></sub>, and v<sub>i<sup>com</sup></sub>(t) indicate the index of the communication partner, the communication distance, the gap of the communication partner, and the velocity of the communication partner, respectively.</p>
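        <p>The quantities in Figure 3 can be computed from cell positions on the ring road. A minimal sketch, assuming that the gap is the number of empty cells to the vehicle ahead, that the relative velocity is the leader's velocity minus the follower's, and that the communication partner is the nearest self-driving vehicle ahead within d<sub>i</sub><sup>com</sup> cells (as for vehicles i and i + 2 in Figure 3). These conventions and the function names are assumptions.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import Dict, List, Optional


def gap(xs: List[int], i: int, L: int) -> int:
    """Empty cells between vehicle i and vehicle i + 1 on a ring of L cells."""
    n = len(xs)
    return (xs[(i + 1) % n] - xs[i] - 1) % L


def observe(xs: List[int], vs: List[int], self_driving: List[bool],
            i: int, L: int, d_com: int) -> Dict[str, Optional[int]]:
    """Raw (undiscretized) observation of vehicle i; xs sorted ascending."""
    n = len(xs)
    obs: Dict[str, Optional[int]] = {
        "v": vs[i],
        "gap": gap(xs, i, L),
        "v_rel": vs[(i + 1) % n] - vs[i],   # assumed sign: leader minus follower
        "d": None, "v_com": None, "gap_com": None,   # None = disconnected
    }
    for k in range(1, n):                  # scan ahead for a communication partner
        j = (i + k) % n
        dist = (xs[j] - xs[i]) % L
        if dist > d_com:                   # beyond the maximum communication distance
            break
        if self_driving[j]:
            obs.update(d=dist, v_com=vs[j], gap_com=gap(xs, j, L))
            break
    return obs


xs, vs, sd = [0, 3, 10, 50], [2, 1, 3, 0], [True, False, True, False]
print(observe(xs, vs, sd, 0, 100, 20))
```
]]></preformat>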
      </sec>
      <sec id="sec-5-2">
        <title>Road model</title>
        <p>The GNS reproduces a road section of length L. The road section contains a perturbation section of length l (0 ≤ l ≤ L) in which manually driven vehicles decelerate with probability p. Traffic jams arise from the deceleration of manually driven vehicles within the perturbation section [Sugiyama et al., 2008], which corresponds to a sag or a tunnel in a real-world environment.</p>
      </sec>
      <sec id="sec-5-3">
        <title>Vehicle model</title>
        <p>The GNS parameters for the manually driven vehicle, the ACC self-driving vehicle, and the CACC self-driving vehicle are shown in Table 1. The GNS parameters are the perturbation probability p, the driving-policy probability p<sub>pol</sub>, the number of communication partners n<sub>i</sub><sup>com</sup>, and the maximum communication distance d<sub>i</sub><sup>com</sup>. A manually driven vehicle decelerates with probability p in the perturbation section, whereas a self-driving vehicle does not. A self-driving vehicle that follows the driving policy decelerates with probability p<sub>pol</sub> as velocity control on any section of the road. The CACC self-driving vehicle has n<sub>i</sub><sup>com</sup> ≥ 1 communication partners, and both ACC and CACC self-driving vehicles have a maximum communication distance d<sub>i</sub><sup>com</sup> ≥ 1.</p>
        <p>[Figure 4: Fundamental diagram (traffic flow versus vehicle density) showing the free-flow phase, the meta-stable phase, the critical density, the jam phase, and the range of densities used in the experiments.]</p>
        <p>Traffic flow analysis generally focuses on the relationship between traffic flow and vehicle density. Figure 4 shows such a fundamental diagram for the GNS model, with traffic flow plots for C+M (CACC self-driving and manually driven vehicles), A+M (ACC self-driving and manually driven vehicles), and M (manually driven vehicles only). Traffic flow, the number of vehicles passing a measurement point per 5 min, is plotted as a function of vehicle density, the number of vehicles per km.</p>
        <p>In Figure 4, the penetration rate of self-driving vehicles is 30%. The diagram shows a free-flow phase, in which there is a positive linear relationship between traffic flow and vehicle density, and a jam phase, in which the relationship is negative and linear. The intersection of the free-flow and jam phases is called the critical density. In the meta-stable phase, traffic flow remains as high as in the free-flow phase even when the vehicle density exceeds the critical density.</p>
        <p>In this study, we regard the effect of relieving a traffic jam as greater the more the traffic flow exceeds that of the jam phase. The plots show that the free-flow phase transitions to the jam phase at 40 vehicles/km. We therefore evaluated the effectiveness of the driving policy for vehicle densities from 40 to 60 vehicles/km (red dashed line).</p>
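        <p>Traffic flow in the fundamental diagrams is the number of vehicles passing a measurement point per 5 min. A minimal sketch of that measurement on a ring road, assuming a vehicle passes the point when the point lies among the cells it enters during one time step (at most 5 cells, so it cannot pass twice in one step). The names count_passes and flow_per_5min are hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import Iterable, List, Tuple

Move = Tuple[int, int]   # (x_i(t), v_i(t + 1)) for one vehicle in one time step


def count_passes(moves: Iterable[Move], point: int, L: int) -> int:
    """Vehicles whose move from x to x + v (mod L) enters or crosses cell `point`."""
    return sum(1 for x, v in moves if (point - x - 1) % L < v)


def flow_per_5min(history: List[List[Move]], point: int, L: int,
                  dt_s: float = 1.0) -> float:
    """Average number of passes per 5 min (300 s) over the recorded time steps."""
    steps_per_5min = 300.0 / dt_s
    total = sum(count_passes(moves, point, L) for moves in history)
    return total * steps_per_5min / len(history)


# one vehicle circling a 100-cell ring at 5 cells/step for 300 steps
history = [[((5 * t) % 100, 5)] for t in range(300)]
print(flow_per_5min(history, point=50, L=100))
```
]]></preformat>
        <p>The example yields 15.0, consistent with the relation flow = density × velocity: one vehicle on 0.5 km is 2 vehicles/km, and at 90 km/h that gives 180 vehicles/h, i.e., 15 vehicles per 5 min.</p>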
      </sec>
      <sec id="sec-5-4">
        <title>Experimental procedure</title>
        <p>Each trial of the experiment consists of two steps, each of which consists of a number of episodes. Before each episode starts, we initialize the road by placing the vehicles randomly and letting them move for 1000 simulation time steps. We then execute a learning step, in which the vehicles move for a total of 1000 episodes (10,000 simulation time steps per episode), followed by an evaluation step in which the vehicles move for 100 episodes. We repeated this experiment 10 times and averaged the results.</p>
      </sec>
      <sec id="sec-5-5">
        <title>Road and vehicle setting</title>
        <p>We evaluate the proposed driving policy using a road model under the periodic-boundary condition, the same condition as in the learning step. Unlike the open-boundary condition, in which the vehicle density may change with the inflow rate, the periodic-boundary condition keeps the vehicle density constant, so the effect of the driving policy on velocity can be evaluated without the confounding factor of inflow rate. The experimental conditions for the road and vehicles are as follows:
          <list list-type="bullet">
            <list-item><p>time step Δt = 1 s</p></list-item>
            <list-item><p>1 cell = 5 m</p></list-item>
            <list-item><p>single-lane road under the periodic-boundary condition</p></list-item>
            <list-item><p>velocity limit v<sub>limit</sub> = 5 cells/time step (90 km/h)</p></list-item>
            <list-item><p>road length L = 100 cells</p></list-item>
            <list-item><p>length of the perturbation section l = 5 cells</p></list-item>
            <list-item><p>perturbation probability p = 0.2</p></list-item>
            <list-item><p>maximum communication distance d<sub>i</sub><sup>com</sup> = 20</p></list-item>
            <list-item><p>number of communication partners n<sub>i</sub><sup>com</sup> = 1</p></list-item>
          </list>
        </p>
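        <p>The unit conversions in this list can be checked mechanically. The sketch below collects the settings in a hypothetical SimSettings class and derives the velocity limit in km/h, the road length in km, and the number of vehicles needed on the ring for a given density; treating d<sub>i</sub><sup>com</sup> as a distance in cells is an assumption.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimSettings:
    dt_s: float = 1.0          # time step (s)
    cell_m: float = 5.0        # length of one cell (m)
    v_limit: int = 5           # velocity limit (cells per time step)
    road_cells: int = 100      # road length L (cells)
    pert_cells: int = 5        # perturbation section length l (cells)
    p: float = 0.2             # perturbation probability
    d_com: int = 20            # maximum communication distance (assumed: cells)
    n_com: int = 1             # number of communication partners

    def v_limit_kmh(self) -> float:
        """Velocity limit in km/h: cells/step * m/cell / (s/step) * 3.6."""
        return self.v_limit * self.cell_m / self.dt_s * 3.6

    def road_km(self) -> float:
        return self.road_cells * self.cell_m / 1000.0

    def vehicles_for_density(self, per_km: float) -> int:
        """Number of vehicles on the ring that gives the requested density."""
        return round(per_km * self.road_km())


s = SimSettings()
print(s.v_limit_kmh(), s.road_km(), s.vehicles_for_density(44))
```
]]></preformat>
        <p>Under these settings the 100-cell ring is 0.5 km long, so the densities of 40 to 60 vehicles/km studied here correspond to 20 to 30 vehicles, and a communication range of 20 cells would be 100 m.</p>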
      </sec>
      <sec id="sec-5-6">
        <title>Learning setting</title>
        <p>The exploration probability is ϵ = 0.01 for episodes 1 to 500 and ϵ = 0 for episodes 501 to 1100; the learning rate is α = 0.01 for episodes 1 to 1000 and α = 0 for episodes 1001 to 1100; and the discount factor is γ = 0.9.</p>
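        <p>The parameter schedule above can be written as a small lookup. A minimal sketch with a hypothetical function name, taking a 1-based episode index.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import Tuple


def learning_schedule(episode: int) -> Tuple[float, float, float]:
    """(epsilon, alpha, gamma) for a 1-based episode index, per the learning setting."""
    if not 1 <= episode <= 1100:
        raise ValueError("episode must be in 1..1100")
    epsilon = 0.01 if episode <= 500 else 0.0    # explore only in the first 500 episodes
    alpha = 0.01 if episode <= 1000 else 0.0     # learning frozen in the last 100 episodes
    gamma = 0.9
    return epsilon, alpha, gamma


for ep in (1, 500, 501, 1000, 1001, 1100):
    print(ep, learning_schedule(ep))
```
]]></preformat>
        <p>Episodes 1001 to 1100, where both α and ϵ are 0, coincide with the 100-episode evaluation step that follows the 1000 learning episodes, so the policy is evaluated greedily and without further updates.</p>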
        <p>The elements of the six-dimensional state vector s = (ϕ<sub>vel</sub>, ϕ<sub>gap</sub>, ϕ<sub>rel</sub>, ϕ<sub>cd</sub>, ϕ<sub>cv</sub>, ϕ<sub>cg</sub>) take the following values:
          <list list-type="bullet">
            <list-item><p>ϕ<sub>vel</sub> ∈ {slow, middle, fast}</p></list-item>
            <list-item><p>ϕ<sub>gap</sub> ∈ {next, short, long, not in}</p></list-item>
            <list-item><p>ϕ<sub>rel</sub> ∈ {depart, track, approach, not in}</p></list-item>
            <list-item><p>ϕ<sub>cd</sub> ∈ {near, far, disconnected}</p></list-item>
            <list-item><p>ϕ<sub>cv</sub> ∈ {slow, middle, fast, disconnected}</p></list-item>
            <list-item><p>ϕ<sub>cg</sub> ∈ {next, short, long, not in, disconnected}</p></list-item>
          </list>
        Table 2 lists the details of the elements.</p>
        <p>[Figure: traffic flow versus traffic density (vehicles/km, 40 to 60) for the meta-stable line, CwP+M, AwP+M, C+M, A+M, and M.]</p>
        <p>The action a sets p<sub>pol</sub> = 0 (no deceleration) or p<sub>pol</sub> = 1 (deceleration by one cell per time step).</p>
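        <p>Table 2 is not reproduced here, so the sketch below does not map raw velocities or gaps to labels; it only enumerates the discrete values listed above and packs a labeled state into a single index, the kind of encoding a tabular action value function needs. The mixed-radix encoding and all names are assumptions.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import Dict, List

# Discrete values of each state element, in the order of s
FEATURES: Dict[str, List[str]] = {
    "vel": ["slow", "middle", "fast"],
    "gap": ["next", "short", "long", "not in"],
    "rel": ["depart", "track", "approach", "not in"],
    "c_d": ["near", "far", "disconnected"],
    "c_v": ["slow", "middle", "fast", "disconnected"],
    "c_g": ["next", "short", "long", "not in", "disconnected"],
}
ACTIONS = (0, 1)   # p_pol


def encode(state: Dict[str, str]) -> int:
    """Mixed-radix index of a labeled state (hypothetical encoding)."""
    index = 0
    for name, values in FEATURES.items():     # dicts keep insertion order
        index = index * len(values) + values.index(state[name])
    return index


n_states = 1
for values in FEATURES.values():
    n_states *= len(values)

acc_state = {"vel": "fast", "gap": "short", "rel": "track",
             "c_d": "disconnected", "c_v": "disconnected", "c_g": "disconnected"}
print(n_states, n_states * len(ACTIONS), encode(acc_state))
```
]]></preformat>
        <p>The table has 3 × 4 × 4 × 3 × 4 × 5 = 2880 states, or 5760 state-action pairs. An ACC vehicle, whose communication features are always "disconnected", can only reach 3 × 4 × 4 = 48 of these states.</p>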
        <p>Equation (2) defines the reward r as a penalty. A self-driving vehicle receives a penalty when any of the following three conditions is satisfied: (1) the self-driving vehicle stops; (2) its gap is larger than 7 cells; or (3) the absolute value of its relative velocity exceeds 1 cell per time step.</p>
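        <p>
          <disp-formula>
            <label>(2)</label>
            <tex-math>r_t = \begin{cases} -1 &amp; \text{if } v_i(t) = 0 \text{ or } g_i &gt; 7 \text{ or } |v^{\mathrm{rel}}_i(t)| &gt; 1, \\ 0 &amp; \text{otherwise.} \end{cases}</tex-math>
          </disp-formula>
        </p>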
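        <p>Equation (2) translates directly into code. A minimal sketch, where v, gap, and v_rel are the vehicle's velocity, gap, and relative velocity in cells and cells per time step; the function name is hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def reward(v: int, gap: int, v_rel: int) -> int:
    """Penalty of equation (2): -1 if stopped, gap > 7 cells, or |v_rel| > 1; else 0."""
    if v == 0 or gap > 7 or abs(v_rel) > 1:
        return -1
    return 0


for args in [(0, 3, 0), (3, 8, 0), (3, 3, -2), (3, 7, 1)]:
    print(args, reward(*args))
```
]]></preformat>
        <p>Because the reward is never positive, maximizing the discounted return amounts to avoiding penalized states: stopping, leaving large gaps, and large speed differences with the vehicle ahead.</p>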
      </sec>
      <sec id="sec-5-7">
        <title>Experimental results</title>
        <p>Figure 5 shows the fundamental diagram of the GNS model with a penetration rate of 30%. The traffic flow plots for CwP+M (CACC self-driving vehicles with the policy and manually driven vehicles) and AwP+M (ACC self-driving vehicles with the policy and manually driven vehicles) indicate that both relieve the traffic jam up to a vehicle density of 44 vehicles/km, and that the traffic flow of CwP+M is greater than that of AwP+M. Note that the meta-stable traffic flow (gray line) is the optimal traffic flow, attained when all vehicles maintain the velocity limit.</p>
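        <p>As a rough check on the scale of the gray line, assume the usual relation flow = density × velocity and that every vehicle moves at the velocity limit of 90 km/h. Writing ρ for the vehicle density in vehicles/km, the flow per 5 min is then as follows, i.e., about 300, 375, and 450 vehicles per 5 min at 40, 50, and 60 vehicles/km. This is a back-of-the-envelope estimate under those assumptions, not a value read from Figure 5.</p>
        <disp-formula>
          <tex-math>q_{\mathrm{opt}}(\rho) = \rho \times 90\,\mathrm{km/h} \times \frac{5}{60}\,\mathrm{h} = 7.5\,\rho \quad \text{vehicles per 5 min}</tex-math>
        </disp-formula>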
        <p>Figure 6 shows a fundamental diagram of the GNS model
with a penetration rate of 10%. CwP+M and AwP+M
successfully relieve the traffic jam for a vehicle density of 42
vehicles/km.</p>
        <p>Figure 7 shows the fundamental diagram of the GNS model with a penetration rate of 70%. CwP+M achieves the highest traffic flow among all of the experiments, close to the optimum, for vehicle densities up to 60 vehicles/km.</p>
        <p>Figure 8 shows traffic flow as a function of the penetration
rate of self-driving vehicles. The traffic flow of C+M and
A+M increases as the penetration rate climbs, but the traffic
flow of CwP+M and AwP+M does not, which is to say that
increasing the number of self-driving vehicles with a driving
policy does not necessarily increase traffic flow.</p>
      </sec>
      <sec id="sec-5-8">
        <title>Measuring the effect of a driving policy for self-driving vehicles on relieving traffic jams</title>
        <p>Table 3 shows the traffic volume and the average number of vehicles that stop per time step in a traffic scenario with a vehicle density of 44 vehicles/km and a 30% penetration rate of self-driving vehicles. The number of stopped vehicles decreases as traffic flow increases, i.e., as traffic jams are relieved. There are two reasons for these results: first, the formation of a column of stopped vehicles is prevented, and second, a column of stopped vehicles that does form dissolves quickly. When a column of stopped vehicles forms because of a traffic jam, vehicles stop and start frequently. A self-driving vehicle that enters the column receives the stop penalty of equation (2) as it moves through the column. The learner agent therefore learns a driving policy that prevents the column from forming and dissolves it quickly. Consequently, the time during which a column exists on the road decreases, and all vehicles can drive smoothly without stopping.</p>
      </sec>
      <sec id="sec-5-9">
        <title>The effect of inter-vehicle communication among self-driving vehicles on vehicle behavior</title>
        <p>The difference between AwP+M and CwP+M lies in the number of communication partners n<sub>i</sub><sup>com</sup> and, consequently, in the states s that the vehicles observe.</p>
        <p>The difference between A+M and C+M is the number of communication partners n<sub>i</sub><sup>com</sup>. The traffic flow of both A+M and C+M increases with the penetration rate of self-driving vehicles. Owing to the characteristics of the GNS, a CACC self-driving vehicle has more opportunities to observe the leading vehicle as the penetration rate of self-driving vehicles increases. When a self-driving vehicle observes the leading vehicle, it avoids needless deceleration.</p>
        <p>However, the difference in traffic flow between AwP+M and CwP+M is larger than that between A+M and C+M. We therefore consider that the states s affect how well traffic jams are relieved. For AwP+M, the features ϕ<sub>cd</sub>, ϕ<sub>cv</sub>, and ϕ<sub>cg</sub> are always "disconnected". In contrast, for CwP+M these features carry the communication partner's information. Hence, observing the communication partner's information significantly increases the effectiveness of the driving policy in relieving traffic jams.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>We proposed a driving policy for self-driving vehicles to help relieve traffic jams. A learner agent learned the driving policy via reinforcement learning with data collected from the self-driving vehicles, and these data were in turn used to update the driving policy. This approach reduces the time and labor spent designing driving policies for various traffic situations or changes in traffic situations. Our traffic flow simulation experiments under periodic-boundary conditions confirm that the driving policy helps relieve traffic jams. Increasing the penetration rate of self-driving vehicles can further reduce traffic jams, although, as Figure 8 shows, it does not necessarily increase traffic flow when the vehicles follow the learned driving policy.</p>
      <p>There are two issues that we intend to address in future studies. First, we intend to design a reward function and state features that increase traffic flow at a 100% penetration rate of self-driving vehicles. Second, we plan to evaluate traffic flow on a road under an open-boundary condition, which allows inflow and thereby changes the vehicle density.</p>
    </sec>
    <sec id="sec-7">
      <title>Generalized Nagel–Schreckenberg Model</title>
      <p>We used a modified GNS model [Ishikawa and Arai, 2015] to model traffic flow. In the unmodified model, the number of communication partners n<sup>com</sup> is common to all vehicles. To model traffic flow with both manually driven and self-driving vehicles more accurately, we modified the GNS model so that the number of communication partners n<sub>i</sub><sup>com</sup> and the maximum communication distance d<sub>i</sub><sup>com</sup> can be set for each vehicle i.</p>
    </sec>
    <sec id="sec-8">
      <title>GNS for vehicle i</title>
      <p>At time t, all vehicles determine their next velocities simultaneously using Algorithm A 1, whose steps are explained below.</p>
      <p>Determine velocity: Vehicle i calculates i<sub>head</sub> ← i + n<sub>i</sub><sup>com</sup>, the index of the farthest vehicle ahead that it can reach through communication, and x<sub>head</sub> ← x<sub>i</sub>(t) + d<sub>i</sub><sup>com</sup>, the farthest position it can reach within the maximum communication distance. Following Algorithm A 2 (MaxV), vehicle i then determines its velocity for the next time step, v<sub>i</sub>(t + 1).</p>
      <p>Decelerate: For a manually driven vehicle, for which d<sub>i</sub><sup>com</sup> = 0, the velocity becomes v<sub>i</sub>(t + 1) ← max(0, v<sub>i</sub>(t + 1) − 1) with perturbation probability p within the perturbation section of the road. For a self-driving vehicle, the velocity becomes v<sub>i</sub>(t + 1) ← max(0, v<sub>i</sub>(t + 1) − 1) with driving-policy probability p<sub>pol</sub>.</p>
      <p>Move: Vehicle i determines its position at the next time step, x<sub>i</sub>(t + 1) ← x<sub>i</sub>(t) + v<sub>i</sub>(t + 1).</p>
      <sec id="sec-8-1">
        <title>MaxV</title>
        <p>We now explain MaxV, shown in Algorithm A 2.</p>
        <p>Accelerate: Vehicle i sets its velocity v<sub>i</sub>(t + 1) ← min(v<sub>i</sub>(t) + 1, v<sub>limit</sub>). If v<sub>i</sub>(t + 1) ≤ g<sub>i</sub>, i.e., vehicle i has an adequate gap for the accelerated velocity, MaxV ends and returns v<sub>i</sub>(t + 1).</p>
        <p>Adjust the number of communications: Vehicle i modifies i<sub>head</sub> according to the number of communication partners n<sub>i+1</sub><sup>com</sup> of the front vehicle i + 1. If n<sub>i+1</sub><sup>com</sup> &gt; 0 and i<sub>head</sub> − (i + 1) &gt; n<sub>i+1</sub><sup>com</sup>, then i<sub>head</sub> becomes i + 1 + n<sub>i+1</sub><sup>com</sup>. If n<sub>i+1</sub><sup>com</sup> = 0, i.e., vehicle i + 1 has no communication ability, then i<sub>head</sub> becomes i.</p>
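        <p>Communicate: If the front vehicle i + 1 satisfies i + 1 ≤ i<sub>head</sub> and x<sub>i+1</sub> ≤ x<sub>head</sub>, that is, it lies within both the communication chain and the maximum communication distance, vehicle i communicates with it and predicts its velocity by applying MaxV recursively: v<sub>i+1</sub><sup>pred</sup> ← max(0, MaxV(i + 1, i<sub>head</sub>, x<sub>head</sub>) − 1).</p>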
        <p>Maximize velocity: If vehicle i does not communicate with vehicle i + 1, it predicts the front vehicle's velocity as v<sub>i+1</sub><sup>pred</sup> ← max(0, min(v<sub>i+1</sub>(t), v<sub>limit</sub> − 1, g<sub>i+1</sub> − 1)), a lower bound that holds even if the front vehicle decelerates with perturbation probability p = 1.</p>
      </sec>
      <sec id="sec-8-2">
        <title>Algorithm A 1: GNS for vehicle i</title>
        <preformat>Determine velocity
1: i<sub>head</sub> ← i + n<sub>i</sub><sup>com</sup>
2: x<sub>head</sub> ← x<sub>i</sub>(t) + d<sub>i</sub><sup>com</sup>
3: v<sub>i</sub>(t + 1) ← MaxV(i, i<sub>head</sub>, x<sub>head</sub>)
Decelerate
4: v<sub>i</sub>(t + 1) ← max(0, v<sub>i</sub>(t + 1) − 1) with probability p or p<sub>pol</sub>
Move
5: x<sub>i</sub>(t + 1) ← x<sub>i</sub>(t) + v<sub>i</sub>(t + 1)</preformat>
      </sec>
    </sec>
    <sec id="sec-9">
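      <title>Algorithm A 2: MaxV(i, i<sub>head</sub>, x<sub>head</sub>)</title>
      <sec id="sec-9-1">
        <title>Pseudocode</title>
        <preformat>Accelerate
1: v<sub>i</sub>(t + 1) ← min(v<sub>i</sub>(t) + 1, v<sub>limit</sub>)
2: if v<sub>i</sub>(t + 1) ≤ g<sub>i</sub> then
3:   return v<sub>i</sub>(t + 1)
4: end if
Adjust the number of communications
5: if n<sub>i+1</sub><sup>com</sup> &gt; 0 and i<sub>head</sub> − (i + 1) &gt; n<sub>i+1</sub><sup>com</sup> then
6:   i<sub>head</sub> ← i + 1 + n<sub>i+1</sub><sup>com</sup>
7: else if n<sub>i+1</sub><sup>com</sup> = 0 then
8:   i<sub>head</sub> ← i
9: end if
Communicate
10: if i + 1 ≤ i<sub>head</sub> and x<sub>i+1</sub> ≤ x<sub>head</sub> then
11:   v<sub>i+1</sub><sup>pred</sup> ← max(0, MaxV(i + 1, i<sub>head</sub>, x<sub>head</sub>) − 1)
Maximize velocity
12: else
13:   v<sub>i+1</sub><sup>pred</sup> ← max(0, min(v<sub>i+1</sub>(t), v<sub>limit</sub> − 1, g<sub>i+1</sub> − 1))
14: end if
15: return min(v<sub>i</sub>(t + 1), v<sub>i+1</sub><sup>pred</sup> + g<sub>i</sub>)</preformat>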
        <p>Finally, MaxV returns min(v<sub>i</sub>(t + 1), v<sub>i+1</sub><sup>pred</sup> + g<sub>i</sub>) as the maximum velocity.</p>
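        <p>To make Algorithms A 1 and A 2 concrete, here is a minimal Python sketch of one GNS time step on a ring road, following the pseudocode above. It assumes vehicles are kept sorted by cell index, that positions and gaps wrap around the ring of L cells, that the perturbation section is the l cells starting at a hypothetical cell pert_start, and that the perturbation test uses the position at time t. The class and function names are hypothetical, and this is not the authors' implementation.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Vehicle:
    x: int                      # cell index on the ring, 0 <= x < L
    v: int = 0                  # velocity v_i(t), cells per time step
    n_com: int = 0              # number of communication partners n_i^com
    d_com: int = 0              # maximum communication distance d_i^com (cells)
    self_driving: bool = False
    p_pol: int = 0              # policy action of a self-driving vehicle (0 or 1)


def pos(vs: List[Vehicle], k: int, L: int) -> int:
    """Unwrapped position of vehicle k; k >= len(vs) refers to the next lap."""
    n = len(vs)
    return vs[k % n].x + (k // n) * L


def gap(vs: List[Vehicle], k: int, L: int) -> int:
    """Empty cells between vehicle k and the vehicle ahead of it."""
    return pos(vs, k + 1, L) - pos(vs, k, L) - 1


def max_v(vs: List[Vehicle], k: int, i_head: int, x_head: int,
          L: int, v_limit: int) -> int:
    """MaxV of Algorithm A 2 for vehicle k (k may exceed len(vs) - 1)."""
    n = len(vs)
    me, front = vs[k % n], vs[(k + 1) % n]
    g = gap(vs, k, L)
    v_next = min(me.v + 1, v_limit)                          # line 1: accelerate
    if v_next <= g:                                          # lines 2-4
        return v_next
    if front.n_com > 0 and i_head - (k + 1) > front.n_com:   # lines 5-6
        i_head = k + 1 + front.n_com
    elif front.n_com == 0:                                   # lines 7-8
        i_head = k
    if k + 1 <= i_head and pos(vs, k + 1, L) <= x_head:      # lines 10-11: communicate
        v_pred = max(0, max_v(vs, k + 1, i_head, x_head, L, v_limit) - 1)
    else:                                                    # line 13: worst case
        v_pred = max(0, min(front.v, v_limit - 1, gap(vs, k + 1, L) - 1))
    return min(v_next, v_pred + g)                           # line 15


def gns_step(vs: List[Vehicle], L: int, v_limit: int, p: float,
             pert_start: int, pert_len: int, rng: random.Random) -> None:
    """Algorithm A 1 for all vehicles simultaneously; vs must be sorted by x."""
    new_v = []
    for i, veh in enumerate(vs):
        v = max_v(vs, i, i + veh.n_com, veh.x + veh.d_com, L, v_limit)  # lines 1-3
        if veh.self_driving:
            prob = veh.p_pol                                 # decelerate with p_pol
        else:
            in_pert = (veh.x - pert_start) % L < pert_len
            prob = p if in_pert else 0.0                     # perturbation section only
        if rng.random() < prob:                              # line 4: decelerate
            v = max(0, v - 1)
        new_v.append(v)
    for veh, v in zip(vs, new_v):                            # line 5: move
        veh.v, veh.x = v, (veh.x + v) % L
    vs.sort(key=lambda car: car.x)                           # restore ring order


rng = random.Random(0)
L = 100
road = [Vehicle(x=x) for x in sorted(rng.sample(range(L), 22))]   # 44 vehicles/km
for veh in road[::3]:                                              # every third is CACC
    veh.self_driving, veh.n_com, veh.d_com = True, 1, 20
for _ in range(100):
    gns_step(road, L, v_limit=5, p=0.2, pert_start=0, pert_len=5, rng=rng)
print(sum(veh.v for veh in road) / len(road))
```
]]></preformat>
        <p>Because the predicted front velocity is a lower bound on how far the front vehicle will actually move, a vehicle never advances more than its gap plus that distance, so vehicles neither collide nor overtake on the single-lane ring; re-sorting after the move only rotates the list.</p>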
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [Forster et al.,
          <year>2012</year>
          ]
          <string-name>
            <given-names>Markus</given-names>
            <surname>Forster</surname>
          </string-name>
          , Raphaël Frank, Mario Gerla, and
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Engel</surname>
          </string-name>
          .
          <article-title>Improving highway traffic through partial velocity synchronization</article-title>
          .
          <source>In Global Communications Conference (GLOBECOM)</source>
          ,
          <year>2012</year>
          IEEE, pages
          <fpage>5573</fpage>
          -
          <lpage>5578</lpage>
          . IEEE,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [Forster et al.,
          <year>2014</year>
          ]
          <string-name>
            <given-names>Markus</given-names>
            <surname>Forster</surname>
          </string-name>
          , Raphael Frank, Mario Gerla, and
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Engel</surname>
          </string-name>
          .
          <article-title>A cooperative advanced driver assistance system to mitigate vehicular traffic shock waves</article-title>
          .
          <source>In INFOCOM</source>
          ,
          <source>2014 Proceedings IEEE</source>
          , pages
          <fpage>1968</fpage>
          -
          <lpage>1976</lpage>
          . IEEE,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [Ishikawa and Arai,
          <year>2015</year>
          ]
          <string-name>
            <given-names>Shota</given-names>
            <surname>Ishikawa</surname>
          </string-name>
          and
          <string-name>
            <given-names>Sachiyo</given-names>
            <surname>Arai</surname>
          </string-name>
          .
          <article-title>Evaluating advantage of sharing information among vehicles toward avoiding phantom traffic jam</article-title>
          .
          <source>In Winter Simulation Conference (WSC)</source>
          ,
          <year>2015</year>
          , pages
          <fpage>300</fpage>
          -
          <lpage>311</lpage>
          . IEEE,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [Kamal et al.,
          <year>2014</year>
          ]
          <string-name>
            <given-names>Md</given-names>
            <surname>Abdus Samad Kamal</surname>
          </string-name>
          , Jun-ichi Imura, Tomohisa Hayakawa, Akira Ohata, and
          <string-name>
            <given-names>Kazuyuki</given-names>
            <surname>Aihara</surname>
          </string-name>
          .
          <article-title>Smart driving of a vehicle using model predictive control for improving traffic flow</article-title>
          .
          <source>IEEE Transactions on Intelligent Transportation Systems</source>
          ,
          <volume>15</volume>
          (
          <issue>2</issue>
          ):
          <fpage>878</fpage>
          -
          <lpage>888</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [Kesting et al.,
          <year>2008</year>
          ]
          <string-name>
            <given-names>Arne</given-names>
            <surname>Kesting</surname>
          </string-name>
          , Martin Treiber, Martin Schönhof, and Dirk Helbing.
          <article-title>Adaptive cruise control design for active congestion avoidance</article-title>
          .
          <source>Transportation Research Part C: Emerging Technologies</source>
          ,
          <volume>16</volume>
          (
          <issue>6</issue>
          ):
          <fpage>668</fpage>
          -
          <lpage>683</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
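          [Khamis and Gomaa, 2014]
          <string-name>
            <given-names>Mohamed A</given-names>
            <surname>Khamis</surname>
          </string-name>
          and
          <string-name>
            <given-names>Walid</given-names>
            <surname>Gomaa</surname>
          </string-name>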
          .
          <article-title>Adaptive multi-objective reinforcement learning with hybrid exploration for traffic signal control based on cooperative multi-agent framework</article-title>
          .
          <source>Engineering Applications of Artificial Intelligence</source>
          ,
          <volume>29</volume>
          :
          <fpage>134</fpage>
          -
          <lpage>151</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [Knorr et al.,
          <year>2012</year>
          ]
          <string-name>
            <given-names>Florian</given-names>
            <surname>Knorr</surname>
          </string-name>
          , Daniel Baselt, Michael Schreckenberg, and
          <string-name>
            <given-names>Martin</given-names>
            <surname>Mauve</surname>
          </string-name>
          .
          <article-title>Reducing traffic jams via VANETs</article-title>
          .
          <source>IEEE Transactions on Vehicular Technology</source>
          ,
          <volume>61</volume>
          (
          <issue>8</issue>
          ):
          <fpage>3490</fpage>
          -
          <lpage>3498</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [Nagel and Schreckenberg, 1992]
          <string-name>
            <given-names>Kai</given-names>
            <surname>Nagel</surname>
          </string-name>
          and
          <string-name>
            <given-names>Michael</given-names>
            <surname>Schreckenberg</surname>
          </string-name>
          .
          <article-title>A cellular automaton model for freeway traffic</article-title>
          .
          <source>Journal de Physique I</source>
          ,
          <volume>2</volume>
          (
          <issue>12</issue>
          ):
          <fpage>2221</fpage>
          -
          <lpage>2229</lpage>
          ,
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [Papacharalampous et al.,
          <year>2015</year>
          ] Alexandros E Papacharalampous, Meng Wang, Victor L Knoop, Bernat Goñi Ros, Toshimichi Takahashi, Ichiro Sakata, Bart van Arem, and Serge P Hoogendoorn.
          <article-title>Mitigating congestion at sags with adaptive cruise control systems</article-title>
          .
          <source>In Intelligent Transportation Systems (ITSC)</source>
          ,
          <year>2015</year>
          IEEE 18th International Conference on, pages
          <fpage>2451</fpage>
          -
          <lpage>2457</lpage>
          . IEEE,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [Rezaee et al.,
          <year>2012</year>
          ]
          <string-name>
            <given-names>Kasra</given-names>
            <surname>Rezaee</surname>
          </string-name>
          , Baher Abdulhai, and
          <string-name>
            <given-names>Hossam</given-names>
            <surname>Abdelgawad</surname>
          </string-name>
          .
          <article-title>Application of reinforcement learning with continuous state space to ramp metering in real-world conditions</article-title>
          .
          <source>In Intelligent Transportation Systems (ITSC)</source>
          ,
          <year>2012</year>
          15th International IEEE Conference on, pages
          <fpage>1590</fpage>
          -
          <lpage>1595</lpage>
          . IEEE,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [Sugiyama et al.,
          <year>2008</year>
          ]
          <string-name>
            <given-names>Yuki</given-names>
            <surname>Sugiyama</surname>
          </string-name>
          , Minoru Fukui, Macoto Kikuchi, Katsuya Hasebe, Akihiro Nakayama, Katsuhiro Nishinari, Shin-ichi Tadaki, and
          <string-name>
            <given-names>Satoshi</given-names>
            <surname>Yukawa</surname>
          </string-name>
          .
          <article-title>Traffic jams without bottlenecks-experimental evidence for the physical mechanism of the formation of a jam</article-title>
          .
          <source>New Journal of Physics</source>
          ,
          <volume>10</volume>
          (
          <issue>3</issue>
          ):
          <fpage>033001</fpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [Sutton and Barto, 1998] Richard S Sutton and
          <string-name>
            <given-names>Andrew G</given-names>
            <surname>Barto</surname>
          </string-name>
          .
          <article-title>Reinforcement learning: An introduction</article-title>
          , volume
          <volume>1</volume>
          . MIT Press, Cambridge,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [Walraven et al.,
          <year>2016</year>
          ]
          <string-name>
            <given-names>Erwin</given-names>
            <surname>Walraven</surname>
          </string-name>
          , Matthijs TJ Spaan, and
          <string-name>
            <given-names>Bram</given-names>
            <surname>Bakker</surname>
          </string-name>
          .
          <article-title>Traffic flow optimization: A reinforcement learning approach</article-title>
          .
          <source>Engineering Applications of Artificial Intelligence</source>
          ,
          <volume>52</volume>
          :
          <fpage>203</fpage>
          -
          <lpage>212</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [Won et al.,
          <year>2017</year>
          ]
          <string-name>
            <given-names>Myounggyu</given-names>
            <surname>Won</surname>
          </string-name>
          , Taejoon Park, and
          <string-name>
            <given-names>Sang H</given-names>
            <surname>Son</surname>
          </string-name>
          .
          <article-title>Toward mitigating phantom jam using vehicle-to-vehicle communication</article-title>
          .
          <source>IEEE Transactions on Intelligent Transportation Systems</source>
          ,
          <volume>18</volume>
          (
          <issue>5</issue>
          ):
          <fpage>1313</fpage>
          -
          <lpage>1324</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [Zolfpour-Arokhlo et al.,
          <year>2014</year>
          ]
          <string-name>
            <given-names>Mortaza</given-names>
            <surname>Zolfpour-Arokhlo</surname>
          </string-name>
          , Ali Selamat, Siti Zaiton Mohd Hashim, and
          <string-name>
            <given-names>Hossein</given-names>
            <surname>Afkhami</surname>
          </string-name>
          .
          <article-title>Modeling of route planning system based on Q value-based dynamic programming with multi-agent reinforcement learning algorithms</article-title>
          .
          <source>Engineering Applications of Artificial Intelligence</source>
          ,
          <volume>29</volume>
          :
          <fpage>163</fpage>
          -
          <lpage>177</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>