<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Journal of Information and Data Management 15 (2024) 186-195. doi:10.5753/jidm.2024.3328</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1007/978-3-319-25808-9_4</article-id>
      <title-group>
        <article-title>Learning From Your Virtual Neighbors: A Reinforcement Learning Approach to Traffic Signal Control</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ana L. C. Bazzan</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Henrique U. Gobbi</string-name>
        </contrib>
        <aff>Computer Science, Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre, Brazil</aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <volume>3173</volume>
      <fpage>47</fpage>
      <lpage>66</lpage>
      <abstract>
        <p>Many problems in traffic management and control are inherently distributed and/or require adaptation to the traffic situation. Hence, there is a close relationship to multiagent reinforcement learning. However, using reinforcement learning poses challenges when the state space is large. For instance, the learning task may take a long time. In order to accelerate this process, we allow agents that are similar to transfer experiences. We run experiments using a traffic network in which we vary the traffic situation along time. We compare our approach to the use of Q-learning without transfer of experiences, and show that there is an overall improvement in the number of stopped vehicles.</p>
      </abstract>
      <kwd-group>
        <kwd>Traffic Signal Control</kwd>
        <kwd>Multiagent Reinforcement Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>2.1. Reinforcement learning</title>
        <p>to be followed by the agent. One of these strategies is ε-greedy, where an action is randomly chosen
(exploration) with a probability ε, or, with probability 1−ε, the best known action is chosen, i.e., the one
with the highest value so far (exploitation).</p>
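        <p>As a minimal sketch (in Python, with illustrative action names; not the authors' implementation), the ε-greedy strategy described above can be written as:</p>

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Pick a random action with probability epsilon (exploration),
    otherwise the action with the highest Q-value so far (exploitation)."""
    actions = list(q_values)
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=q_values.get)

# hypothetical Q-values for two signal actions; epsilon = 0 gives pure exploitation
choice = epsilon_greedy({"keep_phase": 0.4, "switch_phase": 0.9}, epsilon=0.0)
```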
        <p>It is assumed that the agent has sensors to determine its current state and can then decide on an
action. The reward is then used to update its policy, i.e., a mapping from states to actions. This policy
can be generated or computed in several ways.</p>
        <p>
          For the sake of the present discussion, we concentrate on a model-free, off-policy algorithm called
Q-learning [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], which estimates so-called Q-values using a table to store the experienced values of
performing a given action when in a given state. Hence Q-learning is a tabular method, where the state
space and the action space need to be discretized.
        </p>
        <p>In RL, the learning task is usually formulated as a Markov decision process (MDP), which defines the
sets of states and actions, a transition function, and a reward function. Since the transition and the
reward functions are unknown to the agent, its task is precisely to learn them, or at least a model for
them.</p>
        <p>In Q-learning, the value of a state s and action a at time t is updated based on Eq. 1, where α ∈ [0, 1]
is the learning rate, γ ∈ [0, 1] is the discount factor, s(t+1) is the next state, and r is the reward received
when the agent moves from s(t) to s(t+1) after selecting action a in state s(t).</p>
        <p>Q(s(t), a) ← Q(s(t), a) + α (r + γ max<sub>a′</sub> Q(s(t+1), a′) − Q(s(t), a))   (1)</p>
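        <p>A minimal sketch of the tabular update in Eq. 1 (Python; state and action names, and the α and γ values, are illustrative placeholders):</p>

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.05, gamma=0.95):
    """One Q-learning step (Eq. 1): move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    best_next = max(Q[s_next][a2] for a2 in actions)
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])

# Q-table as a nested dict defaulting to 0 for unseen state-action pairs
Q = defaultdict(lambda: defaultdict(float))
q_update(Q, s="s0", a="go", r=-3.0, s_next="s1", actions=["go", "stop"])
# Q["s0"]["go"] is now 0.05 * (-3.0 + 0.95 * 0.0 - 0.0) = -0.15
```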
        <p>When there are multiple agents interacting in a common environment, the RL task (thus, multiagent
RL) is inherently more complex because agents’ actions are highly coupled and agents are trying to
adapt to others that are also learning. Despite this drawback, in many real-world problems, where the
control is decentralized, it might not be possible to avoid a multiagent RL formulation. This is the case
regarding the scenario we deal with in the present paper, namely control of traffic signals, whose basic
concepts we briefly review next.</p>
      </sec>
      <sec id="sec-1-2">
        <title>2.2. RL-based traffic signal control</title>
        <p>
          Besides safety and other issues, one aim of a traffic signal controller is to decide on a split of green times
among the various phases that were designed to deal with geometry and flow issues at an intersection.
This can be done in several ways (for more details, please see a textbook such as [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]). In this paper, the
controller is given a set of phases and has to decide which one will receive right of way (green light).
        </p>
        <p>A phase is defined as a group of non-conflicting movements (e.g., flow in two opposite traffic
directions) that can have a green light at the same time.</p>
        <p>In its simplest form, the control is based on fixed times, whose split of the green time among the
various phases can be computed based on historical data on traffic flow, if available. The problem with
this approach is that it may have difficulties adapting to changes in the traffic demand. This may lead
to an increase in the number of stopped vehicles. To mitigate this problem, it is possible to use an
adaptive scheme and thus give priority to lanes with longer queues (or other measures of performance).
Adaptive approaches based on RL were developed, as we discuss next.</p>
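        <p>As a toy illustration of such an adaptive scheme (not the controller used in this paper), a rule that gives right of way to the phase with the longest queue could look like:</p>

```python
def next_green_phase(queue_lengths):
    """Adaptive rule sketch: grant green to the phase with the longest queue."""
    return max(queue_lengths, key=queue_lengths.get)

# hypothetical queue counts per phase at one intersection
phase = next_green_phase({"north-south": 4, "east-west": 9})
```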
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Related work</title>
      <p>
        Traffic control techniques stem mainly from the areas of control theory and operations research. More
recently though, techniques from artificial intelligence and multiagent systems have also been employed,
especially in connection with RL. RL is used by traffic signals to learn a policy that maps states (usually
queues at intersections) to actions. Due to the number of works that employ RL for traffic control, the
reader is referred to surveys [
        <xref ref-type="bibr" rid="ref3">3, 4, 5, 6, 7</xref>
        ]. In any case, those surveys show that there has already been
a significant contribution of RL techniques to the control of traffic signals. However, some issues arise,
especially if tabular methods (e.g., the aforementioned Q-learning) are used. Tabular methods require
agents to visit all state-action pairs. In order to accelerate this process, in the present paper we propose
the use of transfer of experiences among similar agents.
      </p>
      <p>Other authors have addressed that issue in different ways. Next, we briefly review some of these
approaches.</p>
      <p>A first line of research does use tabular methods, with various levels of discretization of the state
space, thus depicting different levels of performance of the learning task. In this class, well-known
works include [8, 9, 10, 11], among others.</p>
      <p>Other works avoid using tabular methods, replacing them with various kinds of function approximation
techniques such as tile coding [12]; deep neural networks, such as DQN [13]; and linear function
approximation [14].</p>
      <p>Accelerating the RL task underlying traffic signal control was addressed using several techniques.
For example, in [15], agents communicate using a hierarchy of supervising agents; in [16], k-nearest
neighbors is employed to estimate the Q-values of a state by calculating the weighted average of the
Q-value estimates of the k nearest states.</p>
      <p>Finally, the idea of using a graph to establish the relationship among learning agents was used in traffic
networks in [17, 18, 19]; however, these works aimed at driver agents rather than junctions/intersections.</p>
    </sec>
    <sec id="sec-3">
      <title>4. Methodology</title>
      <p>As mentioned, our approach is based on using (potentially non-local) communication to augment the
information each traffic signal agent has. In the next sections, we detail the communication aspect
(Section 4.3). For this purpose we employ a graph of relationships. Because this graph does not
necessarily take only agents that are neighbors into account, we call it a virtual graph (henceforth
VG); this is explained in Section 4.1. The underlying MDP is explained in Section 4.2.</p>
      <sec id="sec-3-1">
        <title>4.1. The Road and the Virtual Graphs</title>
        <p>The road network, as conventionally used in traffic and transportation engineering, is a graph G =
(N, L), where N is the set of junctions (intersections), and L is the set of links. We use the term link,
since it is more commonly used in traffic engineering (and then reserve the term edge for another
graph, as described ahead). One instance of G is depicted in the next section (Fig. 2), where we discuss
that scenario in more detail. Here, it suffices to note that intersections B2 and C2 (for instance) are
geographical neighbors.</p>
        <p>We also define another graph – the virtual graph –, which accounts for non-local information. In
such a graph, we connect two junctions n1 ∈ N and n2 ∈ N, which are not necessarily physically close (as,
e.g., D2 and B2 in Fig. 2), but that may have similar patterns in terms of the learning task. We call this
a virtual graph, denoted by VG = (N, E), where N is as defined before, and E is the set of edges that
connect two junctions that have similar patterns.</p>
        <p>In order to define when two junctions are to be connected in the VG, historical information is collected
for a road network G. This information refers to several attributes: travel time over all links in an
intersection, fuel consumption, and several kinds of gas emissions such as CO, CO2, HC (hydrocarbon),
PMx (particulate matter), and NOx. This information is collected per lane, per time interval, and then
aggregated so that each junction has a single value associated with each of those attributes (in our case,
we collect such data from a microscopic simulator). Moreover, such information is aggregated over a
time window h. Then, values of all attributes are normalized between zero and one.</p>
        <p>After the normalization, the values of the attributes for each pair of junctions are compared. If
two junctions n1 and n2 have the same values for all attributes (given a tolerance value, i.e., ±τ), then an
edge connecting n1 and n2 is inserted in the VG.</p>
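        <p>The edge-insertion rule above can be sketched as follows (Python; the junction IDs and attribute values are made up, and the all-attributes-within-±τ comparison is our reading of the text):</p>

```python
from itertools import combinations

def build_virtual_edges(attributes, tau):
    """attributes: junction ID -> list of normalized attribute values in [0, 1].
    Insert an edge when every attribute of the two junctions agrees within +/- tau."""
    edges = []
    for (j1, v1), (j2, v2) in combinations(attributes.items(), 2):
        if all(abs(x - y) <= tau for x, y in zip(v1, v2)):
            edges.append((j1, j2))
    return edges

# hypothetical normalized attribute vectors (e.g., travel time and CO) per junction
attrs = {"B2": [0.400, 0.210], "D2": [0.401, 0.209], "C2": [0.700, 0.300]}
edges = build_virtual_edges(attrs, tau=0.002)  # connects B2 and D2 only
```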
        <p>Fig. 1a shows an instance of such a virtual graph, whereas Fig. 1b depicts a zoom of that graph, where
some relationships among similar junctions can be better seen. The labels of the vertices are formed by
the junction ID plus the timestamp in which their values were found to be similar.</p>
        <p>The agent keeps a record of each visited state-action pair, as well as an estimate of the Q-values
for each of these pairs. Such values depend on the rewards obtained by performing action a at state
s. Rewards are given by the number of vehicles that are queued in a given intersection. This value is
provided by the microscopic simulator. As explained next, an agent may also receive information from
other agents.</p>
      </sec>
      <sec id="sec-3-2">
        <title>4.3. How Communication Works</title>
        <p>Next we briefly explain how the communication is performed by the elements of the road network
G. We assume that every junction n ∈ N is equipped with a communication device (henceforth,
CommDev) that is able to send and receive messages to and from other junctions. For instance, in Fig. 2,
junctions D2 and B2 have a relationship in the VG that refers to a given time step t. Thus, they initiate a
communication via their respective communication devices in order for D2 to transfer its knowledge to
B2.</p>
        <p>This way, an agent that has received information perceives it as expected rewards for the actions
available to it in that particular state. This means that we extend the learning process vis-a-vis the
standard Q-learning algorithm. When using Q-learning, at any given step t the agents simply update
their Q-values based on the feedback from the action they have just taken at a given state. However, in
our case, agents also update their Q-values based on the expected rewards received by the CommDevs.
This means that at every time in which an agent i needs to make a decision, it asks and receives the
following tuple from each agent j with which it has a relationship in the VG: &lt;s, a, Q(s, a)&gt;, where
j is a virtual neighbor of i. Each of these tuples is then used to update the Q-table of i (in case i has
never seen a given pair (s, a), this pair is inserted in the table with the corresponding Q-value).</p>
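        <p>A minimal sketch of how the received tuples might be folded into the Q-table (Python; keeping the locally learned estimate for already-seen pairs is a simplifying assumption, since the text only specifies insertion of unseen pairs):</p>

```python
def apply_transfer(q_table, received):
    """Merge <s, a, Q(s, a)> tuples from virtual neighbors into agent i's Q-table.
    Unseen (s, a) pairs are inserted with the received value; pairs the agent has
    already visited keep the locally learned estimate (assumption)."""
    for s, a, q in received:
        q_table.setdefault((s, a), q)
    return q_table

q = {("s0", "go"): -0.15}
apply_transfer(q, [("s0", "go", -1.0), ("s1", "stop", -0.4)])
# ("s0", "go") keeps -0.15; ("s1", "stop") is inserted with -0.4
```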
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Experiments, Results, and Analysis</title>
      <sec id="sec-4-1">
        <title>5.1. Scenario</title>
        <p>Simulations were performed using SUMO [20], a microscopic simulator.</p>
        <p>The test scenario is the traffic network with 12 intersections depicted in Fig. 2. Given that the links
are one-way, not all intersections require a traffic signal controller. This is the case of the four corner
intersections, as well as intersections A2, C1, and B3. Other non-signalized intersections are B1 and
C3. These two are regulated by a priority mechanism defined in SUMO, namely all vehicles decelerate
before reaching the intersection and SUMO regulates the right of way at the intersection. We did this in
order to concentrate on the intersections defined by the arterial depicted in the middle of the figure,
namely the one that starts at intersection A2 and ends at D2. In this arterial, there are three signalized
intersections that are then each controlled by a learning agent, namely B2, C2, and D2. Recall that, due
to its geometry (no conflict between approach links), A2 does not require a traffic signal.</p>
        <p>Each of the agents learns using the method described in Section 4.</p>
        <p>We also stress that we have created this network with a particular feature in mind: the total number
of vehicles is kept constant (except for the first steps, when they are still being inserted at the link
B3-C3). This avoids many problems that are common in the literature, where it was not possible to
fully isolate the behavior of the RL approach from SUMO’s mechanism for routing vehicles.</p>
        <p>In the network depicted in Fig. 2, there is a re-routing device in the link located right before intersection
A2 that is responsible for re-routing each vehicle when it reaches that link. Essentially, vehicles keep
running in loops, never leaving the simulation. This allows us to control how the trips use the network,
which is an important question we deal with here, as discussed ahead (non-stationarity). Specifically,
200 trips are generated (which, for this network, represents almost 50% of overall density, given that its
maximum capacity is around 500 vehicles).</p>
        <p>Another feature of our scenario is non-stationarity due to changes in the traffic volume that traverses
each intersection. In previous works such as [21, 16] it was shown that the respective methods were
able to adapt to changes in traffic contexts, or simply contexts. By context we mean that, from time to time,
the way trips are distributed over the routes is changed. Note that this does not mean that the number
of trips (vehicles) using the network changes, but rather that each trip may use a different route, thus
the use of the links, and, hence, of the intersections, differs along time.</p>
        <p>These changes in context occur every 5,000 simulation steps, as depicted in Fig. 3, where the time
steps are represented on the x axis. Note that the figure just shows one change in context (at step 5,000),
while in fact there is a change every 5,000 steps, in a repeated way.</p>
        <p>As seen, in the first context (see bottom part of Fig. 3), both junctions B2 and C2 have similar situations
– receiving a larger volume of vehicles in the horizontal direction – while D2 receives nearly equal volumes
in both directions. At time step 5,000, there is a change in this pattern (see top part of the figure). At
time step 10,000 the former context occurs again, and so on.</p>
      </sec>
      <sec id="sec-4-2">
        <title>5.2. Values of the Parameters</title>
        <p>Each simulation runs for 15,000 seconds. Our plots show the average and deviations over 15 repetitions
of each case.</p>
        <p>The values used for minGreenTime and maxGreenTime were 10 and 50 seconds, respectively.
Also, we recall that one action time step corresponds to five seconds of real-life clock time.</p>
        <p>As for the learning parameters, after experimenting with other values, we have set α = 0.05, γ = 0.95,
and ε starting at 1.0 and decaying by a rate of 0.995 at each action time step, down to a minimum value
of ε = 0.05, which is commonly used in the literature.</p>
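        <p>With the values above, the exploration rate follows a simple multiplicative decay with a floor, which can be sketched as:</p>

```python
def decayed_epsilon(step, eps0=1.0, decay=0.995, eps_min=0.05):
    """Epsilon after `step` action time steps: eps0 * decay**step, floored at eps_min."""
    return max(eps_min, eps0 * decay ** step)

# epsilon starts at 1.0 and bottoms out at the 0.05 floor for large step counts
e_start = decayed_epsilon(0)     # 1.0
e_late = decayed_epsilon(1000)   # 0.05 (0.995**1000 is far below the floor)
```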
        <p>Finally, regarding the VG, we tried several values for the tolerance τ; here we show results for τ = 0.002.</p>
      </sec>
      <sec id="sec-4-3">
        <title>5.3. Results and Discussion</title>
        <p>In order to assess the effectiveness and efficiency of our approach, we compare it to the case where
Q-learning is used without the VG, as well as with the case in which no learning mechanism is used.
Henceforth the latter is referred to as fixed time, since there is a fixed control rule, i.e., the timings
of the green signals are optimized once (by SUMO) and remain fixed. We remark that other types of
controllers – not necessarily based on RL – could be employed for the sake of comparison; however, due
to space limitations, we cannot discuss such results here.</p>
        <p>In all cases, as mentioned, the network used is the one depicted in Fig. 2. The assessment is based on
number of stopped vehicles, as commonly used in the literature.</p>
        <p>Fig. 4 shows the number of stopped vehicles in the whole network (average over 15 repetitions),
and it functions as a baseline, given that there is no learning mechanism of any kind. Rather, as
aforementioned, the signal timings remain fixed. Note the oscillations found in the average curve (the
deviations are given as a light color shadow), caused by the fact that vehicles are randomly routed, but
the timing of the signals is fixed. Further, this is the case for all contexts.</p>
        <p>We now show and discuss the cases in which the agents learn using QL or QL plus the VG.</p>
        <p>(Figure: system_total_stopped over the simulation steps.)</p>
        <p>slightly better. This is due to the fact that the agent at B2 receives information from the other two agents
but, due to the initial exploration, such information may not be adequate, thus leading to the agent
performing actions that are not efficient for its particular state. With time, this changes; here the best
performance is seen within context number two, when B2 receives valuable information from the other
agents.</p>
        <p>For C2 and D2, there was no significant improvement. We are investigating two causes for this
behavior. A first hypothesis is that C2 actually receives less traffic than the other intersections, in all
contexts. This means that, even if the pattern is similar to B2, the magnitude of the flow is not the same.
A second explanation may be related to the way the state space is coded, which could influence the
way the information is communicated to other agents. Note that C2 has a vertical traffic direction that
is opposite to B2, thus technically the states may be different.</p>
        <p>As for D2, we only observe improvements in the third context, which means that this intersection is
receiving noisy information from the other agents. In a future work, we plan to filter out this noise by
means of a smarter mechanism, able to put less weight on the received information, especially when it
deviates too much from the already acquired one, i.e., putting more weight on the local information.</p>
        <p>In short, Fig. 5 shows that there is an advantage in letting virtual neighbors share their experiences. The
performance improves to a different extent among the agents; some profit more than others.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Conclusion and future work</title>
      <p>In this paper, we discuss how multiagent RL can be combined with transfer of experiences. The novelty
of our approach lies in the fact that such transfer does not happen only among neighboring agents, but
rather among agents that have similar traffic patterns at a given time step. This is particularly useful
when dealing with issues that arise when the state space is large. In such cases, a poor discretization
may lead to poor performance.</p>
      <p>We show the use of our approach using a traffic signal control scenario, where we also deal with
changes in context, i.e., in the traffic flow patterns.</p>
      <p>Our results show that, compared to a baseline stemming from standard RL, i.e., Q-learning, the
proposed approach performs better in terms of stopped vehicles and also presents fewer oscillations.</p>
      <p>We recall that we have considered emissions when constructing the virtual graph; however, we have
used only travel time as reward. Thus, as a first next step, we plan to extend the approach to deal with
more than one objective. Secondly, regarding the experiments, we intend to use scenarios in which
agents are heterogeneous (e.g., they have a different set of actions that arises due to a different set of
phases), and investigate the issue of how states are coded and information transferred to other agents.
Finally, we plan to extend the experiments in order to deal with situations where intersections require
more than two phases, and also to perform experiments comparing our results to other kinds of traffic
signal controllers.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>This work is partially supported by CAPES (Coordenação de Aperfeiçoamento de Pessoal de Nível
Superior - Brazil, Finance Code 001). Ana Bazzan is partially supported by CNPq under grant number
304932/2021-3.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Watkins</surname>
          </string-name>
          , Learning from Delayed Rewards,
          <source>Ph.D. thesis</source>
          , University of Cambridge,
          <year>1989</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R. P.</given-names>
            <surname>Roess</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. S.</given-names>
            <surname>Prassas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. R.</given-names>
            <surname>McShane</surname>
          </string-name>
          , Traffic Engineering, 3rd ed.,
          <source>Prentice Hall</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A. L. C.</given-names>
            <surname>Bazzan</surname>
          </string-name>
          ,
          <article-title>Opportunities for multiagent systems and multiagent reinforcement learning in traffic control</article-title>
          ,
          <source>Autonomous Agents and Multiagent Systems</source>
          <volume>18</volume>
          (
          <year>2009</year>
          )
          <fpage>342</fpage>
          -
          <lpage>375</lpage>
          . doi:10.1007/s10458-008-9062-9.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>