<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The Real Deal: A Review of Challenges and Opportunities in Moving Reinforcement Learning-Based Traffic Signal Control Systems Towards Reality</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rex Chen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fei Fang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Norman Sadeh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Software Research, School of Computer Science, Carnegie Mellon University</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Traffic signal control (TSC) is a high-stakes domain that is growing in importance as traffic volume grows globally. An increasing number of works are applying reinforcement learning (RL) to TSC; RL can draw on an abundance of traffic data to improve signalling efficiency. However, RL-based signal controllers have never been deployed. In this work, we provide the first review of challenges that must be addressed before RL can be deployed for TSC. We focus on four challenges involving (1) uncertainty in detection, (2) reliability of communications, (3) compliance and interpretability, and (4) heterogeneous road users. We show that the literature on RL-based TSC has made some progress towards addressing each challenge. However, more work should take a systems thinking approach that considers the impacts of other pipeline components on RL.</p>
      </abstract>
      <kwd-group>
        <kwd>Traffic signal control</kwd>
        <kwd>Reinforcement learning</kwd>
        <kwd>Intelligent transportation system</kwd>
        <kwd>System deployment</kwd>
        <kwd>Review</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        As the traffic volume of metropolitan areas continues to grow worldwide, gridlock is becoming
an increasingly prevalent concern. According to the 2021 Urban Mobility Report [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], gridlock led
to over 4 billion hours of travel delay and more than $100 billion in congestion costs across the United
States in 2021. This not only impacts commercial productivity but also has environmental
consequences. One important mechanism for alleviating gridlock is improving the timing of
trafﬁc signals [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Historically, most jurisdictions have used fixed timing plans based on traffic
models, which assume fixed values of factors such as lane volumes and arrival rates [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. To
minimize implementation burden, traditional traffic signal control (TSC) either uses one fixed
plan throughout the entire day, or rotates through several plans depending on the time of the
day. However, fixed plans cannot respond in real time to changes in traffic demand [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ].
      </p>
      <p>
        Large traffic volumes also offer an abundance of data that can be used for real-time
optimization of signal timing plans. Many deployed systems combine logic-triggered state changes with
data-driven searches over sets of schedules [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. However, an increasing number of approaches
traverse larger search spaces using optimization and scheduling algorithms [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Among these
approaches, reinforcement learning (RL) has yielded significant improvements over fixed and
actuated TSC algorithms in simulations [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. RL allows systems to learn from the consequences
of their decisions, which enables them to achieve continuous self-improvement. Deployments
of RL algorithms have achieved success in a variety of complex domains involving human
interaction, such as card games [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], real-time strategy games [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], and other applications in
transportation such as dispatching for ride-hailing services [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>
        However, to our knowledge, RL-based TSC algorithms have never been deployed. This is in
spite of the fact that papers introducing novel algorithms in this area commonly list real-world
deployment as a goal for future work [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. We believe that this discrepancy has arisen due to a
focus on methodological contributions, instead of on a holistic systems thinking approach based
on the data-to-deployment pipeline [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. If RL-based signal controllers are to achieve success in
deployment, domain experts in TSC and in RL must have a shared view of the problem. We
take a step towards bridging the gap between research and deployment by providing the first
review of challenges that may arise from end-to-end deployments of RL-based TSC, which we
intend to serve as a common basis for collaboration between research in TSC and RL.
      </p>
      <p>We begin by describing our review methodology in Section 1.1. Then, we provide a high-level
review of the fields of TSC and RL in Section 2. Next, we explore four engineering challenges.
For each of these challenges, we will provide a review of (1) how these challenges are significant
concerns for the state of the art in RL-based TSC; (2) what practical considerations relevant to
these challenges have arisen in deployments of non-RL TSC systems; and (3) what progress has
been made in the RL-based TSC literature towards solving these challenges.
• Uncertainty in detection. (Section 3) Typically, RL-based TSC algorithms learn based
on metrics such as queue length or travel time. These require accurate vehicle detection
technologies, which may not always be available in the field. Strategies to deal with
detector uncertainty and failure are a prerequisite of deployment.
• Reliability of communications. (Section 4) Some decentralization is necessary for
RL-based TSC. Coordination between intersections is important for optimizing network-level
metrics, yet most work in RL-based TSC has not considered the practicalities of dealing
with failure and latency in inter-intersection communications.
• Compliance and interpretability. (Section 5) Jurisdictions will not have confidence
in RL-based signal controllers without assurances about compliance to standards (e.g.,
minimum green time) and safety requirements. The interpretability of models is important
for ensuring that signalling plans can be audited and adjusted by stakeholders.
• Heterogeneous road users. (Section 6) Most simulations for RL-based TSC assume that
all cars are the same size and have the same free-flow speed. However, cars share the
road with pedestrians, buses, emergency vehicles, and other road users. Algorithms must
detect and respond to the needs of different road users in a safe, equitable manner.</p>
      <p>Finally, we end with concluding thoughts and suggestions for future work in Section 7.</p>
      <sec id="sec-1-1">
        <title>1.1. Methodology</title>
        <p>
To obtain an overview of the domain of RL-based TSC, we conducted a targeted search on
Google Scholar with the keywords “traffic signal”/“traffic light”, “reinforcement learning”, and
“review”/“survey”. We identified the four challenges addressed in the following sections through
these reviews. From here, we conducted snowball sampling based on their citations to locate
papers in the RL literature that discuss these challenges. For RL papers, we focused on those
published after 2015, since this field has rapidly evolved over the past several years. We also
performed additional targeted Google Scholar searches to find literature which describes non-RL
deployments of TSC, by searching the keywords “traffic signal”/“traffic light” and “adaptive” in
conjunction with the following keywords:
• For Section 3, “uncertainty”, “noise”, “sensing error”, “accuracy”.
• For Section 4, “coordination”, “communication”, “closed loop”, “message”, “NTCIP”.
• For Section 5, “compliance”, “safety”, “accountability”, “interpretability”/“explainability”.
• For Section 6, “pedestrian”/“leading pedestrian interval”, “cyclist”, “transit”, “emergency
vehicle”, “priority”, “preempt”.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <sec id="sec-2-1">
        <title>2.1. Traffic signal control</title>
        <p>
          Traffic signal control (TSC) aims to allocate green time at an intersection to traffic moving in
different directions. Every approach (roadway entering the intersection) is split into lanes for
forward, left-turn, and (possibly) right-turn movements (which may be assumed to always be
permissible) [
          <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
          ]. For efficiency, pairs of compatible movements are often arranged into
phases and signalled simultaneously [
          <xref ref-type="bibr" rid="ref10 ref14 ref15">10, 14, 15</xref>
          ]. The task is to find some division of green time
between phases for each intersection in a road network, which maximizes metrics such as the
throughput of the network. We refer the reader to [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] for details of the problem formulation.
        </p>
        <p>
          Different approaches to dividing green time include choosing phase durations or phase
sequences, or fixing a phase sequence within a cycle and choosing the length of the cycle or
the proportions of each phase within the cycle [
          <xref ref-type="bibr" rid="ref10 ref12 ref15">10, 12, 15</xref>
          ]. Three main types of algorithmic
approaches exist. In fixed-time control, which has historically been a popular strategy [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], a
small number of fixed plans are optimized based on past trafic data under the assumption of
uniform demand. In actuated control, detector inputs (such as vehicle presence data from loop
detectors) are used in conjunction with a fixed set of logical rules. Finally, adaptive control uses
more complex prediction and optimization algorithms to control signalling plans [
          <xref ref-type="bibr" rid="ref12 ref16">12, 16</xref>
          ].
        </p>
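        <p>For illustration, the gap-out logic behind actuated control, as described above, can be sketched as follows. This is a simplified toy rule with hypothetical parameter values, not the logic of any specific deployed controller: green is extended by a passage time upon each detector actuation, bounded by minimum and maximum green times.</p>
        <preformat>
```python
def actuated_green_duration(actuation_times, min_green=5.0,
                            passage_time=3.0, max_green=30.0):
    """Toy gap-out rule: green ends `passage_time` seconds after
    the last actuation it served, bounded by min/max green."""
    green_end = min_green
    for t in sorted(actuation_times):
        if t > green_end:          # gap exceeded: phase already ended
            break
        green_end = max(green_end, t + passage_time)
    return min(green_end, max_green)

# Vehicles actuate the detector at t = 2, 4, and 9 seconds;
# the gap after t = 4 exceeds the passage time, so green ends at t = 7.
duration = actuated_green_duration([2.0, 4.0, 9.0])
```
        </preformat>
        <p>With continuous actuations, the same rule would hold green until the maximum green time is reached, which is the behaviour that adaptive and RL-based methods aim to improve upon.</p>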
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Reinforcement learning</title>
        <p>
          One emerging approach to adaptive control has been reinforcement learning (RL). RL is a
sequential decision-making paradigm wherein agents learn how to act through trial-and-error
interactions with an environment. The goal of RL is to learn policies, which describe how agents
should act given the state of the environment. Early work in reinforcement learning during
the 1980s and 1990s, which included the seminal Q-learning algorithm [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], relied on tabular
enumeration of environment states and agent actions. RL remained relatively difficult to scale
until the emergence of methods based on function approximation in the 2010s, specifically the
use of neural networks for deep RL [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. Since then, the popularity and complexity of RL has
experienced explosive growth. Deep RL has also found novel applications in practical domains
such as robotics, natural language processing, finance, and healthcare [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. Transportation has
been one of the most significant applications of deep RL, with tasks including autonomous
driving [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ], vehicle dispatching [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] and routing [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ], and traffic signal control (see Section 2.3).
We refer the reader to [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] for an in-depth review of the history of reinforcement learning.
        </p>
        <p>
          The body of work that we review in this paper can be seen as a parallel to work in RL for
robotics that attempts to close the gap between simulations and reality. RL methods, especially
deep RL methods, require an abundance of data to learn from environmental interactions. Due
to the cost of real-world data collection, simulators are often employed instead to generate large
quantities of interactions. However, simulators can never perfectly emulate reality. This problem,
which is referred to as the reality gap [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ], has been addressed by the sim-to-real literature.
Some sim-to-real methods employ randomization in sensors and controllers to learn robust
policies (domain randomization); some explicitly model the reality gap and try to unify the
feature spaces of the source and target environments (domain adaptation); some train policies
to generalize across different tasks (meta-RL); some attempt to learn from demonstrations of
behaviour in target environments (imitation learning); and others attempt to improve simulators.
We refer the reader to [
          <xref ref-type="bibr" rid="ref24 ref25">24, 25</xref>
          ] for surveys of these methods. In this work, we draw parallels
between some of these methods and developments in RL-based TSC. However, at the same
time, TSC involves unique challenges that are usually not present in robotics. Environments in
robotics where sim-to-real methods have been applied (see [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ]) are usually highly controlled
with well-defined objectives (e.g., [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ]) and minimal interaction with other agents. However,
TSC may be affected by varying environmental conditions and large numbers of road users.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Related reviews</title>
        <p>Various reviews of applications of RL in TSC have been published. While each of the following
reviews captures distinct aspects of the field that are highly relevant to our work, none of
them have focused on the key issue of practical engineering challenges that present barriers to
deployment, and — crucially — how to solve them instead of leaving them as open problems.</p>
        <p>
          [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ], [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ], and [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ] provided brief syntheses of early RL-based TSC methods in reviews of
applications of AI in transportation. [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ] and [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ] were the first to take a systematic approach to
reviewing RL-based TSC algorithms; the former performed the first experimental comparison of
RL algorithms with a synthetic network, while the latter addressed data sources such as models
of road networks and vehicle arrivals. Both reviewed state, action, and reward formulations.
These reviews considered traditional algorithms in RL such as Q-learning and SARSA.
        </p>
        <p>
          With the increasing popularity of deep learning to address challenges of scalability in RL,
[
          <xref ref-type="bibr" rid="ref32 ref4">4, 32</xref>
          ] (the latter a follow-up to [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ]) both reviewed deep RL methods for TSC and provided
recommendations for designing novel deep RL-based TSC algorithms. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] focused on choosing
state, action, and reward representations, with some discussion of data processing, but did not
consider downstream challenges in deployment. [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ] provided a broad overview of various
algorithm and architecture designs with less of a focus on practicalities.
        </p>
        <p>
          Both [
          <xref ref-type="bibr" rid="ref15 ref33">15, 33</xref>
          ] reviewed alternative state, action, and reward formulations among deep
RL-based TSC algorithms, as well as options for inter-agent coordination and simulation-based
evaluation. They outlined, but did not investigate, challenges to deployment. [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] further
compared deep RL-based algorithms to traditional actuated and adaptive methods. Likewise,
as part of a wider review on deep RL for intelligent trafic systems, [
          <xref ref-type="bibr" rid="ref34">34</xref>
          ] reviewed problem
formulations and the history of algorithmic developments for RL-based TSC. Finally, [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]
performed a highly systematic overview of the past 26 years of research in this domain that
provides quantitative support for some of the patterns that we identify.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Uncertainty in detection</title>
      <sec id="sec-3-1">
        <title>3.1. Significance of challenges</title>
        <p>
          Inputs to RL-based TSC algorithms describe states using abstracted features. These
include vehicles’ queue lengths, positions, and speeds [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. Many works take for granted that
these state features are readily available [
          <xref ref-type="bibr" rid="ref35">35</xref>
          ]. As reported by [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], 67% of surveyed papers
did not envision any specific data sources. Even in papers where potential data sources were
specified, it is unclear how robust the methods would be to detector noise or failure. For
instance, among algorithms that use vehicle positions as state features, [
          <xref ref-type="bibr" rid="ref36 ref37 ref38 ref39">36, 37, 38, 39</xref>
          ] all
used the simulator SUMO to obtain noiseless images of single-intersection toy networks; [
          <xref ref-type="bibr" rid="ref40">40</xref>
          ]
extended this approach with a 3D simulator for images from the perspectives of traffic cameras;
and [
          <xref ref-type="bibr" rid="ref41">41</xref>
          ] used simulated traffic in SUMO based on flow rates from traffic camera footage. Each
of these methods provides a sanitized representation that may not necessarily be representative
of real-world conditions. Furthermore, the loss of information to noise may cause state aliasing
[
          <xref ref-type="bibr" rid="ref42">42</xref>
          ], which hinders the generalizability of learned policies to different demand scenarios [
          <xref ref-type="bibr" rid="ref43">43</xref>
          ].
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Lessons from deployments</title>
        <p>
          Types of instruments for traffic sensing include intrusive detectors (installed into the road
surface) and non-intrusive detectors (mounted above the road surface) [
          <xref ref-type="bibr" rid="ref44 ref45">44, 45</xref>
          ]. Among intrusive
detectors, loop detectors are relatively inexpensive, accurate, and robust to weather and time of
day, but they are also highly vulnerable to wear and tear [
          <xref ref-type="bibr" rid="ref46">46</xref>
          ]. When they fail, loop detectors
are increasingly being replaced by non-intrusive detectors such as video-based and radar
detection systems [
          <xref ref-type="bibr" rid="ref44">44</xref>
          ], which can be flexibly reconfigured to detect different road segments
and vehicle types. However, the accuracy of these systems degrades in inclement weather, and
video detectors are also inaccurate at night and on high-speed roads [
          <xref ref-type="bibr" rid="ref45 ref47">45, 47</xref>
          ]. RL-based signal
controllers must be designed with these limitations in mind; learning ensembles of models [
          <xref ref-type="bibr" rid="ref48">48</xref>
          ]
to capture the strengths of different detectors may improve robustness. Although data about
speed and position from connected vehicles can be useful, penetration remains low, so such data must
be integrated with traditional detector data. [
          <xref ref-type="bibr" rid="ref49">49</xref>
          ] showed in simulations that connected vehicle
data could improve adaptive control even with limited penetration. Furthermore, agencies may
configure their detectors differently. To account for uncertainty in vehicle stopping positions,
for instance, the size of the detection zone behind the stop bar may vary [
          <xref ref-type="bibr" rid="ref50">50</xref>
          ]; detectors may
also report data at different frequencies [
          <xref ref-type="bibr" rid="ref51">51</xref>
          ]. Thus, verifying the mapping from real detector
data to abstract state representations is an important task for RL-based TSC.
        </p>
        <p>
          Agencies often address problems in detection by modifying their detection setup [
          <xref ref-type="bibr" rid="ref44">44</xref>
          ] or by
configuring parameters such as passage time (i.e., the amount of time by which a phase is extended
upon actuation) [
          <xref ref-type="bibr" rid="ref45">45</xref>
          ]. [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] explicitly addressed error in queue length detection for their adaptive
controller SURTRAC. To mitigate underestimation, they used heuristics based on differences in
vehicle counts reported by advance and stop bar detectors [
          <xref ref-type="bibr" rid="ref52">52</xref>
          ]. They considered overestimation
acceptable, as it provides the algorithm with buffer time; similarly, [
          <xref ref-type="bibr" rid="ref53">53</xref>
          ] found that moderate
queue length overestimation significantly improves the performance of adaptive control.
        </p>
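          <p>A minimal sketch of the kind of count-differencing heuristic described above (the logic here is illustrative only, not SURTRAC's actual implementation): the number of vehicles queued between an advance detector and the stop bar can be approximated by the difference of their cumulative counts.</p>
          <preformat>
```python
def estimate_queue(advance_count, stopbar_count):
    """Estimate vehicles queued between the advance detector and
    the stop bar as the difference of their cumulative counts.
    Clipped at zero: detector miscounts can drive the difference
    negative even though a real queue cannot be."""
    return max(0, advance_count - stopbar_count)

# 42 vehicles have passed the advance detector; 37 have crossed
# the stop bar, so roughly 5 vehicles remain queued in between.
queue = estimate_queue(42, 37)
```
          </preformat>
          <p>Note that this estimate inherits any counting errors from both detectors; as discussed above, overestimation is often the more acceptable failure mode for adaptive control.</p>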
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Progress toward solutions</title>
        <p>Two lines of work within RL-based TSC have the potential to address detection uncertainty.</p>
        <p>
          First, various authors have investigated the effects of reducing the dimensionality of the state
space. In particular, [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] showed that complex image representations of intersection state achieve
inferior performance compared to a simple representation containing only vehicle counts and
phases. [
          <xref ref-type="bibr" rid="ref54">54</xref>
          ] reached similar conclusions with a state representation based on queue length.
Both papers also provided optimality results that connected these formulations to traditional
methods in TSC. Meanwhile, [
          <xref ref-type="bibr" rid="ref43 ref55">43, 55</xref>
          ] investigated the effects of switching to coarser state
representations with a single algorithm. [
          <xref ref-type="bibr" rid="ref55">55</xref>
          ] found that occupancy and speed data (e.g., from
loop detectors) yielded near-identical performance to high-fidelity position data (e.g., from
cameras). However, the experiments of [
          <xref ref-type="bibr" rid="ref43">43</xref>
          ] suggested that coarser state discretizations harm
generalization across sudden shifts in trafic flow. Regardless, simpler state representations
could facilitate identification and debugging of issues caused by detection uncertainty.
        </p>
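        <p>To make the contrast concrete, the simple representations favoured by these works might be built as follows; the function and feature names here are hypothetical, chosen only to illustrate a count-and-phase state vector as opposed to a raw image.</p>
        <preformat>
```python
def compact_state(lane_counts, current_phase, num_phases):
    """Build a low-dimensional state vector: per-lane vehicle
    counts concatenated with a one-hot encoding of the phase."""
    phase_onehot = [0.0] * num_phases
    phase_onehot[current_phase] = 1.0
    return [float(c) for c in lane_counts] + phase_onehot

# A four-lane intersection currently in phase 1 of 2 yields a
# six-dimensional state instead of a full image of the scene.
state = compact_state([3, 0, 5, 2], current_phase=1, num_phases=2)
```
        </preformat>
        <p>A state of this size is also far easier to inspect when debugging detection issues than a pixel-level observation.</p>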
        <p>
          Second, other work has attempted to imbue RL-based TSC algorithms with robustness to
detection uncertainty. Several methods are analogous to domain randomization in the
sim-to-real literature [
          <xref ref-type="bibr" rid="ref26 ref56">26, 56</xref>
          ]. The approach of [
          <xref ref-type="bibr" rid="ref57">57</xref>
          ] is closest to the sim-to-real literature: they
randomize weather and lighting conditions in their trafic simulator and train policies based on
the resulting images. [
          <xref ref-type="bibr" rid="ref58">58</xref>
          ] applied Dropout to neural network units to prevent overfitting and
thus to learn robust policies. They evaluated their algorithm with a simulation of probabilistic
detector failure. As is done in adversarial machine learning, [
          <xref ref-type="bibr" rid="ref59">59</xref>
          ] injected Gaussian noise into
queue length observations, and validated their approach with simulations where trucks cause
vehicle count overestimation. Meanwhile, to handle miscalibrated measurements, [
          <xref ref-type="bibr" rid="ref35">35</xref>
          ] combined
next-state prediction with imitation learning from a real traffic controller (SCOOT), [
          <xref ref-type="bibr" rid="ref60">60</xref>
          ] used
autoencoders to denoise input data, and [
          <xref ref-type="bibr" rid="ref61">61</xref>
          ] evaluated the effects of lane-blocking incidents
and detector noise on performance. Finally, in a growing body of work that uses connected
vehicle data for RL, [
          <xref ref-type="bibr" rid="ref62">62</xref>
          ] was the first to explicitly address partial observability by adding the
phase duration into the state space to learn its indirect impact on delay.
        </p>
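        <p>The noise-injection idea above can be sketched as follows; this is a hedged illustration of the general technique, not any cited paper's implementation. During training, Gaussian noise perturbs the queue-length observations so that the learned policy does not overfit to exact counts.</p>
        <preformat>
```python
import random

def noisy_queue_observation(true_queues, sigma=1.0, rng=None):
    """Perturb each lane's true queue length with Gaussian noise,
    clipping at zero since queue lengths cannot be negative."""
    rng = rng or random.Random(0)
    return [max(0.0, q + rng.gauss(0.0, sigma)) for q in true_queues]

# During training, the agent observes perturbed rather than
# exact queues, e.g. for three lanes with true queues 4, 0, 7:
obs = noisy_queue_observation([4, 0, 7], sigma=1.0)
```
        </preformat>
        <p>Tuning the noise scale to match the error characteristics of the deployed detectors, as suggested above, is what ties such a scheme to a specific deployment.</p>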
        <p>Overall, these methods help improve the robustness of RL-based TSC
to detection uncertainty. However, they should be designed and tuned to address the challenges
of specific deployments, leveraging past knowledge to identify and address potential causes of
detector noise or failure. It may also help to model partial observability as part of the problem.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Reliability of communications</title>
      <sec id="sec-4-1">
        <title>4.1. Significance of challenges</title>
        <p>
Some level of controller decentralization is often applied in RL-based TSC, because the
computational cost of RL may be prohibitive when the state and action space dimensionalities are
high. At the same time, to ensure that controllers take the traffic conditions of other
intersections into account for signalling decisions, a growing number of works have implemented
mechanisms for inter-intersection coordination [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ]. Typical approaches involve sharing states
[
          <xref ref-type="bibr" rid="ref63 ref64 ref65 ref66 ref67 ref68">63, 64, 65, 66, 67, 68</xref>
          ], actions [
          <xref ref-type="bibr" rid="ref69">69</xref>
          ], or hidden state representations from neural networks [
          <xref ref-type="bibr" rid="ref70 ref71">70, 71</xref>
          ]
between controllers for neighbouring intersections. While much of this work has focused on
designing neural network architectures to leverage shared information (such as graph neural
networks [
          <xref ref-type="bibr" rid="ref66 ref67 ref70 ref71">66, 67, 70, 71</xref>
          ]), less attention has been devoted to the mechanisms by which
information must be exchanged in the first place. If there are inconsistencies in the availability of
communication infrastructure and detectors between intersections (see also Section 3), it is
unclear how they may affect the performance of RL-based TSC.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Lessons from deployments</title>
        <p>
          In practice, signal controllers are commonly deployed as part of closed-loop systems, where
control is distributed over three levels. At the top level, traffic management centres (TMCs)
make policy-based signalling decisions, often involving dialogue with other stakeholders. These
decisions are used to configure field master controllers (FMCs), which are installed on-site
and coordinate multiple local intersection controllers (LICs) [
          <xref ref-type="bibr" rid="ref72">72</xref>
          ]. Each FMC aggregates traffic
conditions reported by connected LICs to make signalling decisions over a small region; FMCs
also synchronize the clocks of LICs to ensure that they are coordinated [
          <xref ref-type="bibr" rid="ref12 ref14">12, 14</xref>
          ]. As 90% of TSC
systems in the United States are closed-loop [
          <xref ref-type="bibr" rid="ref73">73</xref>
          ], upgrades to adaptive control have largely
been implemented within this hierarchical organization [
          <xref ref-type="bibr" rid="ref51">51</xref>
          ]. LICs may make some limited
decisions based on local traffic conditions, but coordination is still largely delegated to FMCs
even in adaptive control [
          <xref ref-type="bibr" rid="ref72">72</xref>
          ]. Transitioning to adaptive control has also required agencies to
update to Type 2070 or ATC controllers [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], but some controllers in road networks may retain
relatively outdated hardware [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. RL-based signal controllers will likely be deployed into such
ecosystems, where control is distributed hierarchically and different intersections have different
capabilities for control and/or detection. Thus, algorithms based on techniques for domain
adaptation from the sim-to-real literature may be helpful.
        </p>
        <p>
          Messages are sent between controllers and TMCs using multiple communication media in
modern TSC systems [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. For wired connections, fibre optic cables are increasingly replacing
traditional copper wires or coaxial cables. Wireless communication systems implemented using
radio or Wi-Fi are also becoming increasingly common [
          <xref ref-type="bibr" rid="ref44">44</xref>
          ]. Thus, communication bandwidth
is not likely to be a concern, except in jurisdictions where fibre optic infrastructure is not readily
available. However, a major issue reported by agencies in [
          <xref ref-type="bibr" rid="ref44">44</xref>
          ] was connection reliability: poor
signal strength often results in data loss or latency. In terms of data formatting, the NTCIP
1202 standard includes object definitions for actuated signal controllers and has
also been used for adaptive systems [
          <xref ref-type="bibr" rid="ref73">73</xref>
          ]. Communications for RL would need to fit into this
standard, at least until it is updated (as has already been done for connected vehicles) [
          <xref ref-type="bibr" rid="ref74">74</xref>
          ].
In SURTRAC, [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] encoded data for communication between neighbouring intersections using
JSON messages with standard types.
        </p>
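To make the idea of standardized inter-controller messages concrete, here is a minimal sketch of encoding and decoding a neighbour-to-neighbour message with JSON. The field names and schema are our own illustrative assumptions, not SURTRAC's actual message format:

```python
import json

# Hypothetical message an intersection might send a downstream
# neighbour: predicted vehicle arrivals over a short horizon.
# All field names here are illustrative assumptions.
def encode_outflow(intersection_id, horizon_s, arrivals):
    return json.dumps({
        "sender": intersection_id,
        "horizon_s": horizon_s,
        # each arrival: (expected arrival offset in seconds, vehicle count)
        "arrivals": [{"t": t, "count": n} for t, n in arrivals],
    })

def decode_outflow(message):
    data = json.loads(message)
    return data["sender"], [(a["t"], a["count"]) for a in data["arrivals"]]

msg = encode_outflow("int_12", 30, [(4.0, 2), (11.5, 3)])
sender, arrivals = decode_outflow(msg)
```

Using only standard JSON types keeps such messages portable across heterogeneous controller hardware.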
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Progress toward solutions</title>
        <p>
          One line of work in RL-based TSC has sought to learn more compact representations of
information. Although bandwidth is not a concern, reducing message dimensionality could still mitigate
the impact of communication failures. Several algorithms directly exchange state values of
learned policies instead of learning from exchanged state representations. In [
          <xref ref-type="bibr" rid="ref75 ref76">75, 76</xref>
          ], state
values are directly exchanged between neighbours and weighted; [
          <xref ref-type="bibr" rid="ref37 ref77 ref78">37, 77, 78</xref>
          ] leveraged the
max-plus algorithm for coordination graphs, which is known to converge to near-optimality
even for cyclic graphs [
          <xref ref-type="bibr" rid="ref79">79</xref>
          ]. Meanwhile, [
          <xref ref-type="bibr" rid="ref80">80</xref>
          ] designed an architecture to exchange information
from the previous time step to ensure robustness to latency, and showed that it asymptotically
reduces communication relative to neighbour-based approaches by 50%. [
          <xref ref-type="bibr" rid="ref81">81</xref>
          ] demonstrated that
cumulative rewards can be estimated based only on vehicle counts on inbound approaches.
        </p>
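The max-plus coordination referenced above can be sketched on a toy two-intersection graph. The payoff tables below are illustrative values we made up; real deployments iterate messages over larger, possibly cyclic, coordination graphs:

```python
# Max-plus message passing between two agents "a" and "b", each
# choosing one of two phases. f gives local payoffs, f_ab the
# pairwise coordination payoff (all values are illustrative).
ACTIONS = [0, 1]
f = {"a": [1.0, 0.0], "b": [0.0, 0.5]}
f_ab = {(0, 0): 2.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 1.0}

# Message from a to b: mu_ab(a_b) = max_{a_a} [f_a(a_a) + f_ab(a_a, a_b)]
mu_ab = {ab: max(f["a"][aa] + f_ab[(aa, ab)] for aa in ACTIONS) for ab in ACTIONS}
mu_ba = {aa: max(f["b"][ab] + f_ab[(aa, ab)] for ab in ACTIONS) for aa in ACTIONS}

# Each agent maximizes its local payoff plus incoming messages.
best_a = max(ACTIONS, key=lambda aa: f["a"][aa] + mu_ba[aa])
best_b = max(ACTIONS, key=lambda ab: f["b"][ab] + mu_ab[ab])
```

Here the agents jointly select phases (0, 0), which is the global optimum for these payoffs; each agent needed only the small message table from its neighbour, not the neighbour's full state.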
        <p>
          Some work has also focused on designing RL-based TSC algorithms for hierarchically
distributed frameworks of communication and control, which could improve RL’s robustness,
scalability, and applicability for deployment in closed-loop systems. [
          <xref ref-type="bibr" rid="ref82">82</xref>
          ] implemented a
two-level architecture where LICs can either act independently or receive joint actions from FMCs
based on predictions of the regional traffic state. [
          <xref ref-type="bibr" rid="ref63">63</xref>
          ] introduced a feudal RL algorithm, in
which “manager” controllers do not directly control the actions of “worker” controllers, but
instead set goals that influence their rewards. [
          <xref ref-type="bibr" rid="ref83">83</xref>
          ] trained multiple sub-policies that minimize
various proxy metrics such as queue length and waiting time, and a high-level controller that
adaptively delegates control to sub-policies to minimize the longer-term metric of travel time.
However, all of these architectures are conceptual and further work is needed to deploy them.
        </p>
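The feudal goal-setting scheme described above can be sketched as follows. This is our own illustration of the general idea; the goal encoding (target queue lengths) and shaping weight are assumptions, not the cited paper's formulation:

```python
# Feudal reward shaping: the manager never overrides the worker's
# action, it only sets a goal that is folded into the worker's reward.
def worker_reward(base_reward, observation, goal, shaping_weight=0.5):
    """Worker keeps its own objective but is pulled towards the
    manager's goal, here a target queue length per approach."""
    goal_error = sum(abs(obs - g) for obs, g in zip(observation, goal))
    return base_reward - shaping_weight * goal_error

# The manager observes regional congestion and asks the worker to
# empty its queues (target of 0 vehicles on each of two approaches).
r = worker_reward(base_reward=-1.0, observation=[4, 2], goal=[0, 0])
```

The design choice worth noting is that control authority stays local: the manager influences only the reward signal, which preserves the worker's ability to react to local conditions.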
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Compliance and interpretability</title>
      <sec id="sec-5-1">
        <title>5.1. Significance of challenges</title>
        <p>
          At the heart of why RL-based TSC algorithms have not been deployed are the potential
regulatory and safety risks introduced by RL [
          <xref ref-type="bibr" rid="ref15 ref34">15, 34</xref>
          ]. The issue of trust and safety for
RL is by no means exclusive to the domain of TSC [
          <xref ref-type="bibr" rid="ref84 ref85 ref86">84, 85, 86</xref>
          ], but in this case the stakes are
high because controllers must interact with a large number of human users and mistakes may
have fatal consequences. For RL-based signal controllers to be trusted, we need to assess, both
prospectively and retrospectively, whether their decisions comply with standards and reasonable
expectations [
          <xref ref-type="bibr" rid="ref87">87</xref>
          ]. However, the proliferation of deep RL algorithms based on complicated state
representations runs counter to this goal, as assessment of compliance is not possible if we
cannot understand or at least verify their decisions. At the same time, issues of interpretability
and safety have rarely been discussed in the literature on RL-based TSC [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] and are more often
mentioned as desiderata for future work in reviews [
          <xref ref-type="bibr" rid="ref10 ref15 ref34">10, 15, 34</xref>
          ].
        </p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Lessons from deployments</title>
        <p>
          In the real world, regulatory frameworks for traffic signalling are often scattershot. In the United
States, the federal Manual on Uniform Traffic Control Devices [
          <xref ref-type="bibr" rid="ref88">88</xref>
          ] includes standards about the
necessity, meaning, and placement of different traffic signals. Many of these standards involve
the control of individual movement signals, which would be abstracted away from RL through
phase-based action space definitions. However, factors such as yellow change and red clearance
intervals are left to “engineering judgement”. States may impose further requirements on signal
timing plans based on regional transportation policies [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. In a review of signal timing policies
for 15 states, [
          <xref ref-type="bibr" rid="ref89">89</xref>
          ] found recommendations for factors such as minimum green, yellow change,
and red clearance intervals, as well as when to serve turn movements. Such recommendations
should be incorporated into the design of the RL action space, as was done by [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] who treated
safety constraints as inputs to SURTRAC. Yet, these recommendations can also be arbitrary and
dependent on data (e.g., vehicle and pedestrian clearing times [
          <xref ref-type="bibr" rid="ref89">89</xref>
          ]), and algorithmic approaches
to stakeholder preference learning [
          <xref ref-type="bibr" rid="ref90">90</xref>
          ] may help to find better values.
        </p>
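As a concrete example of the "engineering judgement" factors mentioned above, the change intervals are often computed from the kinematic formulas commonly attributed to ITE guidance. The default parameter values below are illustrative, not a regulatory requirement:

```python
# Commonly cited kinematic change-interval formulas:
#   yellow change      Y = t_r + v / (2a + 2Gg)
#   red clearance      R = (w + L) / v
# where t_r is perception-reaction time, v approach speed, a
# comfortable deceleration, G approach grade, w crossing width,
# and L vehicle length. Defaults here are illustrative assumptions.
def yellow_change_interval(speed_mps, reaction_s=1.0, decel_mps2=3.0,
                           grade=0.0, g=9.81):
    return reaction_s + speed_mps / (2 * decel_mps2 + 2 * grade * g)

def red_clearance_interval(speed_mps, crossing_width_m, vehicle_length_m=6.0):
    return (crossing_width_m + vehicle_length_m) / speed_mps

# A 50 km/h (~13.9 m/s) approach crossing a 24 m wide intersection:
y = yellow_change_interval(13.9)
r = red_clearance_interval(13.9, 24.0)
```

Values such as these could serve as hard bounds on an RL action space, so that learned timing plans never violate the clearance intervals a jurisdiction prescribes.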
        <p>
          One common strategy to ensure the safety of signal timing plans is to review common types
and causes of crashes in historical data [
          <xref ref-type="bibr" rid="ref89">89</xref>
          ]. Naturally, this is a reactive approach that requires
crashes to happen in the first place, and crash reports may also be biased by severity or by
environmental conditions [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. Accident modification factors (AMFs) are a popular method of
quantitative analysis; they statistically estimate the effectiveness of particular changes to signal
timing plans based on their expected reductions in crash rate [
          <xref ref-type="bibr" rid="ref91 ref92 ref93">91, 92, 93</xref>
          ]. We are unaware of any
work in RL that estimates or uses AMFs, but they may be a valuable pathway to interpretability.
The Highway Safety Manual also provides standard crash risk assessment models, but these
models often require extensive tuning to local conditions [
          <xref ref-type="bibr" rid="ref94 ref95 ref96">94, 95, 96</xref>
          ].
        </p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Progress toward solutions</title>
        <p>
          Some work has enhanced the interpretability of RL-based TSC through algorithm design. [
          <xref ref-type="bibr" rid="ref97">97</xref>
          ]
focused on learning surrogate policies that are regulatable, i.e. monotonic in state variables,
which allows parameters to be viewed as weights. [
          <xref ref-type="bibr" rid="ref98">98</xref>
          ] learned human-auditable decision tree
surrogates using VIPER, an algorithm that identifies critical states where suboptimality harms
future rewards. Closer to the literature on interpretability for machine learning, [
          <xref ref-type="bibr" rid="ref99">99</xref>
          ] used SHAP
values to analyze how induction loop detections contribute to choices of phases for a controller
in a simulated roundabout. They found that advance detectors have higher SHAP values as
they are more indicative of congestion. Similarly, [
          <xref ref-type="bibr" rid="ref57">57</xref>
          ] used Grad-CAM to generate heatmaps
for image-based inputs. Instead of directly interfacing with the simulator, [
          <xref ref-type="bibr" rid="ref100">100</xref>
          ] used logical
rules based on signal controllers to post-process RL policy outputs to ensure compliance.
        </p>
        <p>
          Further work has applied heuristic modifications to RL algorithms to enforce safety. [
          <xref ref-type="bibr" rid="ref101">101</xref>
          ]
prevented their system from taking actions when pedestrians are detected in crosswalks, and
enforced minimum green times for pedestrians. [
          <xref ref-type="bibr" rid="ref102">102</xref>
          ] drew on their models of rear-end conflict
rates (based on various observable intersection state features [
          <xref ref-type="bibr" rid="ref103">103</xref>
          ]) to design a reward
formulation that minimizes such conflicts. Similarly, [
          <xref ref-type="bibr" rid="ref104">104</xref>
          ] used a binary logistic crash risk model to
define crash penalties while also minimizing waiting time. Using a state formulation based on
individual signals, [
          <xref ref-type="bibr" rid="ref105">105</xref>
          ] regularized the red light duration of signalling plans to mitigate unsafe
behaviour caused by driver frustration with extended red lights. [
          <xref ref-type="bibr" rid="ref106">106</xref>
          ] included yellow change
intervals in their action space and added a penalty for emergency braking by vehicles.
        </p>
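The penalty-shaped safety rewards surveyed above share a common structure: an efficiency term traded off against one or more safety proxies. The following is a generic sketch of that structure; the weights and the form of the risk term are our assumptions, not any cited paper's exact formulation:

```python
# Generic safety-shaped reward for a signal controller: trade off
# efficiency (waiting time) against safety proxies (all weights
# are illustrative assumptions).
def safe_tsc_reward(total_wait_s, crash_risk, emergency_brakes,
                    w_wait=1.0, w_risk=10.0, w_brake=2.0):
    return -(w_wait * total_wait_s
             + w_risk * crash_risk          # e.g. from a logistic risk model
             + w_brake * emergency_brakes)  # hard-braking events this step

r = safe_tsc_reward(total_wait_s=30.0, crash_risk=0.1, emergency_brakes=2)
```

Because the safety terms enter only through the reward, such shaping provides no hard guarantees; that is precisely the gap the constrained-optimization literature discussed below aims to close.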
        <p>
          While we have reviewed many promising methods that have been developed for the
interpretability and safety of RL-based TSC, more work is still needed on determining which of
these methods correspond well to stakeholder requirements. Furthermore, there is a substantial
literature on safe reinforcement learning using constrained optimization [
          <xref ref-type="bibr" rid="ref107 ref108 ref109">107, 108, 109</xref>
          ], which
has hitherto not been applied to TSC; it is likely that such work can provide more rigorous
theoretical guarantees about algorithm behaviour. We also believe that, to deal with safety
failures ethically, work is needed in algorithmic accountability for RL-based signal controllers.
        </p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Heterogeneous road users</title>
      <sec id="sec-6-1">
        <title>6.1. Significance of challenges</title>
        <p>
          Traditional models of traffic flow used for TSC assume, simplistically, that all vehicles are
identical [
          <xref ref-type="bibr" rid="ref110 ref111">110, 111</xref>
          ]. In reality, the assumption of identical or even unimodal traffic is often
unrealistic, because many types of vehicles and road users, each with different needs and
behavioural patterns, interact with each other on roads. RL algorithms can still implicitly
encode these assumptions through simplistic state spaces, since common state variables such as
queue length and vehicle position [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] do not account for inter-vehicle variation. Although such
state formulations can be helpful for deriving optimality results based on traditional models
in TSC [
          <xref ref-type="bibr" rid="ref3 ref54">3, 54</xref>
          ], it is unclear how these assumptions may impact the performance and safety of
RL-based signal controllers in practice, especially because road users such as pedestrians and
cyclists may behave non-intuitively. Dedicated simulators developed for RL-based TSC likewise
abstract away inter-vehicle variation [
          <xref ref-type="bibr" rid="ref112">112</xref>
          ]. Surveying 160 papers on RL-based TSC, [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] found that only three accounted for non-private vehicle types, and only one
accounted for pedestrians.
        </p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Lessons from deployments</title>
        <p>
          In practice, agencies make a variety of adjustments to signalling plans to accommodate different
classes of road users other than regular passenger vehicles, including pedestrians, cyclists,
transit vehicles, and emergency vehicles [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. In this section, we focus on current practice in
the field for pedestrians and transit/emergency vehicles. When balancing the needs of different
road user classes in RL-based signal controllers, stakeholders’ requirements should be taken
into account; in the US, for instance, agencies’ opinions differ on whether preemption for trains
should take priority over pedestrians [
          <xref ref-type="bibr" rid="ref89">89</xref>
          ].
        </p>
        <p>
          For pedestrians, the simplest option is for the pedestrian signal to be activated in the direction
of the through movement, as is implicitly assumed by many works in RL and made explicit in
some (e.g., [
          <xref ref-type="bibr" rid="ref113">113</xref>
          ]). However, doing so may cause pedestrians to impede the flow of left-turning
and right-turning traffic, which creates safety hazards. In practice, leading pedestrian intervals
(LePIs) mitigate this risk by allowing pedestrians to start crossing before cars are permitted to
make turns [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. Alternative phase sequence designs add lagging pedestrian intervals (after
turning phases) or phases exclusively for pedestrians. [
          <xref ref-type="bibr" rid="ref114">114</xref>
          ] developed a benefit-cost model to
assess the safety-delay tradeoffs for LePIs at individual intersections. Beyond safety, additional
work has tried to minimize the delay of pedestrians so that they are treated equitably compared
to drivers, as codified by regulations in Germany, the UK, and China [
          <xref ref-type="bibr" rid="ref115">115</xref>
          ]. For the deployed
SURTRAC system, [
          <xref ref-type="bibr" rid="ref116">116</xref>
          ] adaptively set pedestrian walk intervals based on predicted phase
lengths to avoid cutting them short, while [
          <xref ref-type="bibr" rid="ref117">117</xref>
          ] considered using vehicular volumes and
pedestrian actuation frequencies to switch between controller modes. We are unaware of any
work in RL that has explicitly included LePIs as part of the action space formulation.
        </p>
        <p>
          As for handling transit and emergency vehicles, typical strategies include the prioritization
and preemption of signals. Prioritization handles requests made by vehicles through
vehicle-to-infrastructure (V2I) communications, and may or may not result in adjustments to signalling
plans. Meanwhile, preemption (often used for fire trucks or trains) deterministically replaces
the signal plan with a predefined routine that favours the preempting vehicle. Typically, signal
controllers need multiple cycles after preemption to recover from the interruption [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. The
adaptive SCATS controller natively implements both prioritization and preemption; compared to
prior practice, [
          <xref ref-type="bibr" rid="ref118">118</xref>
          ] found that SCATS’ performance improvements were robust to prioritization,
and [
          <xref ref-type="bibr" rid="ref119">119</xref>
          ] found that it could reduce recovery time from preemption. These results suggest the
potential of implementing prioritization and preemption with RL-based methods; in particular,
explicit modelling of recovery from preemption may further improve recovery times. In addition
to interactions at intersections, RL-based signal controllers should also consider the effects
of transit and emergency vehicles on traffic between intersections. For instance, when buses
are stopped on roads, they may block other traffic from passing. As initial steps towards
implementing bus prioritization in the SURTRAC system, [
          <xref ref-type="bibr" rid="ref120">120</xref>
          ] delayed the allocation of green
time in intersections located downstream from stopped buses, and [
          <xref ref-type="bibr" rid="ref121">121</xref>
          ] predicted bus dwelling
times at stops by leveraging V2I communications.
        </p>
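The preemption behaviour described above, a predefined routine that overrides normal control and is followed by a recovery period, can be sketched as a wrapper around a learned policy. This is our own illustration, not SCATS' or any cited system's logic; in particular, real recovery involves gradually re-synchronizing with coordination plans rather than simply holding a phase:

```python
# Illustrative preemption wrapper around an RL policy (an assumption-
# laden sketch): a preemption request deterministically overrides the
# learned policy, and the controller then spends a fixed number of
# recovery steps before resuming normal control.
class PreemptingController:
    def __init__(self, policy, preempt_phase, recovery_steps=3):
        self.policy = policy
        self.preempt_phase = preempt_phase
        self.recovery_steps = recovery_steps
        self.recovering = 0

    def act(self, state, preempt_request=False):
        if preempt_request:
            self.recovering = self.recovery_steps
            return self.preempt_phase      # predefined routine wins
        if self.recovering > 0:
            self.recovering -= 1
            return self.preempt_phase      # hold during recovery
        return self.policy(state)          # normal RL control

ctrl = PreemptingController(policy=lambda s: s % 4, preempt_phase=0)
actions = [ctrl.act(5, preempt_request=(t == 0)) for t in range(5)]
```

Explicitly modelling the recovery window, as suggested above, would mean learning the post-preemption transition rather than hard-coding it as this sketch does.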
      </sec>
      <sec id="sec-6-3">
        <title>6.3. Progress toward solutions</title>
        <p>
          One paper in RL-based TSC was cited by [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] as explicitly modelling pedestrians: [
          <xref ref-type="bibr" rid="ref101">101</xref>
          ] defined
the reward using the weighted average of the local intersection’s vehicular queue length,
neighbouring intersections’ vehicular queue lengths, and the local intersection’s pedestrian
queue length. Beyond this paper, several other works have explicitly considered pedestrians as
part of the problem formulation. [
          <xref ref-type="bibr" rid="ref122">122</xref>
          ] likewise addressed joint vehicle-pedestrian control at
intersections, but made no assumptions about pedestrian detector capabilities. [
          <xref ref-type="bibr" rid="ref123">123</xref>
          ] used deep
RL to control a signalized crosswalk across a road (with the actions being to set the pedestrian
signal to green or red), and found that it outperformed actuation under moderate levels of
pedestrian demand in simulations. [
          <xref ref-type="bibr" rid="ref61">61</xref>
          ] analyzed the performance of RL-based TSC in the
presence of jaywalking pedestrians that cause vehicles to slow.
        </p>
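The weighted-average reward attributed to [101] above can be sketched in a few lines. The weights below are illustrative assumptions, not the paper's values:

```python
# Pedestrian-aware reward: weighted combination of the local
# intersection's vehicular queue, the average neighbouring vehicular
# queue, and the local pedestrian queue (weights are illustrative).
def ped_aware_reward(local_veh_q, neighbour_veh_qs, local_ped_q,
                     w_local=0.5, w_neigh=0.2, w_ped=0.3):
    neigh_avg = sum(neighbour_veh_qs) / len(neighbour_veh_qs)
    return -(w_local * local_veh_q + w_neigh * neigh_avg + w_ped * local_ped_q)

r = ped_aware_reward(local_veh_q=10, neighbour_veh_qs=[4, 6], local_ped_q=5)
```

The pedestrian weight directly encodes the equity tradeoff discussed earlier: raising it shifts green time towards clearing pedestrian queues at the expense of vehicular delay.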
        <p>
          Several works in RL-based TSC have also considered prioritization and preemption. For
prioritization, [
          <xref ref-type="bibr" rid="ref57">57</xref>
          ] upweighted buses and emergency vehicles in their throughput-based reward
formulation; [
          <xref ref-type="bibr" rid="ref124">124</xref>
          ] used a state representation based on the cell transmission traffic model and
modelled priority as a binary variable; [
          <xref ref-type="bibr" rid="ref125">125</xref>
          ] adopted an implicit approach based on minimizing
delay per person instead of per vehicle; [
          <xref ref-type="bibr" rid="ref126">126</xref>
          ] and [
          <xref ref-type="bibr" rid="ref127">127</xref>
          ] both considered prioritization for trams,
with the former’s rewards being based on tram schedule adherence and the latter using model
predictive control to model driver behaviour; and [
          <xref ref-type="bibr" rid="ref128">128</xref>
          ] adaptively altered vehicles’ priorities
depending on queue length, waiting time, and emergency vehicle presence. For preemption,
[
          <xref ref-type="bibr" rid="ref129">129</xref>
          ] learned TSC policies for emergency vehicle routing with rewards that encourage low
vehicle density, and [
          <xref ref-type="bibr" rid="ref130">130</xref>
          ] used RL to learn policies for notifying connected vehicles to clear out
lanes for emergency vehicles to pass.
        </p>
        <p>
          Lastly, [
          <xref ref-type="bibr" rid="ref100">100</xref>
          ] included demand data from the field for multiple types of road users — including
pedestrians, cyclists, motorcyclists, trucks, and buses — in their benchmark simulation for
RL-based TSC, LemgoRL, which is based on a real road network; they also included pedestrian
waiting times in rewards and enforced minimum pedestrian green times. There is a need to
connect high-fidelity simulations such as LemgoRL to the various approaches for handling
different road user classes that we outlined above, so as to ensure their ecological validity.
        </p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>We have reviewed four barriers to the deployment of RL-based controllers for TSC. Each of these
barriers has been insufficiently addressed by the majority of new work in RL-based TSC, which
has focused on algorithmic contributions. However, TSC algorithms do not exist in a vacuum —
they must be trained based on data from detectors, interface with signals through controllers,
and control the movements of a variety of road users. Challenges both intrinsic to RL algorithms
and in other pipeline components may cascade into failures with significant implications for
the efficiency and safety of transportation infrastructure. Based on our literature review, we
suggested ways in which further work in RL-based TSC could address these challenges.</p>
      <p>
        Echoing the recommendations of [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], we emphasize the importance of RL practitioners engaging in
consultation with agency stakeholders and experts in TSC. This can break down
information silos that would otherwise prevent the recognition of issues during requirements
engineering and integration (cf. [
        <xref ref-type="bibr" rid="ref131">131</xref>
        ]); we could not have identified these challenges ourselves
without engaging with the literature on traditional TSC. Additionally, as we discussed, the
practicalities of these challenges — including the availability and configuration of detectors,
signalling constraints, and the priorities of different road users — will often vary depending on the
status of road networks and their responsible agencies. While benchmark simulations based
on synthetic networks facilitate evaluation, we advocate for the creation of more simulations
like [
        <xref ref-type="bibr" rid="ref100">100</xref>
        ] that incorporate realistic domain constraints. RL algorithms that are trained using
such benchmarks would likely have better generalizability and robustness in deployments.
      </p>
      <p>
        More generally, we uncovered a diversity of work that addresses each challenge, which
previous reviews of TSC have not comprehensively surveyed. This suggests that RL-based
TSC is closer to deployment than might be suggested by a review of state-of-the-art methods.
If future developments focus on combining algorithmic improvements with both real-world
considerations and reproducibility techniques to facilitate collaboration [
        <xref ref-type="bibr" rid="ref132">132</xref>
        ], we believe that
the integration of RL to improve real-world transportation infrastructure is within reach.
      </p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>The authors thank Christian Kästner, Eunsuk Kang, Stephanie Milani, Peide Huang, Ryan Shi,
and Steven Jecmen for useful information and suggestions that they provided to support the
drafting of this review.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Schrank</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Albert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Eisele</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lomax</surname>
          </string-name>
          ,
          <source>2021 Urban Mobility Report, Technical Report</source>
          ,
          <institution>Texas A&amp;M Transportation Institute</institution>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Franzese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Greene</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hwang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Gibson</surname>
          </string-name>
          ,
          <article-title>Temporary losses of highway capacity and impacts on performance: Phase 2</article-title>
          ,
          Technical Report ORNL/TM-2004/209, Oak Ridge National Laboratory,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>G.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Gayah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Diagnosing reinforcement learning for traffic signal control</article-title>
          , arXiv preprint arXiv:1905.04716 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Gregurić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vujić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Alexopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Miletić</surname>
          </string-name>
          ,
          <article-title>Application of deep reinforcement learning in traffic signal control: An overview and impact of open traffic data</article-title>
          ,
          <source>Applied Sciences</source>
          <volume>10</volume>
          (
          <year>2020</year>
          )
          <fpage>4011</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Barlow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.-F.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Rubinstein</surname>
          </string-name>
          ,
          <article-title>Smart urban signal networks: Initial application of the SURTRAC adaptive traffic signal control system</article-title>
          ,
          <source>in: Proceedings of the 23rd International Conference on Automated Planning and Scheduling</source>
          ,
          <source>ICAPS '13</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>434</fpage>
          -
          <lpage>442</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Toward a thousand lights: Decentralized deep reinforcement learning for large-scale traffic signal control</article-title>
          ,
          <source>in: Proceedings of the 34th AAAI Conference on Artificial Intelligence</source>
          , AAAI '20
          ,
          <year>2020</year>
          , pp.
          <fpage>3414</fpage>
          -
          <lpage>3421</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>N.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Sandholm</surname>
          </string-name>
          ,
          <article-title>Superhuman AI for heads-up no-limit poker: Libratus beats top professionals</article-title>
          ,
          <source>Science</source>
          <volume>359</volume>
          (
          <year>2017</year>
          )
          <fpage>418</fpage>
          -
          <lpage>424</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Babuschkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. M.</given-names>
            <surname>Czarnecki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mathieu</surname>
          </string-name>
          ,
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dudzik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. H.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Powell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ewalds</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Georgiev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Oh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Horgan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kroiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Danihelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sifre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Agapiou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jaderberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Vezhnevets</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Leblond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Pohlen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dalibard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Budden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sulsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Molloy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. L.</given-names>
            <surname>Paine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gulcehre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Pfaff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ring</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yogatama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wünsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>McKinney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Schaul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lillicrap</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kavukcuoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hassabis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Apps</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Silver</surname>
          </string-name>
          ,
          <article-title>Grandmaster level in StarCraft II using multi-agent reinforcement learning</article-title>
          ,
          <source>Nature</source>
          <volume>575</volume>
          (
          <year>2019</year>
          )
          <fpage>350</fpage>
          -
          <lpage>354</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Z. T.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <article-title>Ride-hailing order dispatching at DiDi via reinforcement learning</article-title>
          ,
          <source>INFORMS Journal on Applied Analytics</source>
          <volume>50</volume>
          (
          <year>2020</year>
          )
          <fpage>272</fpage>
          -
          <lpage>286</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Noaeen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Naik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Goodman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Crebo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Abrar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z. S. H.</given-names>
            <surname>Abad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Bazzan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Far</surname>
          </string-name>
          ,
          <article-title>Reinforcement learning in urban network traffic signal control: A systematic literature review</article-title>
          ,
          <source>Expert Systems with Applications</source>
          <volume>199</volume>
          (
          <year>2022</year>
          )
          <fpage>116830</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Perrault</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sinha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tambe</surname>
          </string-name>
          ,
          <article-title>AI for social impact: Learning and planning in the data-to-deployment pipeline</article-title>
          ,
          <source>arXiv preprint</source>
          (
          <year>2019</year>
          ). arXiv:2001.00088.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>R. L.</given-names>
            <surname>Gordon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Tighe</surname>
          </string-name>
          ,
          <source>Traffic Control Systems Handbook</source>
          , Federal Highway Administration,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>G.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Learning phase competition for traffic signal control</article-title>
          ,
          <source>in: Proceedings of the 28th ACM International Conference on Information and Knowledge Management</source>
          ,
          <source>CIKM '19</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1963</fpage>
          -
          <lpage>1972</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>P.</given-names>
            <surname>Koonce</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Rodegerdts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Quayle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Beaird</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Braud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bonneson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Tarnoff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Urbanik</surname>
          </string-name>
          ,
          <source>Traffic Signal Timing Manual</source>
          , Federal Highway Administration,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Gayah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>A survey on traffic signal control methods</article-title>
          ,
          <source>arXiv preprint</source>
          (
          <year>2019</year>
          ). arXiv:1904.08117.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M.</given-names>
            <surname>Eom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.-I.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <article-title>The traffic signal control problem for intersections: a review</article-title>
          ,
          <source>European Transport Research Review</source>
          <volume>12</volume>
          (
          <year>2020</year>
          )
          <fpage>50</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>C. J. C. H.</given-names>
            <surname>Watkins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dayan</surname>
          </string-name>
          ,
          <article-title>Q-learning</article-title>
          ,
          <source>Machine Learning</source>
          <volume>8</volume>
          (
          <year>1992</year>
          )
          <fpage>279</fpage>
          -
          <lpage>292</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>V.</given-names>
            <surname>Mnih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kavukcuoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Silver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Rusu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Veness</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Bellemare</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Graves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Riedmiller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Fidjeland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ostrovski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Petersen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Beattie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sadik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Antonoglou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>King</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kumaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wierstra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Legg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hassabis</surname>
          </string-name>
          ,
          <article-title>Human-level control through deep reinforcement learning</article-title>
          ,
          <source>Nature</source>
          <volume>518</volume>
          (
          <year>2015</year>
          )
          <fpage>529</fpage>
          -
          <lpage>533</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Deep reinforcement learning</article-title>
          ,
          <source>arXiv preprint</source>
          (
          <year>2018</year>
          ). arXiv:1810.06339.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Kiran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sobh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Talpaert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mannion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A. A.</given-names>
            <surname>Sallab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yogamani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pérez</surname>
          </string-name>
          ,
          <article-title>Deep reinforcement learning for autonomous driving: A survey</article-title>
          ,
          <source>IEEE Transactions on Intelligent Transportation Systems</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>M.</given-names>
            <surname>Nazari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Oroojlooy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Takáč</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. V.</given-names>
            <surname>Snyder</surname>
          </string-name>
          ,
          <article-title>Reinforcement learning for solving the vehicle routing problem</article-title>
          ,
          <source>in: Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS '18</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>9861</fpage>
          -
          <lpage>9871</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Sutton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Barto</surname>
          </string-name>
          ,
          <article-title>Early history of reinforcement learning</article-title>
          ,
          <source>in: Reinforcement Learning: An Introduction</source>
          , The MIT Press,
          <year>2018</year>
          , pp.
          <fpage>11</fpage>
          -
          <lpage>17</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>J.-B.</given-names>
            <surname>Mouret</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chatzilygeroudis</surname>
          </string-name>
          ,
          <article-title>20 years of reality gap: a few thoughts about simulators in evolutionary robotics</article-title>
          ,
          <source>in: Proceedings of the 2017 Genetic and Evolutionary Computation Conference Companion, GECCO '17</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>1121</fpage>
          -
          <lpage>1124</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Queralta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Westerlund</surname>
          </string-name>
          ,
          <article-title>Sim-to-real transfer in deep reinforcement learning for robotics: a survey</article-title>
          ,
          <source>in: Proceedings of the 2020 IEEE Symposium Series on Computational Intelligence, SSCI '20</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>737</fpage>
          -
          <lpage>744</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>K.</given-names>
            <surname>Dimitropoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Hatzilygeroudis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chatzilygeroudis</surname>
          </string-name>
          ,
          <article-title>A brief survey of Sim2Real methods for robot learning</article-title>
          ,
          <source>in: Proceedings of the 2022 International Conference on Robotics in Alpe-Adria Danube Region, RAAD '22</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>133</fpage>
          -
          <lpage>140</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>M.</given-names>
            <surname>Andrychowicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Baker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chociej</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Józefowicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>McGrew</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pachocki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Petron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Plappert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Powell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schneider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sidor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tobin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Welinder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Weng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zaremba</surname>
          </string-name>
          ,
          <article-title>Learning dexterous in-hand manipulation</article-title>
          ,
          <source>The International Journal of Robotics Research</source>
          <volume>39</volume>
          (
          <year>2020</year>
          )
          <fpage>3</fpage>
          -
          <lpage>20</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>B.</given-names>
            <surname>Abdulhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kattan</surname>
          </string-name>
          ,
          <article-title>Reinforcement learning: Introduction to theory and potential for transport applications</article-title>
          ,
          <source>Canadian Journal of Civil Engineering</source>
          <volume>30</volume>
          (
          <year>2003</year>
          )
          <fpage>981</fpage>
          -
          <lpage>991</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>A. L. C.</given-names>
            <surname>Bazzan</surname>
          </string-name>
          ,
          <article-title>Opportunities for multiagent systems and multiagent reinforcement learning in traffic control</article-title>
          ,
          <source>Autonomous Agents and Multi-Agent Systems</source>
          <volume>18</volume>
          (
          <year>2009</year>
          )
          <fpage>342</fpage>
          -
          <lpage>375</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>A. L. C.</given-names>
            <surname>Bazzan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Klügl</surname>
          </string-name>
          ,
          <article-title>A review on agent-based technology for traffic and transportation</article-title>
          ,
          <source>The Knowledge Engineering Review</source>
          <volume>29</volume>
          (
          <year>2013</year>
          )
          <fpage>375</fpage>
          -
          <lpage>403</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>P.</given-names>
            <surname>Mannion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Duggan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Howley</surname>
          </string-name>
          ,
          <article-title>An experimental review of reinforcement learning algorithms for adaptive traffic signal control</article-title>
          ,
          <source>in: Autonomic Road Transport Support Systems</source>
          , Springer,
          <year>2016</year>
          , pp.
          <fpage>47</fpage>
          -
          <lpage>66</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>K.-L. A.</given-names>
            <surname>Yau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Qadir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. L.</given-names>
            <surname>Khoo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. H.</given-names>
            <surname>Ling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Komisarczuk</surname>
          </string-name>
          ,
          <article-title>A survey on reinforcement learning models and algorithms for traffic signal control</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>50</volume>
          (
          <year>2017</year>
          )
          <fpage>34</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>F.</given-names>
            <surname>Rasheed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.-L. A.</given-names>
            <surname>Yau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Noor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-C.</given-names>
            <surname>Low</surname>
          </string-name>
          ,
          <article-title>Deep reinforcement learning for traffic signal control: A review</article-title>
          ,
          <source>IEEE Access</source>
          <volume>8</volume>
          (
          <year>2020</year>
          )
          <fpage>208016</fpage>
          -
          <lpage>208044</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Gayah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Recent advances in reinforcement learning for traffic signal control: A survey of models and evaluation</article-title>
          ,
          <source>ACM SIGKDD Explorations Newsletter</source>
          <volume>22</volume>
          (
          <year>2021</year>
          )
          <fpage>12</fpage>
          -
          <lpage>18</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>A.</given-names>
            <surname>Haydari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yilmaz</surname>
          </string-name>
          ,
          <article-title>Deep reinforcement learning for intelligent transportation systems: A survey</article-title>
          ,
          <source>IEEE Transactions on Intelligent Transportation Systems</source>
          <volume>23</volume>
          (
          <year>2022</year>
          )
          <fpage>11</fpage>
          -
          <lpage>32</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X. T.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Deep Q learning-based traffic signal control algorithms: Model development and evaluation with field data</article-title>
          ,
          <source>Journal of Intelligent Transportation Systems</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>W.</given-names>
            <surname>Genders</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Razavi</surname>
          </string-name>
          ,
          <article-title>Using a deep reinforcement learning agent for traffic signal control</article-title>
          , arXiv preprint (
          <year>2016</year>
          ). arXiv:1611.01142.
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>E. van der</given-names>
            <surname>Pol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. A.</given-names>
            <surname>Oliehoek</surname>
          </string-name>
          ,
          <article-title>Coordinated deep reinforcement learners for traffic light control</article-title>
          ,
          <source>in: Proceedings of the 30th Conference on Neural Information Processing Systems, NIPS '16</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Mousavi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schukat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Howley</surname>
          </string-name>
          ,
          <article-title>Traffic light control using deep policy-gradient and value-function based reinforcement learning</article-title>
          ,
          <source>IET Intelligent Transport Systems</source>
          <volume>11</volume>
          (
          <year>2017</year>
          )
          <fpage>417</fpage>
          -
          <lpage>423</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>X.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <article-title>A deep reinforcement learning network for traffic light cycle control</article-title>
          ,
          <source>IEEE Transactions on Vehicular Technology</source>
          <volume>68</volume>
          (
          <year>2019</year>
          )
          <fpage>1243</fpage>
          -
          <lpage>1253</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>D.</given-names>
            <surname>Garg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Vogiatzis</surname>
          </string-name>
          ,
          <article-title>Deep reinforcement learning for autonomous traffic light control</article-title>
          ,
          <source>in: Proceedings of the 2018 3rd International Conference on Intelligent Transportation Engineering, ICITE '18</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>214</fpage>
          -
          <lpage>218</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>IntelliLight: A reinforcement learning approach for intelligent traffic light control</article-title>
          ,
          <source>in: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining, KDD '18</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>2496</fpage>
          -
          <lpage>2505</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>M. T. J.</given-names>
            <surname>Spaan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Vlassis</surname>
          </string-name>
          ,
          <article-title>A point-based POMDP algorithm for robot planning</article-title>
          ,
          <source>in: Proceedings of the 2004 IEEE International Conference on Robotics and Automation</source>
          ,
          <source>ICRA '04</source>
          ,
          <year>2004</year>
          , pp.
          <fpage>2399</fpage>
          -
          <lpage>2404</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>L. N.</given-names>
            <surname>Alegre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Bazzan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. C.</given-names>
            <surname>da Silva</surname>
          </string-name>
          ,
          <article-title>Quantifying the impact of non-stationarity in reinforcement learning-based traffic signal control</article-title>
          ,
          <source>PeerJ Computer Science</source>
          <volume>7</volume>
          (
          <year>2021</year>
          )
          <fpage>e575</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          [44]
          <string-name>
            <given-names>D.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Dodoo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rubio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. K.</given-names>
            <surname>Penumala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pratt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sunkari</surname>
          </string-name>
          ,
          <source>Synthesis study of Texas signal control systems</source>
          , Technical Report FHWA/TX-13/0-6670-1, Texas A&amp;M Transportation Institute
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          [45]
          <string-name>
            <given-names>S.</given-names>
            <surname>Sunkari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bibeka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Chaudhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Balke</surname>
          </string-name>
          ,
          <article-title>Impact of Traffic Signal Controller Settings on the Use of Advanced Detection Devices</article-title>
          ,
          <source>Technical Report FHWA/TX-18/0-6934-R1</source>
          ,
          Texas A&amp;M Transportation Institute
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          [46]
          <string-name>
            <given-names>D.</given-names>
            <surname>Gibson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. K. P.</given-names>
            <surname>Mills</surname>
          </string-name>
          , D. R. Jr.,
          <article-title>Staying in the loop: The search for improved reliability of traffic sensing systems through smart test instruments</article-title>
          ,
          <source>Public Roads</source>
          <volume>62</volume>
          (
          <year>1998</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          [47]
          <string-name>
            <given-names>A.</given-names>
            <surname>Rhodes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Bullock</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Sturdevant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z. T.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <article-title>Evaluation of Stop Bar Video Detection Accuracy at Signalized Intersections</article-title>
          ,
          <source>Technical Report FHWA/IN/JTRP-2005/28</source>
          , Joint Transportation Research Program, Indiana Department of Transportation and Purdue University,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          [48]
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Laskin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Srinivas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Abbeel</surname>
          </string-name>
          ,
          <article-title>SUNRISE: A simple unified framework for ensemble learning in deep reinforcement learning</article-title>
          ,
          <source>in: Proceedings of the 38th International Conference on Machine Learning, ICML '21</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>6131</fpage>
          -
          <lpage>6141</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          [49]
          <string-name>
            <given-names>S. M. A. B. A.</given-names>
            <surname>Islam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tajalli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mohebifard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hajbabaie</surname>
          </string-name>
          ,
          <article-title>Effects of connectivity and traffic observability on an adaptive traffic signal control system</article-title>
          ,
          <source>Transportation Research Record</source>
          <volume>2675</volume>
          (
          <year>2021</year>
          )
          <fpage>800</fpage>
          -
          <lpage>814</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>
          [50]
          <string-name>
            <given-names>A. M. T.</given-names>
            <surname>Emtenan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Day</surname>
          </string-name>
          ,
          <article-title>Impact of detector configuration on performance measurement and signal operations</article-title>
          ,
          <source>Transportation Research Record</source>
          <volume>2674</volume>
          (
          <year>2020</year>
          )
          <fpage>300</fpage>
          -
          <lpage>313</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref51">
        <mixed-citation>
          [51]
          <string-name>
            <given-names>F.</given-names>
            <surname>Luyanda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Gettman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Head</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shelby</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bullock</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mirchandani</surname>
          </string-name>
          ,
          <article-title>ACS-Lite algorithmic architecture: Applying adaptive control system technology to closed-loop traffic signal control systems</article-title>
          ,
          <source>Transportation Research Record</source>
          <volume>1856</volume>
          (
          <year>2003</year>
          )
          <fpage>175</fpage>
          -
          <lpage>184</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref52">
        <mixed-citation>
          [52]
          <string-name>
            <given-names>X.-F.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. J.</given-names>
            <surname>Barlow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. F.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z. B.</given-names>
            <surname>Rubinstein</surname>
          </string-name>
          ,
          <article-title>Accounting for Real-World Uncertainty in Real-Time Adaptive Traffic Control</article-title>
          ,
          <source>Technical Report ATCSTR12</source>
          , Carnegie Mellon University,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref53">
        <mixed-citation>
          [53]
          <string-name>
            <given-names>C.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hengst</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Aydos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Geers</surname>
          </string-name>
          ,
          <article-title>On the performance of adaptive traffic signal control</article-title>
          ,
          <source>in: Proceedings of the Second International Workshop on Computational Transportation Science, ICWTS '09</source>
          ,
          <year>2009</year>
          , pp.
          <fpage>37</fpage>
          -
          <lpage>42</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref54">
        <mixed-citation>
          [54]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Gayah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>PressLight: Learning max pressure control to coordinate traffic signals in arterial network</article-title>
          ,
          <source>in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining, KDD '19</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1290</fpage>
          -
          <lpage>1298</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref55">
        <mixed-citation>
          [55]
          <string-name>
            <given-names>W.</given-names>
            <surname>Genders</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Razavi</surname>
          </string-name>
          ,
          <article-title>Evaluating reinforcement learning state representations for adaptive traffic signal control</article-title>
          ,
          <source>in: Proceedings of the 9th International Conference on Ambient Systems, Networks and Technologies, ANT '18</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>26</fpage>
          -
          <lpage>33</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref56">
        <mixed-citation>
          [56]
          <string-name>
            <given-names>J.</given-names>
            <surname>Tobin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schneider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zaremba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Abbeel</surname>
          </string-name>
          ,
          <article-title>Domain randomization for transferring deep neural networks from simulation to the real world</article-title>
          ,
          <source>in: Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS '17</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>23</fpage>
          -
          <lpage>30</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref57">
        <mixed-citation>
          [57]
          <string-name>
            <given-names>D.</given-names>
            <surname>Garg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Vogiatzis</surname>
          </string-name>
          ,
          <article-title>Fully-autonomous, vision-based traffic signal control: from simulation to reality</article-title>
          ,
          <source>in: Proceedings of the 21st International Conference on Autonomous Agents and MultiAgent Systems, AAMAS '22</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>454</fpage>
          -
          <lpage>462</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref58">
        <mixed-citation>
          [58]
          <string-name>
            <given-names>F.</given-names>
            <surname>Rodrigues</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Azevedo</surname>
          </string-name>
          ,
          <article-title>Towards robust deep reinforcement learning for traffic signal control: Demand surges, incidents and sensor failures</article-title>
          ,
          <source>in: Proceedings of the 2019 International Conference on Intelligent Transportation Systems, ITSC '19</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>3559</fpage>
          -
          <lpage>3566</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref59">
        <mixed-citation>
          [59]
          <string-name>
            <given-names>K. L.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sarkar</surname>
          </string-name>
          ,
          <article-title>Robust deep reinforcement learning for traffic signal control</article-title>
          ,
          <source>Journal of Big Data Analytics in Transportation</source>
          <volume>2</volume>
          (
          <year>2020</year>
          )
          <fpage>263</fpage>
          -
          <lpage>274</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref60">
        <mixed-citation>
          [60]
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>A regional traffic signal control strategy with deep reinforcement learning</article-title>
          ,
          <source>in: Proceedings of the 37th Chinese Control Conference, CCC '18</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>7690</fpage>
          -
          <lpage>7695</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref61">
        <mixed-citation>
          [61]
          <string-name>
            <given-names>M.</given-names>
            <surname>Aslani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Seipel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Mesgari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiering</surname>
          </string-name>
          ,
          <article-title>Traffic signal optimization through discrete and continuous reinforcement learning with robustness analysis in downtown Tehran</article-title>
          ,
          <source>Advanced Engineering Informatics</source>
          <volume>38</volume>
          (
          <year>2018</year>
          )
          <fpage>639</fpage>
          -
          <lpage>655</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref62">
        <mixed-citation>
          [62]
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ruan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Di</surname>
          </string-name>
          ,
          <article-title>CVLight: Decentralized learning for adaptive traffic signal control with connected vehicles</article-title>
          , arXiv preprint (
          <year>2021</year>
          ). arXiv:2104.10340.
        </mixed-citation>
      </ref>
      <ref id="ref63">
        <mixed-citation>
          [63]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Feudal multi-agent deep reinforcement learning for traffic signal control</article-title>
          ,
          <source>in: Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS '20</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>816</fpage>
          -
          <lpage>824</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref64">
        <mixed-citation>
          [64]
          <string-name>
            <given-names>T.</given-names>
            <surname>Chu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Codecà</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Multi-agent deep reinforcement learning for large-scale traffic signal control</article-title>
          ,
          <source>IEEE Transactions on Intelligent Transportation Systems</source>
          <volume>21</volume>
          (
          <year>2020</year>
          )
          <fpage>1086</fpage>
          -
          <lpage>1095</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref65">
        <mixed-citation>
          [65]
          <string-name>
            <given-names>M.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <article-title>Network-wide traffic signal control based on the discovery of critical nodes and deep reinforcement learning</article-title>
          ,
          <source>Journal of Intelligent Transportation Systems</source>
          <volume>24</volume>
          (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref66">
        <mixed-citation>
          [66]
          <string-name>
            <given-names>M.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <article-title>Traffic signal control with reinforcement learning based on region-aware cooperative strategy</article-title>
          ,
          <source>IEEE Transactions on Intelligent Transportation Systems</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref67">
        <mixed-citation>
          [67]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <article-title>GraphLight: Graph-based reinforcement learning for traffic signal control</article-title>
          ,
          <source>in: Proceedings of the 6th International Conference on Computer and Communication Systems, ICCCS '21</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>645</fpage>
          -
          <lpage>650</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref68">
        <mixed-citation>
          [68]
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Braud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Alhilal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Hui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kangasharju</surname>
          </string-name>
          ,
          <article-title>ERL: Edge based reinforcement learning for optimized urban traffic light control</article-title>
          ,
          <source>in: Proceedings of the 3rd International Workshop on Smart Edge Computing and Networking</source>
          ,
          <source>SmartEdge '19</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>849</fpage>
          -
          <lpage>854</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref69">
        <mixed-citation>
          [69]
          <string-name>
            <given-names>H.</given-names>
            <surname>Ge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <article-title>Cooperative deep Q-learning with Q-value transfer for multi-intersection signal control</article-title>
          ,
          <source>IEEE Access 7</source>
          (
          <year>2019</year>
          )
          <fpage>40797</fpage>
          -
          <lpage>40809</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref70">
        <mixed-citation>
          [70]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>CoLight: Learning network-level cooperation for traffic signal control</article-title>
          ,
          <source>in: Proceedings of the 28th ACM International Conference on Information and Knowledge Management</source>
          ,
          <source>CIKM '19</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1913</fpage>
          -
          <lpage>1922</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref71">
        <mixed-citation>
          [71]
          <string-name>
            <given-names>T.</given-names>
            <surname>Nishi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Otaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hayakawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yoshimura</surname>
          </string-name>
          ,
          <article-title>Traffic signal control based on reinforcement learning with graph convolutional neural nets</article-title>
          ,
          <source>in: Proceedings of the 2018 International Conference on Intelligent Transportation Systems, ITSC '18</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>877</fpage>
          -
          <lpage>883</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref72">
        <mixed-citation>
          [72]
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <article-title>Comparison of current practical adaptive traffic control systems</article-title>
          ,
          <source>in: Proceedings of the 10th International Conference of Chinese Transportation Professionals, ICCTP '10</source>
          ,
          <year>2010</year>
          , pp.
          <fpage>1611</fpage>
          -
          <lpage>1619</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref73">
        <mixed-citation>
          [73]
          <string-name>
            <given-names>D.</given-names>
            <surname>Gettman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. G.</given-names>
            <surname>Shelby</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Head</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Bullock</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Soyke</surname>
          </string-name>
          ,
          <article-title>Data-driven algorithms for real-time adaptive tuning of offsets in coordinated traffic signal systems</article-title>
          ,
          <source>Transportation Research Record</source>
          <year>2035</year>
          (
          <year>2007</year>
          )
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref74">
        <mixed-citation>
          [74]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Leslie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Balse</surname>
          </string-name>
          ,
          <source>Infrastructure Connectivity Certification Test Procedures for Infrastructure-Based Connected Automated Vehicle Components: Test Procedures, Signal Phase and Timing - NTCIP 1202 v03, Technical Report FHWA-JPO-20-802</source>
          , Leidos,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref75">
        <mixed-citation>
          [75]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Geng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Intelligent transportation control based on proactive complex event processing</article-title>
          ,
          <source>in: Proceedings of the 3rd International Conference on Mechanics and Mechatronics Research</source>
          , ICMMR '16
          ,
          <year>2016</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref76">
        <mixed-citation>
          [76]
          <string-name>
            <given-names>W.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <article-title>Distributed cooperative reinforcement learning-based traffic signal control that integrates V2X networks' dynamic clustering</article-title>
          ,
          <source>IEEE Transactions on Vehicular Technology</source>
          <volume>66</volume>
          (
          <year>2017</year>
          )
          <fpage>8667</fpage>
          -
          <lpage>8681</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref77">
        <mixed-citation>
          [77]
          <string-name>
            <given-names>S.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-S.</given-names>
            <surname>Wong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <article-title>Cooperative traffic signal control using multi-step return and off-policy asynchronous advantage actor-critic graph algorithm</article-title>
          ,
          <source>Knowledge-Based Systems</source>
          <volume>183</volume>
          (
          <year>2019</year>
          )
          <fpage>104855</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref78">
        <mixed-citation>
          [78]
          <string-name>
            <given-names>T.</given-names>
            <surname>Chu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Traffic signal control by distributed reinforcement learning with min-sum communication</article-title>
          ,
          <source>in: Proceedings of the 2017 American Control Conference, ACC '17</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>5095</fpage>
          -
          <lpage>5100</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref79">
        <mixed-citation>
          [79]
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Kok</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Vlassis</surname>
          </string-name>
          ,
          <article-title>Using the max-plus algorithm for multiagent decision making in coordination graphs</article-title>
          ,
          <source>in: Proceedings of the Fourth Robot Soccer World Cup, RoboCup '05</source>
          ,
          <year>2005</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref80">
        <mixed-citation>
          [80]
          <string-name>
            <given-names>D.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <article-title>IEDQN: Information exchange DQN with a centralized coordinator for traffic signal control</article-title>
          ,
          <source>in: Proceedings of the 2020 International Joint Conference on Neural Networks, IJCNN '20</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref81">
        <mixed-citation>
          [81]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <article-title>Multi-agent reinforcement learning for traffic signal control through universal communication method</article-title>
          ,
          <source>arXiv preprint arXiv:2204.12190</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref82">
        <mixed-citation>
          [82]
          <string-name>
            <given-names>M.</given-names>
            <surname>Abdoos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Bazzan</surname>
          </string-name>
          ,
          <article-title>Hierarchical traffic signal optimization using reinforcement learning and traffic prediction with long-short term memory</article-title>
          ,
          <source>Expert Systems with Applications</source>
          <volume>171</volume>
          (
          <year>2021</year>
          )
          <fpage>114580</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref83">
        <mixed-citation>
          [83]
          <string-name>
            <given-names>B.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <article-title>Hierarchically and cooperatively learning traffic signal control</article-title>
          ,
          <source>in: Proceedings of the 35th AAAI Conference on Artificial Intelligence</source>
          , AAAI '21
          ,
          <year>2021</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref84">
        <mixed-citation>
          [84]
          <string-name>
            <given-names>L.</given-names>
            <surname>Brunke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Greef</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. W.</given-names>
            <surname>Hall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Panerati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Schoellig</surname>
          </string-name>
          ,
          <article-title>Safe learning in robotics: From learning-based control to safe reinforcement learning</article-title>
          ,
          <source>Annual Review of Control, Robotics, and Autonomous Systems</source>
          <volume>5</volume>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref85">
        <mixed-citation>
          [85]
          <string-name>
            <given-names>J.</given-names>
            <surname>García</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Fernández</surname>
          </string-name>
          ,
          <article-title>A comprehensive survey on safe reinforcement learning</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>16</volume>
          (
          <year>2015</year>
          )
          <fpage>1437</fpage>
          -
          <lpage>1480</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref86">
        <mixed-citation>
          [86]
          <string-name>
            <given-names>C.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Nemati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <article-title>Reinforcement learning in healthcare: A survey</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>55</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>36</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref87">
        <mixed-citation>
          [87]
          <string-name>
            <given-names>F. R.</given-names>
            <surname>Ward</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Habli</surname>
          </string-name>
          ,
          <article-title>An assurance case pattern for the interpretability of machine learning in safety-critical systems</article-title>
          ,
          <source>in: Proceedings of the 2020 International Conference on Computer Safety, Reliability, and Security</source>
          ,
          <source>SAFECOMP '20</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>395</fpage>
          -
          <lpage>407</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref88">
        <mixed-citation>
          [88]
          <string-name>
            <surname>DOT</surname>
          </string-name>
          , Manual on Uniform Traffic Control Devices, revision 2 ed.,
          <source>US Department of Transportation</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref89">
        <mixed-citation>
          [89]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bonneson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pratt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zimmerman</surname>
          </string-name>
          ,
          <article-title>Development of a Traffic Signal Operations Handbook</article-title>
          ,
          <source>Technical Report FHWA/TX-09/0-5629-1</source>
          ,
          Texas A&amp;M Transportation Institute
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref90">
        <mixed-citation>
          [90]
          <string-name>
            <given-names>M. K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kusbit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kahng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. T.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>See</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Noothigattu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Psomas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. D.</given-names>
            <surname>Procaccia</surname>
          </string-name>
          ,
          <article-title>WeBuildAI: Participatory framework for algorithmic governance</article-title>
          ,
          <source>Proceedings of the ACM on Human-Computer Interaction</source>
          <volume>3</volume>
          (
          <year>2019</year>
          )
          <fpage>1</fpage>
          -
          <lpage>35</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref91">
        <mixed-citation>
          [91]
          <string-name>
            <given-names>D.</given-names>
            <surname>Lord</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Bonneson</surname>
          </string-name>
          ,
          <article-title>Role and application of accident modification factors within highway design process</article-title>
          ,
          <source>Transportation Research Record</source>
          <year>1961</year>
          (
          <year>2006</year>
          )
          <fpage>65</fpage>
          -
          <lpage>73</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref92">
        <mixed-citation>
          [92]
          <string-name>
            <given-names>L.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lord</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zou</surname>
          </string-name>
          ,
          <article-title>Validation of crash modification factors derived from cross-sectional studies with regression models</article-title>
          ,
          <source>Transportation Research Record</source>
          <volume>2514</volume>
          (
          <year>2015</year>
          )
          <fpage>88</fpage>
          -
          <lpage>96</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref93">
        <mixed-citation>
          [93]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Fontaine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <article-title>Estimation of crash modification factors for an adaptive traffic-signal control system</article-title>
          ,
          <source>Journal of Transportation Engineering</source>
          <volume>142</volume>
          (
          <year>2016</year>
          )
          <fpage>04016061</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref94">
        <mixed-citation>
          [94]
          <string-name>
            <given-names>X.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Magri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. H.</given-names>
            <surname>Shirazi</surname>
          </string-name>
          ,
          <article-title>Application of Highway Safety Manual draft chapter: Louisiana experience</article-title>
          ,
          <source>Transportation Research Record</source>
          <year>1950</year>
          (
          <year>2006</year>
          )
          <fpage>55</fpage>
          -
          <lpage>64</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref95">
        <mixed-citation>
          [95]
          <string-name>
            <given-names>C.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Edara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Carlos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Nam</surname>
          </string-name>
          ,
          <article-title>Calibration of the Highway Safety Manual for Missouri</article-title>
          ,
          <source>Technical Report 25-1121-0003-177</source>
          , Mid-America Transportation Center,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref96">
        <mixed-citation>
          [96]
          <string-name>
            <given-names>F.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Gladhill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. K.</given-names>
            <surname>Dixon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Monsere</surname>
          </string-name>
          ,
          <article-title>Calibration of Highway Safety Manual predictive models for Oregon state highways</article-title>
          ,
          <source>Transportation Research Record</source>
          <volume>2241</volume>
          (
          <year>2011</year>
          )
          <fpage>19</fpage>
          -
          <lpage>28</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref97">
        <mixed-citation>
          [97]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ault</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Hanna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sharon</surname>
          </string-name>
          ,
          <article-title>Learning an interpretable traffic signal control policy</article-title>
          ,
          <source>in: Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems</source>
          , AAMAS '20
          ,
          <year>2020</year>
          , pp.
          <fpage>88</fpage>
          -
          <lpage>96</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref98">
        <mixed-citation>
          [98]
          <string-name>
            <given-names>V.</given-names>
            <surname>Jayawardana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Landler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Mixed autonomous supervision in traffic signal control</article-title>
          ,
          <source>in: Proceedings of the 2021 International Conference on Intelligent Transportation Systems, ITSC '21</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>1767</fpage>
          -
          <lpage>1773</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref99">
        <mixed-citation>
          [99]
          <string-name>
            <given-names>S. G.</given-names>
            <surname>Rizzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Vantini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chawla</surname>
          </string-name>
          ,
          <article-title>Reinforcement learning with explainability for traffic signal control</article-title>
          ,
          <source>in: Proceedings of the 2019 International Conference on Intelligent Transportation Systems, ITSC '19</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>3567</fpage>
          -
          <lpage>3572</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref100">
        <mixed-citation>
          [100]
          <string-name>
            <given-names>A.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Rangras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Schnittker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Waldmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Friesen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ferfers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Schreckenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hufen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jasperneite</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiering</surname>
          </string-name>
          ,
          <article-title>Towards real-world deployment of reinforcement learning for traffic signal control</article-title>
          ,
          <source>in: Proceedings of the 20th IEEE International Conference on Machine Learning and Applications, ICMLA '21</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>507</fpage>
          -
          <lpage>514</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref101">
        <mixed-citation>
          [101]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-P.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Intelligent traffic light control using distributed multi-agent Q learning</article-title>
          ,
          <source>in: Proceedings of the 2017 International Conference on Intelligent Transportation Systems, ITSC '17</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref102">
        <mixed-citation>
          [102]
          <string-name>
            <given-names>M.</given-names>
            <surname>Essa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Sayed</surname>
          </string-name>
          ,
          <article-title>Self-learning adaptive traffic signal control for real-time safety optimization</article-title>
          ,
          <source>Accident Analysis &amp; Prevention</source>
          <volume>146</volume>
          (
          <year>2020</year>
          )
          <fpage>105713</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref103">
        <mixed-citation>
          [103]
          <string-name>
            <given-names>M.</given-names>
            <surname>Essa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Sayed</surname>
          </string-name>
          ,
          <article-title>Traffic conflict models to evaluate the safety of signalized intersections at the cycle level</article-title>
          ,
          <source>Transportation Research Part C: Emerging Technologies</source>
          <volume>89</volume>
          (
          <year>2018</year>
          )
          <fpage>289</fpage>
          -
          <lpage>302</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref104">
        <mixed-citation>
          [104]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Abdel-Aty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <article-title>Multi-objective reinforcement learning approach for improving safety at intersections with adaptive traffic signal control</article-title>
          ,
          <source>Accident Analysis &amp; Prevention</source>
          <volume>144</volume>
          (
          <year>2020</year>
          )
          <fpage>105655</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref105">
        <mixed-citation>
          [105]
          <string-name>
            <given-names>L.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Zou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Time difference penalized traffic signal timing by LSTM Q-network to balance safety and capacity at intersections</article-title>
          ,
          <source>IEEE Access</source>
          <volume>8</volume>
          (
          <year>2020</year>
          )
          <fpage>80086</fpage>
          -
          <lpage>80096</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref106">
        <mixed-citation>
          [106]
          <string-name>
            <given-names>B.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <article-title>Smarter and safer traffic signal controlling via deep reinforcement learning</article-title>
          ,
          <source>in: Proceedings of the 29th ACM International Conference on Information and Knowledge Management</source>
          ,
          <source>CIKM '20</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>3345</fpage>
          -
          <lpage>3348</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref107">
        <mixed-citation>
          [107]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bohez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Abdolmaleki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Neunert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Buchli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Heess</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hadsell</surname>
          </string-name>
          ,
          <article-title>Value constrained model-free continuous control</article-title>
          , arXiv preprint (
          <year>2019</year>
          ). arXiv:1902.04623.
        </mixed-citation>
      </ref>
      <ref id="ref108">
        <mixed-citation>
          [108]
          <string-name>
            <given-names>D.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Basar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jovanovic</surname>
          </string-name>
          ,
          <article-title>Natural policy gradient primal-dual method for constrained Markov decision processes</article-title>
          ,
          <source>in: Proceedings of the 34th International Conference on Neural Information Processing Systems</source>
          ,
          <source>NeurIPS '20</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>8378</fpage>
          -
          <lpage>8390</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref109">
        <mixed-citation>
          [109]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Cen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Isenbaev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z. S.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>Constrained variational policy optimization for safe reinforcement learning</article-title>
          ,
          <source>in: Proceedings of the 39th International Conference on Machine Learning, ICML '22</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref110">
        <mixed-citation>
          [110]
          <string-name>
            <given-names>D.</given-names>
            <surname>Branston</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>van Zuylen</surname>
          </string-name>
          ,
          <article-title>Comparison of queue-length models at signalized intersections</article-title>
          ,
          <source>Transportation Research</source>
          <volume>12</volume>
          (
          <year>1978</year>
          )
          <fpage>47</fpage>
          -
          <lpage>53</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref111">
        <mixed-citation>
          [111]
          <string-name>
            <given-names>F.</given-names>
            <surname>Viloria</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Courage</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Avery</surname>
          </string-name>
          ,
          <article-title>Comparison of queue-length models at signalized intersections</article-title>
          ,
          <source>Transportation Research Record</source>
          <volume>1710</volume>
          (
          <year>2000</year>
          )
          <fpage>222</fpage>
          -
          <lpage>230</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref112">
        <mixed-citation>
          [112]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>CityFlow: A multi-agent reinforcement learning environment for large scale city traffic scenario</article-title>
          ,
          <source>in: Proceedings of the 2019 World Wide Web Conference, WWW '19</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>3620</fpage>
          -
          <lpage>3624</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref113">
        <mixed-citation>
          [113]
          <string-name>
            <given-names>M.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-Y.</given-names>
            <surname>Chan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Askary</surname>
          </string-name>
          ,
          <article-title>A reinforcement learning approach for intelligent traffic signal control at urban intersections</article-title>
          ,
          <source>in: Proceedings of the 2019 International Conference on Intelligent Transportation Systems, ITSC '19</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>4242</fpage>
          -
          <lpage>4247</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref114">
        <mixed-citation>
          [114]
          <string-name>
            <given-names>A.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Smaglik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kothuri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Koonce</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>Leading pedestrian intervals: Treating the decision to implement as a marginal benefit-cost problem</article-title>
          ,
          <source>Transportation Research Record</source>
          <volume>2620</volume>
          (
          <year>2017</year>
          )
          <fpage>96</fpage>
          -
          <lpage>104</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref115">
        <mixed-citation>
          [115]
          <string-name>
            <given-names>K.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Boltze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Nakamura</surname>
          </string-name>
          ,
          <article-title>Initial comparative analysis of international practice in road traffic signal control</article-title>
          ,
          <source>in: Global Practices on Road Traffic Signal Control, Elsevier</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>285</fpage>
          -
          <lpage>310</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref116">
        <mixed-citation>
          [116]
          <string-name>
            <given-names>S.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <article-title>Surtrac for the People: Upgrading the Surtrac Pittsburgh Deployment to incorporate Pedestrian Friendly Extensions and Remote Monitoring Advances</article-title>
          ,
          <source>Technical Report 01730614, Mobility21</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref117">
        <mixed-citation>
          [117]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kothuri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kading</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Smaglik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sobie</surname>
          </string-name>
          ,
          <article-title>Improving Walkability Through Control Strategies at Signalized Intersections</article-title>
          ,
          <source>Technical Report NITC-RR-782, National Institute for Transportation and Communities</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref118">
        <mixed-citation>
          [118]
          <string-name>
            <given-names>C.</given-names>
            <surname>Slavin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Figliozzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Koonce</surname>
          </string-name>
          ,
          <article-title>Statistical study of the impact of adaptive traffic signal control on traffic and transit performance</article-title>
          ,
          <source>Transportation Research Record</source>
          <volume>2356</volume>
          (
          <year>2016</year>
          )
          <fpage>117</fpage>
          -
          <lpage>126</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref119">
        <mixed-citation>
          [119]
          <string-name>
            <given-names>J.</given-names>
            <surname>Peters</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>O'Brien</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pachman</surname>
          </string-name>
          ,
          <article-title>Memorandum: Farmington Road Adaptive Traffic Control Benefits Analysis</article-title>
          ,
          <source>Technical Report</source>
          , DKS Associates,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref120">
        <mixed-citation>
          [120]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mahendran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hebert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.-F.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <article-title>Bus Detection for Adaptive Traffic Signal Control</article-title>
          ,
          <source>Technical Report</source>
          , Carnegie Mellon University,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref121">
        <mixed-citation>
          [121]
          <string-name>
            <given-names>S.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Isukapati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Bronstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Igoe</surname>
          </string-name>
          ,
          <article-title>Integrating transit signal priority with adaptive signal control in a connected vehicle environment: Phase 1 Final Report</article-title>
          ,
          <source>Technical Report 01675986, Mobility21</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref122">
        <mixed-citation>
          [122]
          <string-name>
            <given-names>B.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Menendez</surname>
          </string-name>
          ,
          <article-title>A reinforcement learning method for traffic signal control at an isolated intersection with pedestrian flows</article-title>
          ,
          <source>in: Proceedings of the 19th COTA International Conference of Transportation Professionals, CICTP '19</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>3123</fpage>
          -
          <lpage>3135</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref123">
        <mixed-citation>
          [123]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fricker</surname>
          </string-name>
          ,
          <article-title>Investigating smart traffic signal controllers at signalized crosswalks: A reinforcement learning approach</article-title>
          ,
          <source>in: Proceedings of the 7th International Conference on Models and Technologies for Intelligent Transportation Systems, MT-ITS '21</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref124">
        <mixed-citation>
          [124]
          <string-name>
            <given-names>P.</given-names>
            <surname>Chanloha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chinrungrueng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Usaha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Aswakul</surname>
          </string-name>
          ,
          <article-title>Cell transmission model-based multiagent Q-learning for network-scale signal control with transit priority</article-title>
          ,
          <source>The Computer Journal</source>
          <volume>57</volume>
          (
          <year>2014</year>
          )
          <fpage>451</fpage>
          -
          <lpage>468</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref125">
        <mixed-citation>
          [125]
          <string-name>
            <given-names>S. M. A.</given-names>
            <surname>Shabestray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Abdulhai</surname>
          </string-name>
          ,
          <article-title>Multimodal iNtelligent Deep (MiND) traffic signal controller</article-title>
          ,
          <source>in: Proceedings of the 2019 International Conference on Intelligent Transportation Systems, ITSC '19</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>4532</fpage>
          -
          <lpage>4539</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref126">
        <mixed-citation>
          [126]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Schedule-driven signal priority control for modern trams using reinforcement learning</article-title>
          ,
          <source>in: Proceedings of the 17th COTA International Conference of Transportation Professionals, CICTP '17</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>2122</fpage>
          -
          <lpage>2132</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref127">
        <mixed-citation>
          [127]
          <string-name>
            <given-names>G.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>An integrated MPC and deep reinforcement learning approach to trams-priority active signal control</article-title>
          ,
          <source>Control Engineering Practice</source>
          <volume>110</volume>
          (
          <year>2021</year>
          )
          <fpage>104758</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref128">
        <mixed-citation>
          [128]
          <string-name>
            <given-names>N.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Rahman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Dhakad</surname>
          </string-name>
          ,
          <article-title>Fuzzy inference enabled deep reinforcement learning-based traffic light control for intelligent transportation system</article-title>
          ,
          <source>IEEE Transactions on Intelligent Transportation Systems</source>
          <volume>22</volume>
          (
          <year>2021</year>
          )
          <fpage>4919</fpage>
          -
          <lpage>4928</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref129">
        <mixed-citation>
          [129]
          <string-name>
            <given-names>H.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. D.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Dey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <article-title>EMVLight: A decentralized reinforcement learning framework for efficient passage of emergency vehicles</article-title>
          ,
          <source>in: Proceedings of the 36th AAAI Conference on Artificial Intelligence, AAAI '22</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>11</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref130">
        <mixed-citation>
          [130]
          <string-name>
            <given-names>H.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <article-title>Dynamic queue-jump lane for emergency vehicles under partially connected settings: A multi-agent deep reinforcement learning approach</article-title>
          , arXiv preprint (
          <year>2021</year>
          ). arXiv:2003.01025.
        </mixed-citation>
      </ref>
      <ref id="ref131">
        <mixed-citation>
          [131]
          <string-name>
            <given-names>N.</given-names>
            <surname>Nahar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Kästner</surname>
          </string-name>
          ,
          <article-title>Collaboration challenges in building ML-enabled systems: Communication, documentation, engineering, and process</article-title>
          ,
          <source>in: Proceedings of the 44th International Conference on Software Engineering, ICSE '22</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>22</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref132">
        <mixed-citation>
          [132]
          <string-name>
            <given-names>J.</given-names>
            <surname>Pineau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Vincent-Lamarre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Sinha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Larivière</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Beygelzimer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>d'Alché-Buc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Fox</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Larochelle</surname>
          </string-name>
          ,
          <article-title>Improving reproducibility in machine learning research (a report from the NeurIPS 2019 Reproducibility Program)</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>22</volume>
          (
          <year>2021</year>
          )
          <fpage>1</fpage>
          -
          <lpage>20</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>