GESA: A GEneral Scenario-Agnostic Reinforcement Learning for Traffic Signal Control

Haoyuan Jiang1, Ziyue Li2,*, Zhishuai Li1, Lei Bai3, Hangyu Mao1, Wolfgang Ketter2 and Rui Zhao1
1 Sensetime Research, China
2 University of Cologne, Germany
3 Shanghai AI Lab, China

Abstract
Reinforcement learning (RL) can automatically learn a better policy through a trial-and-error paradigm and has been adopted to revolutionize and optimize traditional traffic signal control systems, which are usually based on handcrafted methods. However, most existing RL-based models are trained on either a single scenario or multiple independent scenarios, where each scenario has a separate simulation environment with a predefined road network topology and traffic signal settings. These models train and test in the same scenario, so they are strictly tied to that specific setting and sacrifice model generalization heavily. While a few recent models can be trained on multiple scenarios, they require a huge amount of manual labor to label the intersection structure, hindering the model's generalization. In this work, we aim at a general framework that eliminates heavy labeling and models a variety of scenarios simultaneously. To this end, we propose a GEneral Scenario-Agnostic (GESA) reinforcement learning framework for traffic signal control with: (1) a general plug-in module that maps all different intersections into a unified structure, freeing us from the heavy manual labor of specifying the structure of intersections; (2) a unified state and action space design that keeps the model input and output consistently structured; (3) large-scale co-training with multiple scenarios, leading to a generic traffic signal control algorithm.
GESA can automatically handle intersections of various structures from various cities without human labeling, and it co-trains a generalist agent to control traffic signals for multiple cities together; the agent also demonstrates superior transferability in zero-shot settings. In experiments, we demonstrate that our algorithm is the first that can be co-trained on seven different scenarios without manual annotation, achieving 13.27% higher rewards than baselines. When dealing with a new scenario, our model still achieves 9.39% higher rewards. The code, scenarios, and demos are available here. The full paper is available at [1].

Keywords
Traffic signal control, Reinforcement learning, A generalist agent, Zero-shot transfer

STRL'24: Third International Workshop on Spatio-Temporal Reasoning and Learning, 5 August 2024, Jeju, South Korea
* Corresponding author.
$ jianghaoyuan@zju.edu.cn (H. Jiang); zlibn@wiso.uni-koeln.de (Z. Li)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

Reinforcement learning (RL) [2, 3, 4] has been preferably adopted in the TSC domain since it is a learning-based method with a higher degree of automation. This trial-and-error paradigm, based on a traffic simulator, has demonstrated better performance than transport-engineering-based methods [5]. Recent RL-based TSC models can be roughly divided into two categories based on the scenarios in which training and testing are conducted, where a scenario is usually a simulation environment that contains a set of intersections: (1) Single-scenario RL, the majority, whose training and testing must be on the same scenario [2, 6, 7]. However, such a model will be unusable or perform badly in a new scenario. For example, in Fig. 1(b) top, these methods might be trained and tested in the same scenario with 5 × 5 four-approach intersections as in Fig. 1(a2), but they will ill-perform in a new scenario with mixed intersections of Fig. 1(a1) and 1(a2). (2) Multi-scenario RL, as shown in Fig. 1(b) bottom, where training is conducted in multiple scenarios and testing can be in different scenarios. For example, [8, 9, 10] train a TSC system with multiple scenarios. However, in the training stage, the existing multi-scenario RL models need heavy manual labor to annotate the structure of intersections, such as the direction of each entering approach, the number of entering lanes of each entering approach, and the traffic movement of each entering lane. Moreover, they either achieve multi-scenario co-training in a sequential manner, one scenario after another, leading to a rather unstable learning curve and slower convergence [8, 10], or narrow the scale of a scenario to only one intersection, which heavily limits the model's generalization [9].

Moreover, current RL-based TSC methods are trained with several pre-defined and fixed scenarios and cannot gain generalization capability without labeling, which limits the application of RL-based methods in the real world. These methods can exploit the various traffic flows generated by the simulator to become effective in the training scenarios, but finding a low-cost, universal method with promising transferability remains a research gap. As a result, the existing methods still face tremendous challenges in jumping out of the simulation and being deployed in real cities. This is known as the sim2real challenge. The challenge mainly comes from the wide gap between complex real cities and simplified simulation systems. In the real world, the intersection structure can be rather versatile in terms of different settings of approaches (i.e., north, south, east, west), movements (i.e., left, right, through), and lanes (e.g., two through lanes, one right-through lane). As shown in Fig. 1(a), an intersection can have a different number of approaches; within an approach, there can be different combinations of movements; and a lane can combine several movements. However, most of the existing methods only consider a standard simulated intersection with four approaches and three lanes (right, through, and left) within each approach. This largely limits the model's generalization.

To conclude, a qualified TSC approach needs high generalization and effectiveness: it should handle various intersections and be able to transfer to other unseen targets easily and at low cost. In this paper, we aim to answer three questions: (1) How do we co-train an RL model with multiple scenarios without labeling, given the diverse intersection structures? (2) Will multi-scenario co-training improve TSC? If yes, why? And is more scenarios always better? (3) Does the co-trained RL model still perform well in a new scenario?

[Figure 1: (a) Three intersections with different structures in terms of approach, movement, and lane: (a1) an intersection of three approaches, with 2-4 entering lanes and 2 movements on each approach; (a2) an intersection of four approaches, with 1 entering lane of right-through-left movement on each approach; (a3) an intersection of four approaches, with 4 entering lanes and 3 movements, yet with an arbitrary angle. (b) Single-scenario RL vs. multi-scenario RL.]

To narrow the sim2real gap significantly and get closer to deployment in real cities, in this paper we provide a GEneral Scenario-Agnostic (GESA) reinforcement learning framework for the TSC task. To the best of our knowledge, GESA is the first work that pursues high generalizability and co-trains on multiple scenarios without labels: it automatically handles various scenarios; the reinforcement learning is designed accordingly to achieve generalization; and it is co-trained on multiple scenarios simultaneously, demonstrating high transferability. Specifically, to co-train in multiple scenarios with various intersections, vectors carrying approach spatial information are employed to map odd-shaped and complex intersections into the standard intersection. The mapped intersections are then used to generate the characteristic information of each traffic movement and the phase of the traffic lights in a specific order. Finally, we extend the original FRAP [6] to a policy-gradient-based framework, which facilitates model convergence and is compatible with different intersections.

The contributions are summarized in three folds: (1) We present a general plug-in module to map intersections into a unified structure, freeing us from the heavy manual labeling work of specifying the intersection structure and enabling large-scale co-training under multiple different scenarios. (2) Accordingly, we design a unified state and action space to keep the model's input and output structure consistent for more general capability. Moreover, GESA can adapt to various unseen scenarios and achieve promising performance without re-training. (3) We build two real-world scenarios using real city road maps and real traffic dynamics, together with five public scenarios, on which we co-train and validate GESA with prudent experiments. All of these lead us closer to the ultimate goal: implementing RL-based TSC in real cities.

References

[1] H. Jiang, Z. Li, Z. Li, L. Bai, H. Mao, W. Ketter, R. Zhao, A general scenario-agnostic reinforcement learning for traffic signal control, IEEE Transactions on Intelligent Transportation Systems (2024) 1–15.
[2] E. Van der Pol, F. A. Oliehoek, Coordinated deep reinforcement learners for traffic light control, Proceedings of Learning, Inference and Control of Multi-agent Systems (at NIPS 2016) 1 (2016).
[3] H. Wei, G. Zheng, H. Yao, Z. Li, IntelliLight: A reinforcement learning approach for intelligent traffic light control, in: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 2496–2505.
[4] H. Wei, C. Chen, G. Zheng, K. Wu, V. Gayah, K. Xu, Z. Li, PressLight: Learning max pressure control to coordinate traffic signals in arterial network, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019, pp. 1290–1298.
[5] K.-L. A. Yau, J. Qadir, H. L. Khoo, M. H. Ling, P. Komisarczuk, A survey on reinforcement learning models and algorithms for traffic signal control, ACM Computing Surveys (CSUR) 50 (2017) 1–38.
[6] G. Zheng, Y. Xiong, X. Zang, J. Feng, H. Wei, H. Zhang, Y. Li, K. Xu, Z. Li, Learning phase competition for traffic signal control, in: Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 2019, pp. 1963–1972.
[7] L. Zhu, P. Peng, Z. Lu, Y. Tian, MTLight: Efficient multi-task reinforcement learning for traffic signal control, in: ICLR 2022 Workshop on Gamification and Multiagent Solutions, 2022.
[8] X. Zang, H. Yao, G. Zheng, N. Xu, K. Xu, Z. Li, MetaLight: Value-based meta-reinforcement learning for traffic signal control, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 2020, pp. 1153–1160.
[9] A. Oroojlooy, M. Nazari, D. Hajinezhad, J. Silva, AttendLight: Universal attention-based reinforcement learning model for traffic signal control, Advances in Neural Information Processing Systems 33 (2020) 4079–4090.
[10] M. Wang, Y. Xu, X. Xiong, Y. Kan, C. Xu, M.-O. Pun, ADLight: A universal approach of traffic signal control with augmented data using reinforcement learning, arXiv preprint arXiv:2210.13378 (2022).
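To make the plug-in mapping idea concrete, the following is a minimal, hypothetical sketch of how an arbitrary intersection might be normalized into a standard four-approach layout using approach direction vectors. This is not the paper's implementation: the function names (`angle_of`, `map_to_standard`), the greedy nearest-slot assignment, and the North/East/South/West slot convention are all illustrative assumptions; GESA's actual mapping module is described in the full paper [1].

```python
import math

# Illustrative sketch (not GESA's actual code): each approach of an
# intersection is described by a 2-D vector of its incoming direction,
# and is assigned to the nearest free canonical slot of a standard
# four-approach intersection. Slots left unfilled represent absent
# approaches (e.g., a three-way intersection).

CANONICAL = {"N": 90.0, "E": 0.0, "S": 270.0, "W": 180.0}  # slot -> angle (deg)

def angle_of(vec):
    """Angle of a 2-D approach vector in degrees, normalized to [0, 360)."""
    return math.degrees(math.atan2(vec[1], vec[0])) % 360.0

def map_to_standard(approach_vectors):
    """Greedily assign each approach to the closest free canonical slot."""
    slots = dict.fromkeys(CANONICAL)  # slot -> approach index, or None
    for idx, vec in enumerate(approach_vectors):
        a = angle_of(vec)
        free = [s for s in CANONICAL if slots[s] is None]
        # circular angular distance to each free canonical direction
        best = min(free, key=lambda s: min(abs(a - CANONICAL[s]),
                                           360.0 - abs(a - CANONICAL[s])))
        slots[best] = idx
    return slots

# A three-approach intersection whose approaches point roughly E, N, W:
three_way = [(1.0, 0.1), (0.05, 1.0), (-1.0, 0.0)]
print(map_to_standard(three_way))  # -> {'N': 1, 'E': 0, 'S': None, 'W': 2}
```

Once every intersection is expressed in this slot form, state vectors (per-movement features) and action spaces (phases over the canonical movements) can share one fixed layout across scenarios, which is the property the unified state and action design above relies on. A greedy assignment like this is order-dependent; a production mapping would need a globally optimal matching.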