GESA: A GEneral Scenario-Agnostic Reinforcement Learning for Traffic Signal Control

Haoyuan Jiang1, Ziyue Li2,*, Zhishuai Li1, Lei Bai3, Hangyu Mao1, Wolfgang Ketter2 and Rui Zhao1
1 Sensetime Research, China
2 University of Cologne, Germany
3 Shanghai AI Lab, China

Abstract
Reinforcement learning (RL) can automatically learn a better policy through a trial-and-error paradigm and has been adopted to revolutionize and optimize traditional traffic signal control systems, which are usually based on handcrafted methods. However, most existing RL-based models are trained on either a single scenario or multiple independent scenarios, where each scenario has a separate simulation environment with a predefined road network topology and traffic signal settings. These models train and test in the same scenario, so they are strictly tied to that specific setting and sacrifice model generalization heavily. While a few recent models can be trained on multiple scenarios, they require a huge amount of manual labor to label the intersection structure, hindering the model's generalization. In this work, we aim at a general framework that eliminates heavy labeling and models a variety of scenarios simultaneously. To this end, we propose a GEneral Scenario-Agnostic (GESA) reinforcement learning framework for traffic signal control with: (1) a general plug-in module that maps all different intersections into a unified structure, freeing us from the heavy manual labor of specifying the structure of intersections; (2) a unified state and action space design that keeps the model input and output consistently structured; (3) large-scale co-training with multiple scenarios, leading to a generic traffic signal control algorithm.
GESA can automatically handle intersections of various structures from various cities without human labeling, and it co-trains a generalist agent to control traffic signals for multiple cities together; the agent also demonstrates superior transferability in zero-shot settings. In experiments, we demonstrate that our algorithm is the first that can be co-trained on seven different scenarios without manual annotation, achieving 13.27% higher rewards than baselines. When dealing with a new scenario, our model still achieves 9.39% higher rewards. The code, scenarios, and demos are available here. The full paper is available at [1].

Keywords
Traffic signal control, Reinforcement learning, A generalist agent, Zero-shot transfer

STRL'24: Third International Workshop on Spatio-Temporal Reasoning and Learning, 5 August 2024, Jeju, South Korea
* Corresponding author.
$ jianghaoyuan@zju.edu.cn (H. Jiang); zlibn@wiso.uni-koeln.de (Z. Li)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

Reinforcement learning (RL) [2, 3, 4] has been preferably adopted in the TSC domain since it is a learning-based method with a higher degree of automation. This trial-and-error paradigm, based on a traffic simulator, has demonstrated better performance than transport-engineering-based methods [5]. Recent RL-based TSC models can be roughly divided into two categories based on the scenarios in which training and testing are conducted, where a scenario is usually a simulation environment that contains a set of intersections: (1) Single-scenario RL, the majority, whose training and testing must be on the same scenario [2, 6, 7]. However, such a model will be unusable or perform badly in a new scenario. For example, in Fig. 1(b) top, these methods might be trained and tested in the same scenario with 5 × 5 four-approach intersections as in Fig. 1(a2), but they will ill-perform in a new scenario with mixed intersections of Fig. 1(a1) and 1(a2). (2) Multi-scenario RL, as shown in Fig. 1(b) bottom, where training is conducted in multiple scenarios and testing can be in different scenarios. For example, [8, 9, 10] train a TSC system with multiple scenarios. However, in the training stage, the existing multi-scenario RL models need heavy manual labor to annotate the structure of intersections, such as the direction of each entering approach, the number of entering lanes of each entering approach, and the traffic movement of each entering lane. Moreover, they either achieve multi-scenario co-training in a sequential manner, one scenario after another, leading to a rather unstable learning curve and slower convergence [8, 10], or narrow the scale of a scenario to only one intersection, which heavily limits the model's generalization [9].

Moreover, current RL-based TSC methods are trained with several pre-defined and fixed scenarios and cannot gain generalization capability without labeling, which limits the application of RL-based methods in the real world. These methods can exploit the various traffic flows generated by the simulator to become effective in the training scenarios, but finding a low-cost, universal method with promising transferability remains a research gap. As a result, the existing methods still face tremendous challenges in jumping out of the simulation and being deployed in real cities. This is known as the sim2real challenge. The challenge mainly comes from the wide gap between complex real cities and simplified simulation systems. In the real world, the intersection structure can be rather versatile in terms of different settings of approaches (i.e., north, south, east, west), movements (i.e., left, right, through), and lanes (e.g., two through lanes, one right-through lane). As shown in Fig. 1(a), an intersection can have a different number of approaches; within an approach, there can be different combinations of movements; and a lane can combine several movements. However, most of the existing methods only consider a standard simulated intersection with four approaches and three lanes (right, through, and left) within each approach. This largely limits the model's generalization.

To conclude, a qualified TSC approach needs high generalization and effectiveness: it should handle various intersections and be able to transfer to other unseen targets easily and at low cost. In this paper, we aim to answer three questions: (1) How do we co-train an RL model with multiple scenarios without labeling, given the diverse intersection structures? (2) Will multi-scenario co-training improve TSC? If yes, why? And is more scenarios always better? (3) Does the co-trained RL model still perform well in a new scenario?

[Figure 1: (a) Three intersections with different structures in terms of approach, movement, and lane: (a1) an intersection of three approaches, with 2-4 entering lanes and 2 movements on each approach; (a2) an intersection of four approaches, with 1 entering lane of right-through-left movement on each approach; (a3) an intersection of four approaches, with 4 entering lanes and 3 movements, yet with an arbitrary angle. (b) Single-scenario RL vs. multi-scenario RL.]

To narrow the sim2real gap significantly and get closer to deployment in real cities, in this paper we provide a GEneral Scenario-Agnostic (GESA) reinforcement learning framework for the TSC task. To the best of our knowledge, GESA is the first work that pursues high generalizability and co-trains on multiple scenarios without labels: it automatically handles various scenarios; the reinforcement learning is designed accordingly to achieve generalization; and it is co-trained on multiple scenarios simultaneously, demonstrating high transferability. Specifically, to co-train in multiple scenarios with various intersections, vectors carrying approach spatial information are employed to map odd-shaped and complex intersections into the standard intersection. The mapped intersections are then used to generate the characteristic information of each traffic movement and the phase of the traffic lights in a specific order. Finally, we extend the original FRAP [6] to a policy-gradient-based framework, which facilitates model convergence and is compatible with different intersections.

The contributions are summarized in three folds: (1) We present a general plug-in module to map intersections into a unified structure, freeing us from the heavy manual labeling work of specifying the intersection structure and enabling large-scale co-training under multiple different scenarios. (2) Accordingly, we design a unified state and action space to keep the model's input and output structure consistent for more general capability. Moreover, GESA can adapt to various unseen scenarios and achieve promising performance without re-training. (3) We build two real-world scenarios using real city road maps and real traffic dynamics, together with five public scenarios, on which we co-train and validate GESA with prudent experiments. All of these lead us closer to the ultimate goal: implementing RL-based TSC in real cities.

References

[1] H. Jiang, Z. Li, Z. Li, L. Bai, H. Mao, W. Ketter, R. Zhao, A general scenario-agnostic reinforcement learning for traffic signal control, IEEE Transactions on Intelligent Transportation Systems (2024) 1–15.
[2] E. Van der Pol, F. A. Oliehoek, Coordinated deep reinforcement learners for traffic light control, Proceedings of Learning, Inference and Control of Multi-agent Systems (at NIPS 2016) 1 (2016).
[3] H. Wei, G. Zheng, H. Yao, Z. Li, IntelliLight: A reinforcement learning approach for intelligent traffic light control, in: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 2496–2505.
[4] H. Wei, C. Chen, G. Zheng, K. Wu, V. Gayah, K. Xu, Z. Li, PressLight: Learning max pressure control to coordinate traffic signals in arterial network, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019, pp. 1290–1298.
[5] K.-L. A. Yau, J. Qadir, H. L. Khoo, M. H. Ling, P. Komisarczuk, A survey on reinforcement learning models and algorithms for traffic signal control, ACM Computing Surveys (CSUR) 50 (2017) 1–38.
[6] G. Zheng, Y. Xiong, X. Zang, J. Feng, H. Wei, H. Zhang, Y. Li, K. Xu, Z. Li, Learning phase competition for traffic signal control, in: Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 2019, pp. 1963–1972.
[7] L. Zhu, P. Peng, Z. Lu, Y. Tian, MTLight: Efficient multi-task reinforcement learning for traffic signal control, in: ICLR 2022 Workshop on Gamification and Multiagent Solutions, 2022.
[8] X. Zang, H. Yao, G. Zheng, N. Xu, K. Xu, Z. Li, MetaLight: Value-based meta-reinforcement learning for traffic signal control, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 2020, pp. 1153–1160.
[9] A. Oroojlooy, M. Nazari, D. Hajinezhad, J. Silva, AttendLight: Universal attention-based reinforcement learning model for traffic signal control, Advances in Neural Information Processing Systems 33 (2020) 4079–4090.
[10] M. Wang, Y. Xu, X. Xiong, Y. Kan, C. Xu, M.-O. Pun, ADLight: A universal approach of traffic signal control with augmented data using reinforcement learning, arXiv preprint arXiv:2210.13378 (2022).
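To make the plug-in mapping idea concrete, the following is a minimal, hypothetical sketch of how an arbitrary intersection might be normalized into a standard four-approach layout using approach direction vectors. This is not the paper's implementation: the function names (`angle_of`, `map_to_standard`), the greedy nearest-slot assignment, and the North/East/South/West slot convention are all illustrative assumptions; GESA's actual mapping module is described in the full paper [1].

```python
import math

# Illustrative sketch (not GESA's actual code): each approach of an
# intersection is described by a 2-D vector of its incoming direction,
# and is assigned to the nearest free canonical slot of a standard
# four-approach intersection. Slots left unfilled represent absent
# approaches (e.g., a three-way intersection).

CANONICAL = {"N": 90.0, "E": 0.0, "S": 270.0, "W": 180.0}  # slot -> angle (deg)

def angle_of(vec):
    """Angle of a 2-D approach vector in degrees, normalized to [0, 360)."""
    return math.degrees(math.atan2(vec[1], vec[0])) % 360.0

def map_to_standard(approach_vectors):
    """Greedily assign each approach to the closest free canonical slot."""
    slots = dict.fromkeys(CANONICAL)  # slot -> approach index, or None
    for idx, vec in enumerate(approach_vectors):
        a = angle_of(vec)
        free = [s for s in CANONICAL if slots[s] is None]
        # circular angular distance to each free canonical direction
        best = min(free, key=lambda s: min(abs(a - CANONICAL[s]),
                                           360.0 - abs(a - CANONICAL[s])))
        slots[best] = idx
    return slots

# A three-approach intersection whose approaches point roughly E, N, W:
three_way = [(1.0, 0.1), (0.05, 1.0), (-1.0, 0.0)]
print(map_to_standard(three_way))  # -> {'N': 1, 'E': 0, 'S': None, 'W': 2}
```

Once every intersection is expressed in this slot form, state vectors (per-movement features) and action spaces (phases over the canonical movements) can share one fixed layout across scenarios, which is the property the unified state and action design above relies on. A greedy assignment like this is order-dependent; a production mapping would need a globally optimal matching.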