Reinforcement Learning for Optimization of COVID-19 Mitigation Policies

Varun Kompella*1, Roberto Capobianco*1,2, Stacy Jong3, Jonathan Browne3, Spencer Fox3, Lauren Meyers3, Peter Wurman1, Peter Stone1,3
1 Sony AI  2 Sapienza University of Rome  3 The University of Texas at Austin
* Joint First Authors, varun.kompella@sony.com, roberto.capobianco@sony.com

Abstract

The year 2020 has seen the COVID-19 virus lead to one of the worst global pandemics in history. As a result, governments around the world are faced with the challenge of protecting public health, while keeping the economy running to the greatest extent possible. Epidemiological models provide insight into the spread of these types of diseases and predict the effects of possible intervention policies. However, to date, even the most data-driven intervention policies rely on heuristics. In this paper, we study how reinforcement learning (RL) can be used to optimize mitigation policies that minimize the economic impact without overwhelming the hospital capacity. Our main contributions are (1) a novel agent-based pandemic simulator which, unlike traditional models, is able to model fine-grained interactions among people at specific locations in a community; and (2) an RL-based methodology for optimizing fine-grained mitigation policies within this simulator. Our results validate both the overall simulator behavior and the learned policies under realistic conditions.

1 Introduction

Motivated by the devastating COVID-19 pandemic, much of the scientific community, across numerous disciplines, is currently focused on developing safe, quick, and effective methods to prevent the spread of biological viruses, or to otherwise mitigate the harm they cause. These methods include vaccines, treatments, public policy measures, economic stimuli, and hygiene education campaigns. Governments around the world are now faced with high-stakes decisions regarding which measures to enact at which times, often involving trade-offs between public health and economic resiliency. When making these decisions, governments often rely on epidemiological models that predict and project the course of the pandemic.

The premise of this paper is that the challenge of mitigating the spread of a pandemic while maximizing personal freedom and economic activity is fundamentally a sequential decision-making problem: the measures enacted on one day affect the challenges to be addressed on future days. As such, modern reinforcement learning (RL) algorithms are well-suited to optimize government responses to pandemics.

For such learned policies to be relevant, they must be trained within an epidemiological model that accurately simulates the spread of the pandemic, as well as the effects of government measures. To the best of our knowledge, none of the existing epidemiological simulations have the resolution to allow reinforcement learning to explore the regulations that governments are currently struggling with.

Motivated by this, our main contributions are:

1. The introduction of PandemicSimulator, a novel open-source[1] agent-based simulator that models the interactions between individuals at specific locations within a community. Developed in collaboration between AI researchers and epidemiologists (the co-authors of this paper), PandemicSimulator models realistic effects such as testing with false positive/negative rates, imperfect public adherence to social distancing measures, contact tracing, and variable spread rates among infected individuals. Crucially, PandemicSimulator models community interactions at a level of detail that allows the spread of the disease to be an emergent property of people's behaviors and the government's policies. An interface with OpenAI Gym (Brockman et al. 2016) is provided to enable support for standard RL libraries;

2. A demonstration that a reinforcement learning algorithm can indeed identify a policy that outperforms a range of reasonable baselines within this simulator;

3. An analysis of the resulting learned policy, which may provide insights regarding the relative efficacy of past and potential future COVID-19 mitigation policies.

AAAI Fall 2020 Symposium on AI for Social Good. Copyright © 2020, for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
[1] https://github.com/SonyAI/PandemicSimulator
While the resulting policies have not been implemented in any real-world communities, this paper establishes the potential power of RL in an agent-based simulator, and may serve as an important first step towards real-world adoption.

The remainder of the paper is organized as follows. We first discuss related work and then introduce our simulator in Section 3. Section 4 presents our reinforcement learning setup, while results are reported in Section 5. Finally, Section 6 reports some conclusions and directions for future work.

[Figure 1: Block diagram of the simulator. The government issues a COVID regulation update that sets the simulator characteristics (location rules, people behaviors, stage); the simulator steps 24 times (a full day) over persons, the infection model, COVID testing, and contact tracing, and returns an observation to the government.]

2 Related Work

Epidemiological models differ based on the level of granularity with which they track individuals and their disease states. "Compartmental" models group individuals of similar disease states together, assume all individuals within a specific compartment to be homogeneous, and only track the flow of individuals between compartments (Tolles and Luong 2020). While relatively simplistic, these models have been used for decades and continue to be useful for both retrospective studies and forecasts, as was seen during the emergence of recent diseases (Rivers and Scarpino 2018; Metcalf and Lessler 2017; Cobey 2020).

However, the commonly used macroscopic (or mass-action) compartmental models are not appropriate when outcomes depend on the characteristics of heterogeneous individuals. In such cases, network models (Bansal, Grenfell, and Meyers 2007; Liu et al. 2018; Khadilkar, Ganu, and Seetharam 2020) and agent-based models (Grefenstette et al. 2013; Del Valle, Mniszewski, and Hyman 2013; Aleta et al. 2020) may be more useful predictors.
Network models encode the relationships between individuals as static connections in a contact graph along which the disease can propagate. Conversely, agent-based simulations, such as the one introduced in this paper, explicitly track individuals, their current disease states, and their interactions with other agents over time. Agent-based models allow one to model as much complexity as desired—even to the level of simulating individual people and locations as we do—and thus enable one to model people's interactions at offices, stores, schools, etc. Because of their increased detail, they enable one to study the hyper-local interventions that governments consider when setting policy. For instance, Larremore et al. (2020) simulate the SARS-CoV-2 dynamics both through a fully-mixed mass-action model and an agent-based model representing the population and contact structure of New York City.

PandemicSimulator has the level of detail needed to allow us to apply RL to optimize dynamic government intervention policies (sometimes referred to as "trigger analysis", e.g. Duque et al. 2020). RL has been applied previously to several mass-action models (Libin et al. 2020; Song et al. 2020). These models, however, do not take into account individual behaviors or any complex interaction patterns. The work that is most closely related to our own includes the SARS-CoV-2 epidemic simulators from Hoertel et al. (2020) and Aleta et al. (2020), which model individuals grouped into households who visit and interact in the community. While their approach builds accurate contact networks of real populations, it does not allow us to model how the contact network would change as the government intervenes. Xiao et al. (2020) construct a detailed, pedestrian-level simulation of indoor transmission and study three types of interventions. Liu (2020) presents a microscopic approach to modeling epidemics, which can explicitly consider the consequences of individuals' decisions on the spread of the disease; multi-agent RL is then used to let individual agents learn to avoid infections.

For any model to be accepted by real-world decision-makers, it must come with a reason to trust that it accurately models the population and spread dynamics in their own community. For both mass-action and agent-based models, this trust is typically best instilled via a model calibration process that ensures that the model accurately tracks past data. For example, Hoertel et al. (2020) perform a calibration using daily mortality data until 15 April. Similarly, Libin et al. (2020) calibrate their model based on the symptomatic cases reported by the British Health Protection Agency for the 2009 influenza pandemic. Aleta et al. (2020), instead, only calibrate the weights of intra-layer links by means of a rescaling factor, such that the mean number of daily effective contacts in that layer matches the mean number of daily effective contacts in the corresponding social setting. While not a main focus of our research, we have taken initial steps to demonstrate that our model can be calibrated to track real-world data, as described in Section 3.

3 PandemicSimulator: A COVID-19 Simulator

The functional blocks of PandemicSimulator, shown in Figure 1, are:
• locations, with properties that define how people interact within them;
• people, who travel from one location to another according to individual daily schedules;
• an infection model that updates the infection state of each person;
• an optional testing strategy that imperfectly exposes the infection state of the population;
• an optional contact tracing strategy that identifies an infected person's recent contacts;
• a government that makes policy decisions.

The simulator models a day as 24 discrete hours, with each person potentially changing locations each hour. At the end of a day, each person's infection state is updated. The government interacts with the environment by declaring regulations, which impose restrictions on the people and locations. If the government activates testing, the simulator identifies a set of people to be tested and (imperfectly) reports their infection state. If contact tracing is active, each person's contacts from the previous days are updated. The updated perceived infection state and other state variables are returned as an observation to the government. The process iterates as long as the infection remains active. The following subsections describe the functional blocks of the simulator in greater detail.[2]

[2] We relegate some implementation details to an appendix at https://arxiv.org/pdf/2010.10560.pdf.
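To make this control loop concrete, the following minimal Python sketch mirrors the hourly/daily structure just described. It is illustrative only: the object and method names (persons, government, move, update_infection_state, and so on) are hypothetical stand-ins, not the actual PandemicSimulator API.

```python
# Illustrative sketch of the simulator's hourly/daily loop.
# All names below are hypothetical stand-ins, not the real PandemicSimulator API.
HOURS_PER_DAY = 24

def run_simulation(persons, government, num_days):
    """persons: objects with move(hour, regulation) and update_infection_state();
    government: object with current_regulation(), test_and_trace(persons),
    and update_regulation(observation)."""
    for day in range(num_days):
        regulation = government.current_regulation()        # rules in effect today
        for hour in range(HOURS_PER_DAY):
            for person in persons:
                person.move(hour, regulation)                # follow routine, subject to rules
        for person in persons:
            person.update_infection_state()                  # stochastic end-of-day update
        observation = government.test_and_trace(persons)     # noisy, partial view of infections
        government.update_regulation(observation)            # declare the next day's regulation
```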
Locations

Each location has a set of attributes that specify when the location is open, what roles people play there (e.g. worker or visitor), and the maximum number of people in each role. These attributes can be adjusted by regulations, such as when the government determines that businesses should operate at half capacity. Non-essential locations can be completely closed by the government. The location types used in our experiments are homes, hospitals, schools, grocery stores, retail stores, and hair salons. The simulator provides interfaces to make it easy to add new location types.

One of the advantages of an agent-based approach is that we can more accurately model variations in the way people interact in different types of locations based on their roles. The base location class supports workers and visitors, and defines a contact rate, b_loc, as a 3-tuple (x, y, z) ∈ [0, 1]^3, where x is the worker-worker rate, y is the worker-visitor rate, and z is the visitor-visitor rate. These rates are used to sample interactions every hour in each location to compute disease transmissions. For example, consider a location that has a contact rate of (0.5, 0.3, 0.4) and 10 workers and 20 visitors. In expectation, a worker would make contact with 5 co-workers and 6 visitors in the given hour. Similarly, a visitor would be expected to make contact with 3 workers and 8 other visitors. Refer to our supplementary material (Appendix A, Table 1) for a listing of the contact rates and other parameters for all location types used in our experiments.

The base location type can be extended for more complex situations. For example, a hospital adds an additional role (critically sick patients), a capacity representing ICU beds, and contact rates between workers and patients.
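The worked example above follows directly from the contact-rate 3-tuple. The sketch below reproduces the expected hourly contact counts exactly as computed in the text (rate times group size); it is a simplified illustration, and the actual simulator samples concrete contact pairs stochastically each hour rather than using expectations.

```python
# Expected hourly contacts implied by a location's contact-rate 3-tuple (x, y, z),
# following the worked example in the text (rate times group size).
def expected_contacts(contact_rate, num_workers, num_visitors):
    ww, wv, vv = contact_rate  # worker-worker, worker-visitor, visitor-visitor rates
    worker = {"coworkers": ww * num_workers, "visitors": wv * num_visitors}
    visitor = {"workers": wv * num_workers, "other_visitors": vv * num_visitors}
    return worker, visitor

worker, visitor = expected_contacts((0.5, 0.3, 0.4), num_workers=10, num_visitors=20)
print(worker)   # {'coworkers': 5.0, 'visitors': 6.0}
print(visitor)  # {'workers': 3.0, 'other_visitors': 8.0}
```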
Population

A person in the simulator is an automaton with a state and a person-specific behavior routine. These routines create person-to-person interactions throughout the simulated day and induce dynamic contact networks.

Individuals are assigned an age, drawn from the distribution of US age demographics, and are randomly assigned to be either high risk or of normal health. Based on their age, each person is categorized as a minor, a working adult, or a retiree. Working adults are assigned to a work location, and minors to a school, which they attend 8 hours a day, five days a week. Adults and retirees are assigned favorite hair salons, which they visit once a month, and grocery and retail stores, which they visit once a week. Each person has a compliance parameter that determines the probability that the person flouts regulations each hour.

The simulator constructs households from this population such that 15% house only retirees, and the rest have at least one working adult and are filled by randomly assigning the remaining children, adults, and retirees. To simulate informal social interactions, households may attend social events twice a month, subject to limits on gathering sizes.

At the end of each simulated day, the person's infection state is updated through a stochastic model based on all of that individual's interactions during the day (see next section). Unless otherwise prescribed by the government, when a person becomes ill they follow their routine. However, even the most basic government interventions require sick people to stay home, and at-risk individuals to avoid large gatherings. If a person becomes critically ill, they are admitted to the hospital, assuming it has not reached capacity.

SEIR Infection Model

PandemicSimulator implements a modified SEIR (susceptible, exposed, infected, recovered) infection model, as shown in Figure 2. See supplemental Appendix A, Table 2 for specific parameter values and the transition probabilities of the SEIR model. Once exposed to the virus, an individual's path through the disease is governed by the transition probabilities. However, the transition from the susceptible state (S) to the exposed state (E) requires a more detailed explanation.

[Figure 2: SEIR model used in PandemicSimulator. States: Susceptible (S), Exposed (E), Pre-Asymptomatic (P^A), Pre-Symptomatic (P^Y), Asymptomatic (I^A), Symptomatic (I^Y), Critical Hospitalized (C^H), Critical Not-Hospitalized (C^N), Recovered (R), and Dead (D).]

At the beginning of the simulation, a small, randomly selected set of individuals seeds the pandemic in the latent non-infectious, exposed state (E). The rest of the population starts in S. The exposed individuals soon transition to one of the infectious states and start interacting with susceptible people. For each susceptible person i, the probability they become infected on a given day, P_i^{S \to E}(day), is calculated based on their contacts with infectious people that day:

P_i^{S \to E}(\text{day}) = 1 - \prod_{t=0}^{23} \bar{P}_i^{S \to E}(t)   (1)

where \bar{P}_i^{S \to E}(t) is the probability that person i is not infected at hour t. Whether a susceptible person becomes infected in a given hour depends on whom they come in contact with. Let C_{ij}(t) = \{ p \sim_{b_j} N_j(t) \mid p \in N_j^{inf}(t) \} be the set of infected contacts of person i in location j at hour t, where N_j^{inf}(t) is the set of infected persons in location j at time t, N_j(t) is the set of all persons in j at time t, and b_j is a hand-set contact rate for j. To model the variations in how easily individuals spread the disease, each individual k has an infection spread rate a_k \sim \mathcal{N}^{bounded}(a, \sigma), sampled from a bounded Gaussian distribution. Accordingly,

\bar{P}_i^{S \to E}(t) = \prod_{k \in C_{ij}(t)} (1 - a_k).   (2)
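The daily S→E update in Eqs. (1) and (2) composes directly from the hourly contact sets. As a minimal sketch (assuming each hour is summarized by the spread rates a_k of that hour's infected contacts, and that spread rates are drawn from a truncated Gaussian as described above), the per-day infection probability can be computed as follows; all variable names are illustrative.

```python
# Sketch of the S -> E probability in Eqs. (1) and (2).
# hourly_infected_contact_rates: list of 24 lists, one per hour, each holding the
# spread rates a_k of the infected contacts met by person i during that hour.
import random

def prob_not_infected_in_hour(spread_rates):
    """Eq. (2): probability of escaping infection from all infected contacts in one hour."""
    p = 1.0
    for a_k in spread_rates:
        p *= (1.0 - a_k)
    return p

def prob_infected_in_day(hourly_infected_contact_rates):
    """Eq. (1): one minus the product of the 24 hourly escape probabilities."""
    p_escape_day = 1.0
    for spread_rates in hourly_infected_contact_rates:
        p_escape_day *= prob_not_infected_in_hour(spread_rates)
    return 1.0 - p_escape_day

def sample_spread_rate(mean, sigma, low=0.0, high=1.0):
    """a_k ~ bounded Gaussian: resample until the draw falls inside [low, high]."""
    while True:
        a_k = random.gauss(mean, sigma)
        if low <= a_k <= high:
            return a_k
```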
Testing and Contact Tracing

PandemicSimulator features a testing procedure to identify positive cases of COVID-19. We do not model concomitant illnesses, so every critically sick or dead person is assumed to have tested positive. Non-symptomatic and symptomatic individuals—and individuals that previously tested positive—are each tested at different configurable rates. Additionally, we model false positive and false negative test results. Refer to the supplementary material (Appendix A, Table 1) for a listing of the testing rates used in our experiments.

The government can also implement a contact tracing strategy that tracks, over the last N days, the number of times each pair of individuals interacted. When activated, this procedure allows the government to test or quarantine all recent 1st-order contacts and their households when an individual tests positive for COVID-19.

Government Regulations

As discussed earlier (see Figure 1), the government announces regulations to try to control the pandemic. The government can impose the following rules:
• social distancing: a value β ∈ [0, 1] that scales the contact rates of each location by (1 − β). 0 corresponds to unrestricted interactions; 1 eliminates all interactions;
• stay home if sick: a boolean. When set, people who have tested positive are requested to stay at home;
• practice good hygiene: a boolean. When set, people are requested to practice better-than-usual hygiene;
• wear facial coverings: a boolean. When set, people are instructed to wear facial coverings;
• avoid gatherings: a value that indicates the maximum recommended size of gatherings. These values can differ for high risk individuals and those of normal health;
• closed businesses: a list of non-essential business location types that are not permitted to open.

These types of regulations, modeled after government policies seen throughout the world, are often bundled into progressive stages to make them easier to communicate to the population. Refer to Appendix A, Tables 1-3 for details on the parameters, their sources, and the values set for each stage.
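As an illustration of how such a bundle of rules can be grouped into progressive stages, the sketch below defines a hypothetical regulation record and two example stages. The field names and the specific values are ours, chosen only for illustration; the actual stage definitions used in the experiments are listed in Appendix A, Table 3.

```python
# Hypothetical representation of a regulation stage; field names and values are
# illustrative only (the actual stages are defined in Appendix A, Table 3).
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Regulation:
    stage: int                          # 0 (open) .. 4 (most restrictive)
    social_distancing: float            # beta in [0, 1]; scales contact rates by (1 - beta)
    stay_home_if_sick: bool
    practice_good_hygiene: bool
    wear_facial_coverings: bool
    max_gathering_size: Optional[int]   # None means no limit
    closed_business_types: List[str] = field(default_factory=list)

STAGE_0 = Regulation(stage=0, social_distancing=0.0, stay_home_if_sick=False,
                     practice_good_hygiene=False, wear_facial_coverings=False,
                     max_gathering_size=None)
STAGE_4 = Regulation(stage=4, social_distancing=0.9, stay_home_if_sick=True,
                     practice_good_hygiene=True, wear_facial_coverings=True,
                     max_gathering_size=10,
                     closed_business_types=["RetailStore", "HairSalon", "School"])
```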
Calibration

PandemicSimulator includes many parameters whose values are still poorly known, such as the spread rate of COVID-19 in grocery stores and the degree to which face masks reduce transmission. We therefore consider these parameters as free variables that can be used to calibrate the simulator to match the historical data that has been observed around the world. These parameters can also be used to customize the simulator to match a specific community. Our calibration process and the values we chose to model COVID-19 are discussed in Appendix A.

4 RL for Optimization of Regulations

An ideal solution to minimize the spread of a new disease like COVID-19 is to eliminate all non-essential interactions and quarantine infected people until the last infected person has recovered. However, the window to execute this policy with minimal economic impact is very small. Once the disease spreads widely, this policy becomes impractical and the potential negative impact on the economy becomes enormous. In practice, around the world we have seen a strict lockdown followed by a gradual reopening that attempts to minimize the growth of the infection while allowing partial economic activity. Because COVID-19 is highly contagious, has a long incubation period, and large portions of the infected population are asymptomatic, managing the reopening without overwhelming healthcare resources is challenging. In this section, we tackle this sequential decision-making problem using reinforcement learning (RL; Sutton and Barto 2018) to optimize the reopening policy.

To define an RL problem we need to specify the environment, observations, actions, and rewards.

Environment: The agent-based pandemic simulator PandemicSimulator is the environment.[3]

[3] For the purpose of our experiments, we assume no vaccine is on the horizon and that survival rates remain constant. In practice, one may want to model the effect of improving survival rates as the medical community gains experience treating the virus.

Actions: The government is the learning agent. Its goal is to maximize its reward over the horizon of the pandemic. Its action set is constrained to a pool of escalating stages, which it can either increase, decrease, or keep the same when it takes an action. Refer to Appendix A, Table 3 for detailed descriptions of the stages.

Observations: At the end of each simulated day, the government observes the environment. For the sake of realism, the infection status of the population is partially observable, accessible only via statistics reflecting aggregate (noisy) test results and the number of hospitalizations.[4]

[4] The simulator tracks ground truth data, like the number of people in each infection state, for evaluation and reporting.

Rewards: We designed our reward function to encourage the agent to keep the number of persons in critical condition (n^c) below the hospital's capacity (C^{max}), while keeping the economy as unrestricted as possible. To this end, we use a reward that is a weighted sum of two objectives:

r = a \cdot \max\left( \frac{n^c - C^{max}}{C^{max}}, 0 \right) + b \cdot \frac{stage^p}{\max_j stage_j^p}   (3)

where stage ∈ [0, 4] denotes one of the 5 stages, with stage 4 being the most restrictive. a, b, and p are set to −0.4, −0.1, and 1.5, respectively, in our experiments. To discourage frequently changing restrictions, we also use a small shaping reward (with coefficient −0.02) proportional to |stage(t − 1) − stage(t)|.
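Read literally from Eq. (3) and the shaping term, the per-step reward can be written as the short function below. This is our reading of the equation with the stated constants (a = −0.4, b = −0.1, p = 1.5, shaping coefficient −0.02, stages 0 through 4); the exact implementation in PandemicSimulator may differ in detail.

```python
# Reward of Eq. (3) plus the stage-change shaping term, as described in the text.
# Constants follow the paper: a = -0.4, b = -0.1, p = 1.5, shaping coefficient -0.02.
A, B, P = -0.4, -0.1, 1.5
SHAPING = -0.02
MAX_STAGE = 4  # stages 0..4, stage 4 most restrictive

def reward(num_critical, hospital_capacity, stage, prev_stage):
    over_capacity = max((num_critical - hospital_capacity) / hospital_capacity, 0.0)
    economic_cost = (stage ** P) / (MAX_STAGE ** P)      # normalized restrictiveness
    stage_change = abs(stage - prev_stage)               # discourages frequent switching
    return A * over_capacity + B * economic_cost + SHAPING * stage_change

# Example: 12 critical patients with capacity 10, holding at stage 3.
print(reward(num_critical=12, hospital_capacity=10, stage=3, prev_stage=3))
```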
This linear mapping of stages into a [0, 1] reward space is arbitrary; if PandemicSimulator were being used to make real policy decisions, policy makers would use values that represent the real economic costs of the different stages.

Training: We use the discrete-action Soft Actor-Critic (SAC; Haarnoja et al. 2018) off-policy RL algorithm to optimize a reopening policy, where the actor and critic networks are two-layer multi-layer perceptrons with 128 hidden units. One motivation for using SAC over deep Q-learning approaches such as DQN (Mnih et al. 2015) is that we can provide the true infection summary as input to the critic while letting the actor see only the observed infection summary. Training is episodic, with each episode lasting 120 simulated days. At the end of each episode, the environment is reset to an initial state. Refer to Appendix A, Table 1 for learning parameters.
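The asymmetric input scheme described above (the critic sees the true infection summary, the actor only the observed one) can be sketched as two small PyTorch modules. This is a minimal illustration under our assumptions about input sizes; it is not the actual training code, and the real hyperparameters are listed in Appendix A, Table 1.

```python
# Minimal sketch of the asymmetric actor/critic described in the text:
# two-layer MLPs with 128 hidden units; the critic additionally receives the
# true infection summary, while the actor sees only the observed summary.
# Input/output sizes below are illustrative assumptions, not the paper's exact ones.
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=128):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

NUM_ACTIONS = 3          # decrease / keep / increase the regulation stage
OBS_DIM = 16             # observed (noisy) infection summary + current stage, illustrative
TRUE_DIM = 10            # ground-truth infection summary, available only inside the simulator

actor = mlp(OBS_DIM, NUM_ACTIONS)                  # outputs action logits
critic = mlp(OBS_DIM + TRUE_DIM, NUM_ACTIONS)      # Q-values, conditioned on the true state too

obs = torch.zeros(1, OBS_DIM)
true_summary = torch.zeros(1, TRUE_DIM)
action_logits = actor(obs)
q_values = critic(torch.cat([obs, true_summary], dim=-1))
```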
5 Experiments

The purpose of PandemicSimulator is to enable a more realistic evaluation of potential government policies for pandemic mitigation. In this section, we validate that the simulation behaves as expected under controlled conditions, illustrate some of the many analyses it facilitates, and, most importantly, demonstrate that it enables optimization via RL. Unless otherwise specified, we consider a community size of 1,000 and a hospital capacity of 10.[5] To enable calibration with real data, we limit government actions to five regulation stages similar to those used by real-world cities[6] (see appendix for details), and assume the government does not act until at least five people are infected.

[5] PandemicSimulator can easily handle larger experiments at the cost of greater time and computation. Informal experiments showed that results from a population of 1k are generally consistent with results from a larger population when all other settings are the same (or proportional). Refer to Table 7 in the appendix for simulation times for 1k and 10k population environments.
[6] Such as at https://tinyurl.com/y3pjthyz

Figure 3 shows plots of a single simulation run with no government regulations (Stage 0). Figure 3(a) shows the number of people in each infection category per day. Without government intervention, all individuals get infected, with the infection peaking around the 25th day. Figure 3(b) shows the metrics observed by the government through the lens of testing and hospitalizations. This plot illustrates how the government sees information that is both an underestimate of the penetration and delayed in time from the true state. Finally, Figure 3(c) shows that the number of people in critical condition goes well above the maximum hospital capacity (denoted with a yellow line), resulting in many people being more likely to die. The goal of a good reopening policy is to keep the red curve below the yellow line, while keeping as many businesses open as possible.

[Figure 3: A single run of the simulator with no government restrictions, showing (a) the true global infection summary, (b) the perceived infection state, and (c) the number of people in critical condition over time.]

Figure 4 shows plots of our infection metrics averaged over 30 randomly seeded runs. Each row in Figures 4(a-o) shows the results of executing a different (constant) regulation stage (after a short initial S0 phase), where S4 is the most restrictive and S0 is no restrictions. As expected, Figures 4(p-r) show that the infection peaks, critical cases, and number of deaths are all lower for more restrictive stages. One way of explaining the effects of these regulations is that the government restrictions alter the connectivity of the contact graph. For example, in the experiments above, under stage 4 restrictions there are many more connected components in the resulting contact graph than in any of the other four cases. See Appendix A for details of this analysis. Higher stage restrictions, however, have increased socio-economic costs (Figure 4(s); computed using the second objective in Eq. 3). Our RL experiments illustrate how these competing objectives can be balanced.

[Figure 4: Simulator dynamics at different regulation stages. The plots are generated based on 30 different randomly seeded runs of the simulator. Mean is shown by a solid line and variance either by a shaded region or an error line. In the left set of graphs, the red line at the top indicates what regulation stage is in effect on any given day.]

A key benefit of PandemicSimulator's agent-based approach is that it enables us to evaluate more dynamic policies[7] than those described above. In the remainder of this section we compare a set of hand-constructed policies, examine approximations of two real countries' policies, and study the impact of contact tracing. In Appendix A we also provide an analysis of the model's sensitivity to its parameters. Finally, we demonstrate the application of RL to construct dynamic policies that achieve the goal of avoiding exceeding hospital capacity while minimizing economic costs. As in Figure 4, throughout this section we report our results using plots that are generated by executing 30 simulator runs with fixed seeds. All our experiments were run on a single core, using an Intel i7-7700K CPU @ 4.2GHz with 32GB of RAM.

[7] In this paper, we use the word "policy" to mean a function from the state of the world to the regulatory action taken. It represents both the government's policy for combating the pandemic (even if heuristic) and the output of an RL optimization.

Benchmark Policies

To serve as benchmarks, we defined three heuristic policies and two policies inspired by real governments' approaches to managing the pandemic; a short sketch of the three heuristic schedules follows the list.

• S0-4-0: Using this policy, the government switches from stage 0 to 4 after reaching a threshold of 10 infected persons. After 30 days, it switches directly back to stage 0;
• S0-4-0-FI: The government starts like S0-4-0, but after 30 days it executes a fast, incremental (FI) return to stage 0, with intermediate stages lasting 5 days;
• S0-4-0-GI: This policy implements a more gradual, incremental (GI) return to stage 0, with each intermediate stage lasting 10 days;
• SWE: This policy represents the one adopted by the Swedish government, which recommended, but did not require, remote work, and was generally unrestrictive.[8] Table 4 in the appendix shows how we mapped this policy into a 2-stage action space;
• ITA: This policy represents the one adopted by the Italian government, which was generally much more restrictive.[9] Table 5 in the appendix shows our mapping of this policy to a 5-stage action space.

[8] https://tinyurl.com/y57yq2x7; https://tinyurl.com/y34egdeg
[9] https://tinyurl.com/y3cepy3m
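The three heuristic benchmarks can be written as simple day-indexed stage schedules once the lockdown has been triggered. The sketch below follows the descriptions above (hold stage 4 for 30 days, then return to stage 0 immediately, in 5-day steps, or in 10-day steps); it is our illustrative reading, not the exact benchmark implementation.

```python
# Heuristic benchmark schedules described above, written as day-indexed stage choices.
# days_since_lockdown counts days since the switch to stage 4 (triggered at 10 infections).
def s0_4_0(days_since_lockdown):
    return 4 if days_since_lockdown < 30 else 0

def s0_4_0_fi(days_since_lockdown, step_days=5):   # fast incremental return to stage 0
    if days_since_lockdown < 30:
        return 4
    return max(3 - (days_since_lockdown - 30) // step_days, 0)

def s0_4_0_gi(days_since_lockdown):                # gradual incremental, 10-day steps
    return s0_4_0_fi(days_since_lockdown, step_days=10)
```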
Figure 5 compares the hand-constructed policies. From the point of view of minimizing overall mortality, S0-4-0-GI performed best. In particular, slower re-openings ensure longer but smaller peaks. While this approach leads to a second wave right after stage 0 is reached, the gradual policy prevents hospital capacity from being exceeded.

Figure 5 also contrasts the approximations of the policies employed by Sweden and Italy in the early stages of the pandemic (through February 2020). The ITA policy leads to fewer deaths and only a marginally longer duration. However, this simple comparison does not account for the economic cost of the policies, an important factor that is considered by decision-makers.

[Figure 5: Simulator dynamics under different hand-constructed and reference government policies.]

Testing and Contact Tracing

To validate PandemicSimulator's ability to model testing and contact tracing, we compare several strategies with different testing rates and contact horizons. We consider daily testing rates of 0.02, 0.3, and 1.0 (where 1.0 represents the extreme case of everyone being tested every day) and contact tracing histories of 0, 2, 5, or 10 days. For each condition, we ran the experiments with the same 30 random seeds. The full results appear in Appendix A.

Not surprisingly, contact tracing is most beneficial with higher testing rates and longer contact histories, because more testing finds more infected people and contact tracing is able to encourage more of those people's contacts to stay home. Of course, the best strategy is to test every person every day and quarantine anyone who tests positive. Unfortunately, this strategy is impractical except in the most isolated communities. Although this aggressive strategy often stamps out the disease, false-negative test results sometimes allow the infection to simmer below the surface and spread very slowly through the population.

Optimizing Reopening using RL

A major design goal of PandemicSimulator is to support optimization of re-opening policies using RL. In this section, we test our hypothesis that a learned policy can outperform the benchmark policies. Specifically, RL optimizes a policy that (a) is adaptive to the changing infection state, (b) keeps the number of critical patients below the hospital threshold, and (c) minimizes the economic cost.

We ran experiments using the 5-stage regulations defined in Table 3 (Appendix A); trained the policy by running RL optimization for roughly 1 million training steps; and evaluated the learned policies across 30 randomly seeded initial conditions. Figures 6(a-f) show results comparing our best heuristic policy (S0-4-0-GI) to the learned policy.
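As a sketch of this evaluation protocol (30 fixed seeds, 120-day episodes, one regulation decision per day), the loop below shows how a policy could be rolled out against a Gym-style environment. The environment constructor and policy are passed in as parameters; the reset()/step() convention is the generic classic-Gym one, not the specific PandemicSimulator Gym interface.

```python
# Sketch of the evaluation protocol: roll a policy out for 120-day episodes
# over 30 fixed seeds. `make_env` and `policy` are supplied by the caller;
# the env is assumed to follow the classic Gym reset()/step() convention.
EPISODE_DAYS = 120
NUM_SEEDS = 30

def evaluate(make_env, policy):
    returns = []
    for seed in range(NUM_SEEDS):
        env = make_env(seed)
        obs = env.reset()
        total = 0.0
        for _ in range(EPISODE_DAYS):
            action = policy(obs)                    # e.g. decrease/keep/increase the stage
            obs, reward, done, info = env.step(action)
            total += reward
            if done:
                break
        returns.append(total)
    return returns
```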
[Figure 6: Simulator runs comparing the S0-4-0-GI heuristic policy with a learned policy. The figure also shows results of the learned policy evaluated at different action frequencies and in a larger population environment.]

The learned policy is better across all metrics, as shown in Figures 6(m-p). Further, we can see how the learned policy reacts to the state of the pandemic; Figure 6(f) shows different traces through the regulation space for 3 of the trials. The learned policy briefly oscillates between Stages 3 and 4 around day 40. To minimize such oscillations, we evaluated the policy at an action frequency of one action every 3 days (twice weekly; labeled Eval_w3) and every 7 days (weekly; labeled Eval_w7). Figure 6(p) shows that the twice-weekly variant performs well, while making changes only once a week slightly reduces the reward.
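The lower action frequencies (Eval_w3, Eval_w7) amount to holding each chosen stage fixed for a set number of days. A minimal sketch of such a wrapper around an otherwise daily policy is shown below; the wrapper name and interface are ours, chosen only for illustration.

```python
# Sketch of evaluating a daily policy at a lower action frequency: the wrapped
# policy recomputes its stage only every `repeat_days` days and holds it otherwise.
def with_action_repeat(policy, repeat_days):
    state = {"day": 0, "action": None}
    def wrapped(obs):
        if state["day"] % repeat_days == 0 or state["action"] is None:
            state["action"] = policy(obs)
        state["day"] += 1
        return state["action"]
    return wrapped

# Eval_w3 and Eval_w7 variants of a learned daily policy:
# eval_w3 = with_action_repeat(learned_policy, repeat_days=3)
# eval_w7 = with_action_repeat(learned_policy, repeat_days=7)
```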
To test robustness to scaling, we also evaluated the learned policy (with daily actions) in a town with a population of 10,000 (Eval_10k) and found that the results transfer well. This success hints at the possibility of learning policies quickly even when intending to transfer them to large cities.

This section presented results on applying RL to optimize reopening policies. An interesting next step would be to study and distill the learned policies into simpler rule-based strategies, to make them easier for policy makers to implement. For example, in Figure 6(l), we see that the RL policy waits at stage 2 before reopening schools, to keep the second wave of infections under control. Whether this behavior is specific to school reopening is one of many interesting questions that this type of simulator allows us to investigate.

6 Conclusion

Epidemiological models aim at providing predictions regarding the effects of various possible intervention policies that are typically manually selected. In this paper, instead, we introduce a reinforcement learning methodology for optimizing adaptive mitigation policies aimed at maximizing the degree to which the economy can remain open without overwhelming the local hospital capacity. To this end, we implement an open-source agent-based simulator, where pandemics can be generated as the result of the contacts and interactions between individual agents in a community. We analyze the sensitivity of the simulator to some of its main parameters and illustrate its main features, while also showing that adaptive policies optimized via RL achieve better performance when compared to heuristic policies and policies representative of those used in the real world.

While our work opens up the possibility of using machine learning to explore fine-grained policies in this context, PandemicSimulator could be expanded and improved in several directions. One important direction for future work is to perform a more complete and detailed calibration of its parameters against real-world data. It would also be useful to implement and analyze additional testing and contact tracing strategies to contain the spread of pandemics.
References

Aleta, A.; Martín-Corral, D.; y Piontti, A. P.; Ajelli, M.; Litvinova, M.; Chinazzi, M.; Dean, N. E.; Halloran, M. E.; Longini Jr, I. M.; Merler, S.; et al. 2020. Modelling the impact of testing, contact tracing and household quarantine on second waves of COVID-19. Nature Human Behaviour 1–8.

Bansal, S.; Grenfell, B. T.; and Meyers, L. A. 2007. When individual behaviour matters: homogeneous and network models in epidemiology. Journal of the Royal Society Interface 4(16): 879–891.

Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; and Zaremba, W. 2016. OpenAI Gym. arXiv preprint arXiv:1606.01540.

Cobey, S. 2020. Modeling infectious disease dynamics. Science.

Del Valle, S. Y.; Mniszewski, S. M.; and Hyman, J. M. 2013. Modeling the impact of behavior changes on the spread of pandemic influenza. In Modeling the Interplay Between Human Behavior and the Spread of Infectious Diseases, 59–77. Springer.

Duque, D.; Morton, D. P.; Singh, B.; Du, Z.; Pasco, R.; and Meyers, L. A. 2020. COVID-19: How to Relax Social Distancing If You Must. medRxiv doi:10.1101/2020.04.29.20085134. URL https://www.medrxiv.org/content/early/2020/05/05/2020.04.29.20085134.

Grefenstette, J. J.; Brown, S. T.; Rosenfeld, R.; DePasse, J.; Stone, N. T.; Cooley, P. C.; Wheaton, W. D.; Fyshe, A.; Galloway, D. D.; Sriram, A.; et al. 2013. FRED (A Framework for Reconstructing Epidemic Dynamics): an open-source software system for modeling infectious diseases and control strategies using census-based populations. BMC Public Health 13(1): 1–14.

Haarnoja, T.; Zhou, A.; Abbeel, P.; and Levine, S. 2018. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. In International Conference on Machine Learning, 1861–1870.

Hoertel, N.; Blachier, M.; Blanco, C.; Olfson, M.; Massetti, M.; Sánchez Rico, M.; Limosin, F.; and Leleu, H. 2020. A stochastic agent-based model of the SARS-CoV-2 epidemic in France. Nature Medicine.

Khadilkar, H.; Ganu, T.; and Seetharam, D. P. 2020. Optimising Lockdown Policies for Epidemic Control using Reinforcement Learning. Transactions of the Indian National Academy of Engineering.

Larremore, D. B.; Wilder, B.; Lester, E.; Shehata, S.; Burke, J. M.; Hay, J. A.; Tambe, M.; Mina, M. J.; and Parker, R. 2020. Test sensitivity is secondary to frequency and turnaround time for COVID-19 surveillance. medRxiv.

Libin, P.; Moonens, A.; Verstraeten, T.; Perez-Sanjines, F.; Hens, N.; Lemey, P.; and Nowé, A. 2020. Deep reinforcement learning for large-scale epidemic control. arXiv preprint arXiv:2003.13676.

Liu, C. 2020. A microscopic epidemic model and pandemic prediction using multi-agent reinforcement learning. arXiv preprint arXiv:2004.12959.

Liu, Q.-H.; Ajelli, M.; Aleta, A.; Merler, S.; Moreno, Y.; and Vespignani, A. 2018. Measurability of the epidemic reproduction number in data-driven contact networks. Proceedings of the National Academy of Sciences 115(50): 12680–12685. doi:10.1073/pnas.1811115115. URL https://www.pnas.org/content/115/50/12680.

Metcalf, C. J. E.; and Lessler, J. 2017. Opportunities and challenges in modeling emerging infectious diseases. Science.

Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature 518(7540): 529–533.

Rivers, C. M.; and Scarpino, S. V. 2018. Modelling the trajectory of disease outbreaks works. Nature.

Song, S.; Zong, Z.; Li, Y.; Liu, X.; and Yu, Y. 2020. Reinforced Epidemic Control: Saving Both Lives and Economy. arXiv preprint arXiv:2008.01257.

Sutton, R. S.; and Barto, A. G. 2018. Reinforcement Learning: An Introduction. MIT Press.

Tolles, J.; and Luong, T. 2020. Modeling Epidemics With Compartmental Models. JAMA.

Xiao, Y.; Yang, M.; Zhu, Z.; Yang, H.; Zhang, L.; and Ghader, S. 2020. Modeling indoor-level non-pharmaceutical interventions during the COVID-19 pandemic: a pedestrian dynamics-based microscopic simulation approach. arXiv preprint arXiv:2006.10666.