Reinforcement Learning for Optimization of COVID-19 Mitigation Policies

Varun Kompella*1, Roberto Capobianco*1,2, Stacy Jong3, Jonathan Browne3, Spencer Fox3, Lauren Meyers3, Peter Wurman1, Peter Stone1,3
1 Sony AI  2 Sapienza University of Rome  3 The University of Texas at Austin
* Joint First Authors, varun.kompella@sony.com, roberto.capobianco@sony.com

Abstract

The year 2020 has seen the COVID-19 virus lead to one of the worst global pandemics in history. As a result, governments around the world are faced with the challenge of protecting public health, while keeping the economy running to the greatest extent possible. Epidemiological models provide insight into the spread of these types of diseases and predict the effects of possible intervention policies. However, to date, even the most data-driven intervention policies rely on heuristics. In this paper, we study how reinforcement learning (RL) can be used to optimize mitigation policies that minimize the economic impact without overwhelming the hospital capacity. Our main contributions are (1) a novel agent-based pandemic simulator which, unlike traditional models, is able to model fine-grained interactions among people at specific locations in a community; and (2) an RL-based methodology for optimizing fine-grained mitigation policies within this simulator. Our results validate both the overall simulator behavior and the learned policies under realistic conditions.

1 Introduction

Motivated by the devastating COVID-19 pandemic, much of the scientific community, across numerous disciplines, is currently focused on developing safe, quick, and effective methods to prevent the spread of biological viruses, or to otherwise mitigate the harm they cause. These methods include vaccines, treatments, public policy measures, economic stimuli, and hygiene education campaigns. Governments around the world are now faced with high-stakes decisions regarding which measures to enact at which times, often involving trade-offs between public health and economic resiliency. When making these decisions, governments often rely on epidemiological models that predict and project the course of the pandemic.

The premise of this paper is that the challenge of mitigating the spread of a pandemic while maximizing personal freedom and economic activity is fundamentally a sequential decision-making problem: the measures enacted on one day affect the challenges to be addressed on future days. As such, modern reinforcement learning (RL) algorithms are well-suited to optimize government responses to pandemics.

For such learned policies to be relevant, they must be trained within an epidemiological model that accurately simulates the spread of the pandemic, as well as the effects of government measures. To the best of our knowledge, none of the existing epidemiological simulations have the resolution to allow reinforcement learning to explore the regulations that governments are currently struggling with.

Motivated by this, our main contributions are:

1. The introduction of PandemicSimulator, a novel open-source[1] agent-based simulator that models the interactions between individuals at specific locations within a community. Developed in collaboration between AI researchers and epidemiologists (the co-authors of this paper), PandemicSimulator models realistic effects such as testing with false positive/negative rates, imperfect public adherence to social distancing measures, contact tracing, and variable spread rates among infected individuals. Crucially, PandemicSimulator models community interactions at a level of detail that allows the spread of the disease to be an emergent property of people's behaviors and the government's policies. An interface with OpenAI Gym (Brockman et al. 2016) is provided to enable support for standard RL libraries;

2. A demonstration that a reinforcement learning algorithm can indeed identify a policy that outperforms a range of reasonable baselines within this simulator;

3. An analysis of the resulting learned policy, which may provide insights regarding the relative efficacy of past and potential future COVID-19 mitigation policies.

AAAI Fall 2020 Symposium on AI for Social Good. Copyright © 2020, for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
[1] https://github.com/SonyAI/PandemicSimulator
While the resulting policies have not been implemented in any real-world communities, this paper establishes the potential power of RL in an agent-based simulator, and may serve as an important first step towards real-world adoption.

The remainder of the paper is organized as follows. We first discuss related work and then introduce our simulator in Section 3. Section 4 presents our reinforcement learning setup, while results are reported in Section 5. Finally, Section 6 reports some conclusions and directions for future work.

[Figure 1: Block diagram of the simulator. The government issues a COVID regulation update that sets the simulator characteristics (location rules, people behaviors, stage); the simulator steps 24 times (a full day) over persons, the infection model, COVID testing, and contact tracing, and returns an observation to the government.]

2 Related Work

Epidemiological models differ based on the level of granularity with which they track individuals and their disease states. "Compartmental" models group individuals of similar disease states together, assume all individuals within a specific compartment to be homogeneous, and only track the flow of individuals between compartments (Tolles and Luong 2020). While relatively simplistic, these models have been used for decades and continue to be useful for both retrospective studies and forecasts, as was seen during the emergence of recent diseases (Rivers and Scarpino 2018; Metcalf and Lessler 2017; Cobey 2020).

However, the commonly used macroscopic (or mass-action) compartmental models are not appropriate when outcomes depend on the characteristics of heterogeneous individuals. In such cases, network models (Bansal, Grenfell, and Meyers 2007; Liu et al. 2018; Khadilkar, Ganu, and Seetharam 2020) and agent-based models (Grefenstette et al. 2013; Del Valle, Mniszewski, and Hyman 2013; Aleta et al. 2020) may be more useful predictors.
Network models encode the relationships between individuals as static connections in a contact graph along which the disease can propagate. Conversely, agent-based simulations, such as the one introduced in this paper, explicitly track individuals, their current disease states, and their interactions with other agents over time. Agent-based models allow one to model as much complexity as desired—even to the level of simulating individual people and locations as we do—and thus enable one to model people's interactions at offices, stores, schools, etc. Because of their increased detail, they enable one to study the hyper-local interventions that governments consider when setting policy. For instance, Larremore et al. (2020) simulate the SARS-CoV-2 dynamics both through a fully-mixed mass-action model and an agent-based model representing the population and contact structure of New York City.

PandemicSimulator has the level of detail needed to allow us to apply RL to optimize dynamic government intervention policies (sometimes referred to as "trigger analysis", e.g. Duque et al. 2020). RL has been applied previously to several mass-action models (Libin et al. 2020; Song et al. 2020). These models, however, do not take into account individual behaviors or any complex interaction patterns. The work that is most closely related to our own includes the SARS-CoV-2 epidemic simulators from Hoertel et al. (2020) and Aleta et al. (2020), which model individuals grouped into households who visit and interact in the community. While their approach builds accurate contact networks of real populations, it does not allow us to model how the contact network would change as the government intervenes. Xiao et al. (2020) construct a detailed, pedestrian-level simulation of indoor transmission and study three types of interventions. Liu (2020) presents a microscopic approach to modeling epidemics, which can explicitly consider the consequences of individuals' decisions on the spread of the disease; multi-agent RL is then used to let individual agents learn to avoid infections.

For any model to be accepted by real-world decision-makers, it must come with a reason to trust that it accurately models the population and spread dynamics in their own community. For both mass-action and agent-based models, this trust is typically best instilled via a model calibration process that ensures that the model accurately tracks past data. For example, Hoertel et al. (2020) perform a calibration using daily mortality data until 15 April. Similarly, Libin et al. (2020) calibrate their model based on the symptomatic cases reported by the British Health Protection Agency for the 2009 influenza pandemic. Aleta et al. (2020), instead, only calibrate the weights of intra-layer links by means of a rescaling factor, such that the mean number of daily effective contacts in that layer matches the mean number of daily effective contacts in the corresponding social setting. While not a main focus of our research, we have taken initial steps to demonstrate that our model can be calibrated to track real-world data, as described in Section 3.

3 PandemicSimulator: A COVID-19 Simulator

The functional blocks of PandemicSimulator, shown in Figure 1, are:
• locations, with properties that define how people interact within them;
• people, who travel from one location to another according to individual daily schedules;
• an infection model that updates the infection state of each person;
• an optional testing strategy that imperfectly exposes the infection state of the population;
• an optional contact tracing strategy that identifies an infected person's recent contacts;
• a government that makes policy decisions.

The simulator models a day as 24 discrete hours, with each person potentially changing locations each hour. At the end of a day, each person's infection state is updated. The government interacts with the environment by declaring regulations, which impose restrictions on the people and locations. If the government activates testing, the simulator identifies a set of people to be tested and (imperfectly) reports their infection state. If contact tracing is active, each person's contacts from the previous days are updated. The updated perceived infection state and other state variables are returned as an observation to the government. The process iterates as long as the infection remains active. The following subsections describe the functional blocks of the simulator in greater detail.[2]

[2] We relegate some implementation details to an appendix at https://arxiv.org/pdf/2010.10560.pdf.
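To make this control loop concrete, the following minimal Python sketch mirrors the hourly/daily structure just described. It is illustrative only: the object and method names (persons, government, move, update_infection_state, and so on) are hypothetical stand-ins, not the actual PandemicSimulator API.

```python
# Illustrative sketch of the simulator's hourly/daily loop.
# All names below are hypothetical stand-ins, not the real PandemicSimulator API.
HOURS_PER_DAY = 24

def run_simulation(persons, government, num_days):
    """persons: objects with move(hour, regulation) and update_infection_state();
    government: object with current_regulation(), test_and_trace(persons),
    and update_regulation(observation)."""
    for day in range(num_days):
        regulation = government.current_regulation()        # rules in effect today
        for hour in range(HOURS_PER_DAY):
            for person in persons:
                person.move(hour, regulation)                # follow routine, subject to rules
        for person in persons:
            person.update_infection_state()                  # stochastic end-of-day update
        observation = government.test_and_trace(persons)     # noisy, partial view of infections
        government.update_regulation(observation)            # declare the next day's regulation
```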
Locations

Each location has a set of attributes that specify when the location is open, what roles people play there (e.g. worker or visitor), and the maximum number of people in each role. These attributes can be adjusted by regulations, such as when the government determines that businesses should operate at half capacity. Non-essential locations can be completely closed by the government. The location types used in our experiments are homes, hospitals, schools, grocery stores, retail stores, and hair salons. The simulator provides interfaces to make it easy to add new location types.

One of the advantages of an agent-based approach is that we can more accurately model variations in the way people interact in different types of locations based on their roles. The base location class supports workers and visitors, and defines a contact rate, b_loc, as a 3-tuple (x, y, z) ∈ [0, 1]^3, where x is the worker-worker rate, y is the worker-visitor rate, and z is the visitor-visitor rate. These rates are used to sample interactions every hour in each location to compute disease transmissions. For example, consider a location that has a contact rate of (0.5, 0.3, 0.4) and 10 workers and 20 visitors. In expectation, a worker would make contact with 5 co-workers and 6 visitors in the given hour. Similarly, a visitor would be expected to make contact with 3 workers and 8 other visitors. Refer to our supplementary material (Appendix A, Table 1) for a listing of the contact rates and other parameters for all location types used in our experiments.

The base location type can be extended for more complex situations. For example, a hospital adds an additional role (critically sick patients), a capacity representing ICU beds, and contact rates between workers and patients.
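The worked example above follows directly from the contact-rate 3-tuple. The sketch below reproduces the expected hourly contact counts exactly as computed in the text (rate times group size); it is a simplified illustration, and the actual simulator samples concrete contact pairs stochastically each hour rather than using expectations.

```python
# Expected hourly contacts implied by a location's contact-rate 3-tuple (x, y, z),
# following the worked example in the text (rate times group size).
def expected_contacts(contact_rate, num_workers, num_visitors):
    ww, wv, vv = contact_rate  # worker-worker, worker-visitor, visitor-visitor rates
    worker = {"coworkers": ww * num_workers, "visitors": wv * num_visitors}
    visitor = {"workers": wv * num_workers, "other_visitors": vv * num_visitors}
    return worker, visitor

worker, visitor = expected_contacts((0.5, 0.3, 0.4), num_workers=10, num_visitors=20)
print(worker)   # {'coworkers': 5.0, 'visitors': 6.0}
print(visitor)  # {'workers': 3.0, 'other_visitors': 8.0}
```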
Population

A person in the simulator is an automaton with a state and a person-specific behavior routine. These routines create person-to-person interactions throughout the simulated day and induce dynamic contact networks.

Individuals are assigned an age, drawn from the distribution of US age demographics, and are randomly assigned to be either high risk or of normal health. Based on their age, each person is categorized as a minor, a working adult, or a retiree. Working adults are assigned to a work location, and minors to a school, which they attend 8 hours a day, five days a week. Adults and retirees are assigned favorite hair salons, which they visit once a month, and grocery and retail stores, which they visit once a week. Each person has a compliance parameter that determines the probability that the person flouts regulations each hour.

The simulator constructs households from this population such that 15% house only retirees, and the rest have at least one working adult and are filled by randomly assigning the remaining children, adults, and retirees. To simulate informal social interactions, households may attend social events twice a month, subject to limits on gathering sizes.

At the end of each simulated day, the person's infection state is updated through a stochastic model based on all of that individual's interactions during the day (see next section). Unless otherwise prescribed by the government, when a person becomes ill they follow their routine. However, even the most basic government interventions require sick people to stay home, and at-risk individuals to avoid large gatherings. If a person becomes critically ill, they are admitted to the hospital, assuming it has not reached capacity.

SEIR Infection Model

PandemicSimulator implements a modified SEIR (susceptible, exposed, infected, recovered) infection model, as shown in Figure 2. See supplemental Appendix A, Table 2 for specific parameter values and the transition probabilities of the SEIR model. Once exposed to the virus, an individual's path through the disease is governed by the transition probabilities. However, the transition from the susceptible state (S) to the exposed state (E) requires a more detailed explanation.

[Figure 2: SEIR model used in PandemicSimulator. States: Susceptible (S), Exposed (E), Pre-Asymptomatic (P^A), Pre-Symptomatic (P^Y), Asymptomatic (I^A), Symptomatic (I^Y), Critical Hospitalized (C^H), Critical Not-Hospitalized (C^N), Recovered (R), and Dead (D).]

At the beginning of the simulation, a small, randomly selected set of individuals seeds the pandemic in the latent non-infectious, exposed state (E). The rest of the population starts in S. The exposed individuals soon transition to one of the infectious states and start interacting with susceptible people. For each susceptible person i, the probability they become infected on a given day, P_i^{S \to E}(day), is calculated based on their contacts with infectious people that day:

P_i^{S \to E}(\text{day}) = 1 - \prod_{t=0}^{23} \bar{P}_i^{S \to E}(t)   (1)

where \bar{P}_i^{S \to E}(t) is the probability that person i is not infected at hour t. Whether a susceptible person becomes infected in a given hour depends on whom they come in contact with. Let C_{ij}(t) = \{ p \sim_{b_j} N_j(t) \mid p \in N_j^{inf}(t) \} be the set of infected contacts of person i in location j at hour t, where N_j^{inf}(t) is the set of infected persons in location j at time t, N_j(t) is the set of all persons in j at time t, and b_j is a hand-set contact rate for j. To model the variations in how easily individuals spread the disease, each individual k has an infection spread rate a_k \sim \mathcal{N}^{bounded}(a, \sigma), sampled from a bounded Gaussian distribution. Accordingly,

\bar{P}_i^{S \to E}(t) = \prod_{k \in C_{ij}(t)} (1 - a_k).   (2)
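The daily S→E update in Eqs. (1) and (2) composes directly from the hourly contact sets. As a minimal sketch (assuming each hour is summarized by the spread rates a_k of that hour's infected contacts, and that spread rates are drawn from a truncated Gaussian as described above), the per-day infection probability can be computed as follows; all variable names are illustrative.

```python
# Sketch of the S -> E probability in Eqs. (1) and (2).
# hourly_infected_contact_rates: list of 24 lists, one per hour, each holding the
# spread rates a_k of the infected contacts met by person i during that hour.
import random

def prob_not_infected_in_hour(spread_rates):
    """Eq. (2): probability of escaping infection from all infected contacts in one hour."""
    p = 1.0
    for a_k in spread_rates:
        p *= (1.0 - a_k)
    return p

def prob_infected_in_day(hourly_infected_contact_rates):
    """Eq. (1): one minus the product of the 24 hourly escape probabilities."""
    p_escape_day = 1.0
    for spread_rates in hourly_infected_contact_rates:
        p_escape_day *= prob_not_infected_in_hour(spread_rates)
    return 1.0 - p_escape_day

def sample_spread_rate(mean, sigma, low=0.0, high=1.0):
    """a_k ~ bounded Gaussian: resample until the draw falls inside [low, high]."""
    while True:
        a_k = random.gauss(mean, sigma)
        if low <= a_k <= high:
            return a_k
```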
Testing and Contact Tracing

PandemicSimulator features a testing procedure to identify positive cases of COVID-19. We do not model concomitant illnesses, so every critically sick or dead person is assumed to have tested positive. Non-symptomatic and symptomatic individuals—and individuals that previously tested positive—are each tested at different configurable rates. Additionally, we model false positive and false negative test results. Refer to the supplementary material (Appendix A, Table 1) for a listing of the testing rates used in our experiments.

The government can also implement a contact tracing strategy that tracks, over the last N days, the number of times each pair of individuals interacted. When activated, this procedure allows the government to test or quarantine all recent 1st-order contacts and their households when an individual tests positive for COVID-19.

Government Regulations

As discussed earlier (see Figure 1), the government announces regulations to try to control the pandemic. The government can impose the following rules:
• social distancing: a value β ∈ [0, 1] that scales the contact rates of each location by (1 − β). 0 corresponds to unrestricted interactions; 1 eliminates all interactions;
• stay home if sick: a boolean. When set, people who have tested positive are requested to stay at home;
• practice good hygiene: a boolean. When set, people are requested to practice better-than-usual hygiene;
• wear facial coverings: a boolean. When set, people are instructed to wear facial coverings;
• avoid gatherings: a value that indicates the maximum recommended size of gatherings. These values can differ for high risk individuals and those of normal health;
• closed businesses: a list of non-essential business location types that are not permitted to open.

These types of regulations, modeled after government policies seen throughout the world, are often bundled into progressive stages to make them easier to communicate to the population. Refer to Appendix A, Tables 1-3 for details on the parameters, their sources, and the values set for each stage.
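As an illustration of how such a bundle of rules can be grouped into progressive stages, the sketch below defines a hypothetical regulation record and two example stages. The field names and the specific values are ours, chosen only for illustration; the actual stage definitions used in the experiments are listed in Appendix A, Table 3.

```python
# Hypothetical representation of a regulation stage; field names and values are
# illustrative only (the actual stages are defined in Appendix A, Table 3).
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Regulation:
    stage: int                          # 0 (open) .. 4 (most restrictive)
    social_distancing: float            # beta in [0, 1]; scales contact rates by (1 - beta)
    stay_home_if_sick: bool
    practice_good_hygiene: bool
    wear_facial_coverings: bool
    max_gathering_size: Optional[int]   # None means no limit
    closed_business_types: List[str] = field(default_factory=list)

STAGE_0 = Regulation(stage=0, social_distancing=0.0, stay_home_if_sick=False,
                     practice_good_hygiene=False, wear_facial_coverings=False,
                     max_gathering_size=None)
STAGE_4 = Regulation(stage=4, social_distancing=0.9, stay_home_if_sick=True,
                     practice_good_hygiene=True, wear_facial_coverings=True,
                     max_gathering_size=10,
                     closed_business_types=["RetailStore", "HairSalon", "School"])
```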
Calibration

PandemicSimulator includes many parameters whose values are still poorly known, such as the spread rate of COVID-19 in grocery stores and the degree to which face masks reduce transmission. We therefore consider these parameters as free variables that can be used to calibrate the simulator to match the historical data that has been observed around the world. These parameters can also be used to customize the simulator to match a specific community. Our calibration process and the values we chose to model COVID-19 are discussed in Appendix A.

4 RL for Optimization of Regulations

An ideal solution to minimize the spread of a new disease like COVID-19 is to eliminate all non-essential interactions and quarantine infected people until the last infected person has recovered. However, the window to execute this policy with minimal economic impact is very small. Once the disease spreads widely, this policy becomes impractical and the potential negative impact on the economy becomes enormous. In practice, around the world we have seen a strict lockdown followed by a gradual reopening that attempts to minimize the growth of the infection while allowing partial economic activity. Because COVID-19 is highly contagious, has a long incubation period, and large portions of the infected population are asymptomatic, managing the reopening without overwhelming healthcare resources is challenging. In this section, we tackle this sequential decision-making problem using reinforcement learning (RL; Sutton and Barto 2018) to optimize the reopening policy.

To define an RL problem we need to specify the environment, observations, actions, and rewards.

Environment: The agent-based pandemic simulator PandemicSimulator is the environment.[3]

[3] For the purpose of our experiments, we assume no vaccine is on the horizon and that survival rates remain constant. In practice, one may want to model the effect of improving survival rates as the medical community gains experience treating the virus.

Actions: The government is the learning agent. Its goal is to maximize its reward over the horizon of the pandemic. Its action set is constrained to a pool of escalating stages, which it can either increase, decrease, or keep the same when it takes an action. Refer to Appendix A, Table 3 for detailed descriptions of the stages.

Observations: At the end of each simulated day, the government observes the environment. For the sake of realism, the infection status of the population is partially observable, accessible only via statistics reflecting aggregate (noisy) test results and the number of hospitalizations.[4]

[4] The simulator tracks ground truth data, like the number of people in each infection state, for evaluation and reporting.

Rewards: We designed our reward function to encourage the agent to keep the number of persons in critical condition (n^c) below the hospital's capacity (C^{max}), while keeping the economy as unrestricted as possible. To this end, we use a reward that is a weighted sum of two objectives:

r = a \cdot \max\left( \frac{n^c - C^{max}}{C^{max}}, 0 \right) + b \cdot \frac{stage^p}{\max_j stage_j^p}   (3)

where stage ∈ [0, 4] denotes one of the 5 stages, with stage 4 being the most restrictive. a, b, and p are set to −0.4, −0.1, and 1.5, respectively, in our experiments. To discourage frequently changing restrictions, we also use a small shaping reward (with coefficient −0.02) proportional to |stage(t − 1) − stage(t)|.
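Read literally from Eq. (3) and the shaping term, the per-step reward can be written as the short function below. This is our reading of the equation with the stated constants (a = −0.4, b = −0.1, p = 1.5, shaping coefficient −0.02, stages 0 through 4); the exact implementation in PandemicSimulator may differ in detail.

```python
# Reward of Eq. (3) plus the stage-change shaping term, as described in the text.
# Constants follow the paper: a = -0.4, b = -0.1, p = 1.5, shaping coefficient -0.02.
A, B, P = -0.4, -0.1, 1.5
SHAPING = -0.02
MAX_STAGE = 4  # stages 0..4, stage 4 most restrictive

def reward(num_critical, hospital_capacity, stage, prev_stage):
    over_capacity = max((num_critical - hospital_capacity) / hospital_capacity, 0.0)
    economic_cost = (stage ** P) / (MAX_STAGE ** P)      # normalized restrictiveness
    stage_change = abs(stage - prev_stage)               # discourages frequent switching
    return A * over_capacity + B * economic_cost + SHAPING * stage_change

# Example: 12 critical patients with capacity 10, holding at stage 3.
print(reward(num_critical=12, hospital_capacity=10, stage=3, prev_stage=3))
```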
This linear mapping of stages into a [0, 1] reward space is arbitrary; if PandemicSimulator were being used to make real policy decisions, policy makers would use values that represent the real economic costs of the different stages.

Training: We use the discrete-action Soft Actor-Critic (SAC; Haarnoja et al. 2018) off-policy RL algorithm to optimize a reopening policy, where the actor and critic networks are two-layer multi-layer perceptrons with 128 hidden units. One motivation for using SAC over deep Q-learning approaches such as DQN (Mnih et al. 2015) is that we can provide the true infection summary as input to the critic while letting the actor see only the observed infection summary. Training is episodic, with each episode lasting 120 simulated days. At the end of each episode, the environment is reset to an initial state. Refer to Appendix A, Table 1 for learning parameters.
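The asymmetric input scheme described above (the critic sees the true infection summary, the actor only the observed one) can be sketched as two small PyTorch modules. This is a minimal illustration under our assumptions about input sizes; it is not the actual training code, and the real hyperparameters are listed in Appendix A, Table 1.

```python
# Minimal sketch of the asymmetric actor/critic described in the text:
# two-layer MLPs with 128 hidden units; the critic additionally receives the
# true infection summary, while the actor sees only the observed summary.
# Input/output sizes below are illustrative assumptions, not the paper's exact ones.
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=128):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

NUM_ACTIONS = 3          # decrease / keep / increase the regulation stage
OBS_DIM = 16             # observed (noisy) infection summary + current stage, illustrative
TRUE_DIM = 10            # ground-truth infection summary, available only inside the simulator

actor = mlp(OBS_DIM, NUM_ACTIONS)                  # outputs action logits
critic = mlp(OBS_DIM + TRUE_DIM, NUM_ACTIONS)      # Q-values, conditioned on the true state too

obs = torch.zeros(1, OBS_DIM)
true_summary = torch.zeros(1, TRUE_DIM)
action_logits = actor(obs)
q_values = critic(torch.cat([obs, true_summary], dim=-1))
```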
5 Experiments

The purpose of PandemicSimulator is to enable a more realistic evaluation of potential government policies for pandemic mitigation. In this section, we validate that the simulation behaves as expected under controlled conditions, illustrate some of the many analyses it facilitates, and, most importantly, demonstrate that it enables optimization via RL. Unless otherwise specified, we consider a community size of 1,000 and a hospital capacity of 10.[5] To enable calibration with real data, we limit government actions to five regulation stages similar to those used by real-world cities[6] (see appendix for details), and assume the government does not act until at least five people are infected.

[5] PandemicSimulator can easily handle larger experiments at the cost of greater time and computation. Informal experiments showed that results from a population of 1k are generally consistent with results from a larger population when all other settings are the same (or proportional). Refer to Table 7 in the appendix for simulation times for 1k and 10k population environments.
[6] Such as at https://tinyurl.com/y3pjthyz

Figure 3 shows plots of a single simulation run with no government regulations (Stage 0). Figure 3(a) shows the number of people in each infection category per day. Without government intervention, all individuals get infected, with the infection peaking around the 25th day. Figure 3(b) shows the metrics observed by the government through the lens of testing and hospitalizations. This plot illustrates how the government sees information that is both an underestimate of the penetration and delayed in time from the true state. Finally, Figure 3(c) shows that the number of people in critical condition goes well above the maximum hospital capacity (denoted with a yellow line), resulting in many people being more likely to die. The goal of a good reopening policy is to keep the red curve below the yellow line, while keeping as many businesses open as possible.

[Figure 3: A single run of the simulator with no government restrictions, showing (a) the true global infection summary, (b) the perceived infection state, and (c) the number of people in critical condition over time.]

Figure 4 shows plots of our infection metrics averaged over 30 randomly seeded runs. Each row in Figures 4(a-o) shows the results of executing a different (constant) regulation stage (after a short initial S0 phase), where S4 is the most restrictive and S0 is no restrictions. As expected, Figures 4(p-r) show that the infection peaks, critical cases, and number of deaths are all lower for more restrictive stages. One way of explaining the effects of these regulations is that the government restrictions alter the connectivity of the contact graph. For example, in the experiments above, under stage 4 restrictions there are many more connected components in the resulting contact graph than in any of the other four cases. See Appendix A for details of this analysis. Higher stage restrictions, however, have increased socio-economic costs (Figure 4(s); computed using the second objective in Eq. 3). Our RL experiments illustrate how these competing objectives can be balanced.

[Figure 4: Simulator dynamics at different regulation stages. The plots are generated based on 30 different randomly seeded runs of the simulator. Mean is shown by a solid line and variance either by a shaded region or an error line. In the left set of graphs, the red line at the top indicates what regulation stage is in effect on any given day.]

A key benefit of PandemicSimulator's agent-based approach is that it enables us to evaluate more dynamic policies[7] than those described above. In the remainder of this section we compare a set of hand-constructed policies, examine approximations of two real countries' policies, and study the impact of contact tracing. In Appendix A we also provide an analysis of the model's sensitivity to its parameters. Finally, we demonstrate the application of RL to construct dynamic policies that achieve the goal of avoiding exceeding hospital capacity while minimizing economic costs. As in Figure 4, throughout this section we report our results using plots that are generated by executing 30 simulator runs with fixed seeds. All our experiments were run on a single core, using an Intel i7-7700K CPU @ 4.2GHz with 32GB of RAM.

[7] In this paper, we use the word "policy" to mean a function from the state of the world to the regulatory action taken. It represents both the government's policy for combating the pandemic (even if heuristic) and the output of an RL optimization.

Benchmark Policies

To serve as benchmarks, we defined three heuristic policies and two policies inspired by real governments' approaches to managing the pandemic; a short sketch of the three heuristic schedules follows the list.

• S0-4-0: Using this policy, the government switches from stage 0 to 4 after reaching a threshold of 10 infected persons. After 30 days, it switches directly back to stage 0;
• S0-4-0-FI: The government starts like S0-4-0, but after 30 days it executes a fast, incremental (FI) return to stage 0, with intermediate stages lasting 5 days;
• S0-4-0-GI: This policy implements a more gradual, incremental (GI) return to stage 0, with each intermediate stage lasting 10 days;
• SWE: This policy represents the one adopted by the Swedish government, which recommended, but did not require, remote work, and was generally unrestrictive.[8] Table 4 in the appendix shows how we mapped this policy into a 2-stage action space;
• ITA: This policy represents the one adopted by the Italian government, which was generally much more restrictive.[9] Table 5 in the appendix shows our mapping of this policy to a 5-stage action space.

[8] https://tinyurl.com/y57yq2x7; https://tinyurl.com/y34egdeg
[9] https://tinyurl.com/y3cepy3m
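The three heuristic benchmarks can be written as simple day-indexed stage schedules once the lockdown has been triggered. The sketch below follows the descriptions above (hold stage 4 for 30 days, then return to stage 0 immediately, in 5-day steps, or in 10-day steps); it is our illustrative reading, not the exact benchmark implementation.

```python
# Heuristic benchmark schedules described above, written as day-indexed stage choices.
# days_since_lockdown counts days since the switch to stage 4 (triggered at 10 infections).
def s0_4_0(days_since_lockdown):
    return 4 if days_since_lockdown < 30 else 0

def s0_4_0_fi(days_since_lockdown, step_days=5):   # fast incremental return to stage 0
    if days_since_lockdown < 30:
        return 4
    return max(3 - (days_since_lockdown - 30) // step_days, 0)

def s0_4_0_gi(days_since_lockdown):                # gradual incremental, 10-day steps
    return s0_4_0_fi(days_since_lockdown, step_days=10)
```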
Figure 5 compares the hand-constructed policies. From the point of view of minimizing overall mortality, S0-4-0-GI performed best. In particular, slower re-openings ensure longer but smaller peaks. While this approach leads to a second wave right after stage 0 is reached, the gradual policy prevents hospital capacity from being exceeded.

Figure 5 also contrasts the approximations of the policies employed by Sweden and Italy in the early stages of the pandemic (through February 2020). The ITA policy leads to fewer deaths and only a marginally longer duration. However, this simple comparison does not account for the economic cost of the policies, an important factor that is considered by decision-makers.

[Figure 5: Simulator dynamics under different hand-constructed and reference government policies.]

Testing and Contact Tracing

To validate PandemicSimulator's ability to model testing and contact tracing, we compare several strategies with different testing rates and contact horizons. We consider daily testing rates of 0.02, 0.3, and 1.0 (where 1.0 represents the extreme case of everyone being tested every day) and contact tracing histories of 0, 2, 5, or 10 days. For each condition, we ran the experiments with the same 30 random seeds. The full results appear in Appendix A.

Not surprisingly, contact tracing is most beneficial with higher testing rates and longer contact histories, because more testing finds more infected people and contact tracing is able to encourage more of those people's contacts to stay home. Of course, the best strategy is to test every person every day and quarantine anyone who tests positive. Unfortunately, this strategy is impractical except in the most isolated communities. Although this aggressive strategy often stamps out the disease, false-negative test results sometimes allow the infection to simmer below the surface and spread very slowly through the population.

Optimizing Reopening using RL

A major design goal of PandemicSimulator is to support optimization of re-opening policies using RL. In this section, we test our hypothesis that a learned policy can outperform the benchmark policies. Specifically, RL optimizes a policy that (a) is adaptive to the changing infection state, (b) keeps the number of critical patients below the hospital threshold, and (c) minimizes the economic cost.

We ran experiments using the 5-stage regulations defined in Table 3 (Appendix A); trained the policy by running RL optimization for roughly 1 million training steps; and evaluated the learned policies across 30 randomly seeded initial conditions. Figures 6(a-f) show results comparing our best heuristic policy (S0-4-0-GI) to the learned policy.
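As a sketch of this evaluation protocol (30 fixed seeds, 120-day episodes, one regulation decision per day), the loop below shows how a policy could be rolled out against a Gym-style environment. The environment constructor and policy are passed in as parameters; the reset()/step() convention is the generic classic-Gym one, not the specific PandemicSimulator Gym interface.

```python
# Sketch of the evaluation protocol: roll a policy out for 120-day episodes
# over 30 fixed seeds. `make_env` and `policy` are supplied by the caller;
# the env is assumed to follow the classic Gym reset()/step() convention.
EPISODE_DAYS = 120
NUM_SEEDS = 30

def evaluate(make_env, policy):
    returns = []
    for seed in range(NUM_SEEDS):
        env = make_env(seed)
        obs = env.reset()
        total = 0.0
        for _ in range(EPISODE_DAYS):
            action = policy(obs)                    # e.g. decrease/keep/increase the stage
            obs, reward, done, info = env.step(action)
            total += reward
            if done:
                break
        returns.append(total)
    return returns
```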
[Figure 6: Simulator runs comparing the S0-4-0-GI heuristic policy with a learned policy. The figure also shows results of the learned policy evaluated at different action frequencies and in a larger population environment.]

The learned policy is better across all metrics, as shown in Figures 6(m-p). Further, we can see how the learned policy reacts to the state of the pandemic; Figure 6(f) shows different traces through the regulation space for 3 of the trials. The learned policy briefly oscillates between Stages 3 and 4 around day 40. To minimize such oscillations, we evaluated the policy at an action frequency of one action every 3 days (twice weekly; labeled Eval_w3) and every 7 days (weekly; labeled Eval_w7). Figure 6(p) shows that the twice-weekly variant performs well, while making changes only once a week slightly reduces the reward.
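The lower action frequencies (Eval_w3, Eval_w7) amount to holding each chosen stage fixed for a set number of days. A minimal sketch of such a wrapper around an otherwise daily policy is shown below; the wrapper name and interface are ours, chosen only for illustration.

```python
# Sketch of evaluating a daily policy at a lower action frequency: the wrapped
# policy recomputes its stage only every `repeat_days` days and holds it otherwise.
def with_action_repeat(policy, repeat_days):
    state = {"day": 0, "action": None}
    def wrapped(obs):
        if state["day"] % repeat_days == 0 or state["action"] is None:
            state["action"] = policy(obs)
        state["day"] += 1
        return state["action"]
    return wrapped

# Eval_w3 and Eval_w7 variants of a learned daily policy:
# eval_w3 = with_action_repeat(learned_policy, repeat_days=3)
# eval_w7 = with_action_repeat(learned_policy, repeat_days=7)
```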
To test robustness to scaling, we also evaluated the learned policy (with daily actions) in a town with a population of 10,000 (Eval_10k) and found that the results transfer well. This success hints at the possibility of learning policies quickly even when intending to transfer them to large cities.

This section presented results on applying RL to optimize reopening policies. An interesting next step would be to study and distill the learned policies into simpler rule-based strategies, to make them easier for policy makers to implement. For example, in Figure 6(l), we see that the RL policy waits at stage 2 before reopening schools, to keep the second wave of infections under control. Whether this behavior is specific to school reopening is one of many interesting questions that this type of simulator allows us to investigate.

6 Conclusion

Epidemiological models aim at providing predictions regarding the effects of various possible intervention policies that are typically manually selected. In this paper, instead, we introduce a reinforcement learning methodology for optimizing adaptive mitigation policies aimed at maximizing the degree to which the economy can remain open without overwhelming the local hospital capacity. To this end, we implement an open-source agent-based simulator, where pandemics can be generated as the result of the contacts and interactions between individual agents in a community. We analyze the sensitivity of the simulator to some of its main parameters and illustrate its main features, while also showing that adaptive policies optimized via RL achieve better performance when compared to heuristic policies and policies representative of those used in the real world.

While our work opens up the possibility of using machine learning to explore fine-grained policies in this context, PandemicSimulator could be expanded and improved in several directions. One important direction for future work is to perform a more complete and detailed calibration of its parameters against real-world data. It would also be useful to implement and analyze additional testing and contact tracing strategies to contain the spread of pandemics.
References

Aleta, A.; Martín-Corral, D.; y Piontti, A. P.; Ajelli, M.; Litvinova, M.; Chinazzi, M.; Dean, N. E.; Halloran, M. E.; Longini Jr, I. M.; Merler, S.; et al. 2020. Modelling the impact of testing, contact tracing and household quarantine on second waves of COVID-19. Nature Human Behaviour 1–8.

Bansal, S.; Grenfell, B. T.; and Meyers, L. A. 2007. When individual behaviour matters: homogeneous and network models in epidemiology. Journal of the Royal Society Interface 4(16): 879–891.

Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; and Zaremba, W. 2016. OpenAI Gym. arXiv preprint arXiv:1606.01540.

Cobey, S. 2020. Modeling infectious disease dynamics. Science.

Del Valle, S. Y.; Mniszewski, S. M.; and Hyman, J. M. 2013. Modeling the impact of behavior changes on the spread of pandemic influenza. In Modeling the Interplay Between Human Behavior and the Spread of Infectious Diseases, 59–77. Springer.

Duque, D.; Morton, D. P.; Singh, B.; Du, Z.; Pasco, R.; and Meyers, L. A. 2020. COVID-19: How to Relax Social Distancing If You Must. medRxiv doi:10.1101/2020.04.29.20085134. URL https://www.medrxiv.org/content/early/2020/05/05/2020.04.29.20085134.

Grefenstette, J. J.; Brown, S. T.; Rosenfeld, R.; DePasse, J.; Stone, N. T.; Cooley, P. C.; Wheaton, W. D.; Fyshe, A.; Galloway, D. D.; Sriram, A.; et al. 2013. FRED (A Framework for Reconstructing Epidemic Dynamics): an open-source software system for modeling infectious diseases and control strategies using census-based populations. BMC Public Health 13(1): 1–14.

Haarnoja, T.; Zhou, A.; Abbeel, P.; and Levine, S. 2018. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. In International Conference on Machine Learning, 1861–1870.

Hoertel, N.; Blachier, M.; Blanco, C.; Olfson, M.; Massetti, M.; Sánchez Rico, M.; Limosin, F.; and Leleu, H. 2020. A stochastic agent-based model of the SARS-CoV-2 epidemic in France. Nature Medicine.

Khadilkar, H.; Ganu, T.; and Seetharam, D. P. 2020. Optimising Lockdown Policies for Epidemic Control using Reinforcement Learning. Transactions of the Indian National Academy of Engineering.

Larremore, D. B.; Wilder, B.; Lester, E.; Shehata, S.; Burke, J. M.; Hay, J. A.; Tambe, M.; Mina, M. J.; and Parker, R. 2020. Test sensitivity is secondary to frequency and turnaround time for COVID-19 surveillance. medRxiv.

Libin, P.; Moonens, A.; Verstraeten, T.; Perez-Sanjines, F.; Hens, N.; Lemey, P.; and Nowé, A. 2020. Deep reinforcement learning for large-scale epidemic control. arXiv preprint arXiv:2003.13676.

Liu, C. 2020. A microscopic epidemic model and pandemic prediction using multi-agent reinforcement learning. arXiv preprint arXiv:2004.12959.

Liu, Q.-H.; Ajelli, M.; Aleta, A.; Merler, S.; Moreno, Y.; and Vespignani, A. 2018. Measurability of the epidemic reproduction number in data-driven contact networks. Proceedings of the National Academy of Sciences 115(50): 12680–12685. doi:10.1073/pnas.1811115115. URL https://www.pnas.org/content/115/50/12680.

Metcalf, C. J. E.; and Lessler, J. 2017. Opportunities and challenges in modeling emerging infectious diseases. Science.

Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature 518(7540): 529–533.

Rivers, C. M.; and Scarpino, S. V. 2018. Modelling the trajectory of disease outbreaks works. Nature.

Song, S.; Zong, Z.; Li, Y.; Liu, X.; and Yu, Y. 2020. Reinforced Epidemic Control: Saving Both Lives and Economy. arXiv preprint arXiv:2008.01257.

Sutton, R. S.; and Barto, A. G. 2018. Reinforcement Learning: An Introduction. MIT Press.

Tolles, J.; and Luong, T. 2020. Modeling Epidemics With Compartmental Models. JAMA.

Xiao, Y.; Yang, M.; Zhu, Z.; Yang, H.; Zhang, L.; and Ghader, S. 2020. Modeling indoor-level non-pharmaceutical interventions during the COVID-19 pandemic: a pedestrian dynamics-based microscopic simulation approach. arXiv preprint arXiv:2006.10666.