Robust Lock-Down Optimization for COVID-19 Policy Guidance

Ankit Bhardwaj2*, Han Ching Ou1*, Haipeng Chen1, Shahin Jabbari1, Milind Tambe1, Rahul Panicker2, Alpan Raval2
1 Harvard University, hou@g.harvard.edu, {hpchen, jabbari, tambe}@seas.harvard.edu
2 WadhwaniAI, {bhardwaj, rahul, alpan}@wadhwaniai.org
* Equal Contribution

AAAI Fall 2020 Symposium on AI for Social Good. Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

As the COVID-19 outbreak continues to pose a serious worldwide threat, numerous governments have chosen to establish lock-downs in order to reduce disease transmission. However, imposing the strictest possible lock-down at all times has dire economic consequences, especially in areas with widespread poverty. In fact, many countries and regions have started charting paths to ease lock-down measures. Thus, planning efficient ways to tighten and relax lock-downs is a crucial and urgent problem. We develop a reinforcement learning based approach that is (1) robust to a range of parameter settings, and (2) optimizes multiple objectives related to different aspects of public health and the economy, such as hospital capacity and delay of the disease. The absence of a vaccine or a cure for COVID to date implies that the infected population cannot be reduced through pharmaceutical interventions. However, non-pharmaceutical interventions (lock-downs) can slow disease spread and keep it manageable. This work focuses on how to manage the disease spread without severe economic consequences.

Introduction

While governments are responding to the spread of COVID-19 by imposing lock-downs of varying intensity to reduce human-human contact, the situation cannot be maintained indefinitely. Each day of lock-down brings severe economic loss affecting the livelihood of billions. Thus, it is imperative to use the available intervention resources (lock-downs, test kits, ventilators, etc.) in an efficient manner. This work aims to find optimal lock-down policies based on epidemiological models and reinforcement learning.

Reinforcement learning has shown promising results on sequential decision-making tasks like Go (Silver et al. 2016) and autonomous driving (Pan et al. 2017). On these tasks, training from real-world data directly is too expensive due to the costly data collection process, so learning the agent from simulations is necessary. However, simulations do not reflect the real world exactly, because of uncertainties introduced while fitting the simulation model (Christiano et al. 2016). A policy that is unaware of such uncertainties can be dangerous, which is especially true for our task.

In addition to insufficient data and uncertainty, it is hard for our problem to specify a single objective that one wants to achieve. It is likely that many objectives need to be met for the task to be considered successful. For example, we may want to delay the peak of infections while making sure that our hospitals are not overburdened and our economy is not affected too severely. The decision maker in this case looks at the problem from several perspectives, leading to many possible objectives on which the model will be evaluated. Thus, in this work, we incorporate multi-objective functions.

The main contributions of this work can be summarized as follows:
• We formulate the problem of lock-down implementation as a Markov Decision Process (MDP). To solve this MDP, we propose a Reinforcement Learning (RL) approach that optimizes the trade-off between health objectives and economic cost.
• We tackle the uncertainty in environment parameters that might arise from noise in the data and the estimation process by considering different robust approaches.
• We analyse different robust approaches, including uniform sampling and adversarial sampling during the training phase. We find a trade-off between average-case and worst-case performance for RL agents with different degrees of risk aversion.
• We design different health objectives that might be of interest to decision-makers and measure our performance along these different objectives simultaneously.

With this work, we aim to address the challenging task of planning temporal resource allocation for lock-downs. The models that we use for modelling the spread of COVID are the SEIR class of epidemiological models.

Previous Work

Since as early as the 17th century, when Bernoulli proposed the first mathematical epidemic model for smallpox (Bernoulli and Blower 2004), there have been numerous efforts in the modeling and control of epidemics. One important class of these models is compartmental models. These models, as their name suggests, divide the population into different health states (compartments) and model transitions of populations between these health states. The underlying assumption is that these compartments have homogeneously mixed populations. The Susceptible-Exposed-Infected-Recovered (SEIR) family of models are compartmental models with dynamics described by ordinary differential equations. Recently, there have been advances in fitting SEIR models with machine learning techniques (Bannur et al. 2021). In this work, we use a Susceptible-Exposed-Infected-Recovered-Deceased (SEIRD) model (detailed description in subsection Epidemic Model) to model the COVID-19 data. However, the technique we propose is applicable to any of the SEIR family of models.

Apart from epidemic modeling, the problem of optimizing cure and control for preventing the spread of disease is also of interest. However, most works in the computer science literature assume an idealistic model, such as every contact being known, no uncertainty in the disease parameters, or a strong cure/isolation that guarantees the recovery of individuals (Ball, Knock, and O'Neill 2015; Sun and Hsieh 2010; Wang 2005; Zhang and Prakash 2015; Ganesh, Massoulié, and Towsley 2005). None of these assumptions hold for most real-world diseases, such as the newly arisen COVID-19 pandemic, which has no cure as of the writing of this paper. Even when most settings are ideal, a small uncertainty can have serious implications on outcomes if not handled properly. For example, the impact of curing uncertainty under perfect observation is analyzed by Hoffmann and Caramanis (2018), who provide non-constructive, algorithm-independent bounds. We aim to address the challenging setting in which there are uncertainties in most of the parameters of the model.

Robust control is a branch of control theory with a long history. In particular, robustness toward parameter uncertainty addresses the performance drop that occurs when a model is transferred to its real-world application (Mannor et al. 2004). Numerous works (Nilim and El Ghaoui 2005, 2004; White III and Eldeib 1994) have tried to tackle such uncertainty under the robust MDP framework with different assumptions. In recent years, reinforcement learning has demonstrated promising results on a variety of MDP problems (Silver et al. 2016; Pan et al. 2017). For applications with high safety requirements, it is natural to combine robustness with reinforcement learning (Mihatsch and Neuneier 2002; Carpin, Chow, and Pavone 2016; Chow et al. 2017). Among these works, using an adversarial agent to adjust the environment and systematically discover potential risk has shown promising results in many real-world tasks (Pinto, Davidson, and Gupta 2017; Pattanaik et al. 2017). A recent algorithm using an adversarial framework is robust adversarial reinforcement learning (RARL) (Pinto et al. 2017), in which two agents are trained: one protagonist and one adversary providing attacks on input states and dynamics. In our work, similar to RARL, we use an adversarial agent to systematically search for risky environmental parameters for the policy.

Another important consideration for lock-down policy makers might be to include different desirable objectives in their decision-making. Multi-objective optimization has tremendous practical importance in many real-life applications (Deb 2014). Linear combinations of Pareto optimalities are often considered and solved when the system is easy to describe (Censor 1977). Multi-objective reinforcement learning (Roijers et al. 2013; Van Moffaert and Nowé 2014), however, is a relatively new research area that has been actively studied only in recent years. In this work, we design a Quality Adjusted Life Year (QALY) value function to calculate a suitable reward signal for any mixture of two QALY-variant objectives. Such a function can also be used to gauge the difficulty of optimizing the two objectives simultaneously.

Modelling

Epidemic Model

For modelling COVID-19, we adopt a discrete-time SEIRD model (Weitz and Dushoff 2015). The SEIRD class of models is part of the compartmental models mentioned above. An individual can be in one of the following health states: S (a healthy individual susceptible to the disease), E (the individual has been exposed and has latent disease), I (the individual is infected), R (the individual is recovering and is no longer infectious to others), or D (the individual is deceased). Table 1 summarizes the symbols we use throughout this paper.

Table 1: The notations used across this paper.
Health States
  S — Susceptible fraction of the population
  E — Exposed fraction of the population
  I — Infected fraction of the population
  R — The fraction of the population that is either completely recovered or is undergoing recovery and is no longer infectious
  D — Deceased fraction of the population
Transmission
  t — Current time (day)
  R0 — Basic reproductive number
  Tinc — The incubation time
  Tinf — The duration for which individuals are infectious
  Trecover — Time for individuals to recover or quarantine in hospital
  Tfatal — Time for a fatally infected individual to die
Intervention
  a — Action (intervention strength)
  l(a) — Cost of the action per day
  e — Lock-down effectiveness coefficient
  d — Minimum duration of the intervention
  Ttrans — Transmission delay after lock-down a is deployed
Objectives
  λ — Hospital capacity
  δ — Economic-health cost weight
The discrete-time dynamics equations for our epidemic model are:

S_{t+1} - S_t = -\frac{S_t I_t}{T_{trans}(a, e)},   (1)

E_{t+1} - E_t = \frac{S_t I_t}{T_{trans}(a, e)} - \frac{E_t}{T_{inc}},   (2)

I_{t+1} - I_t = \frac{E_t}{T_{inc}} - \frac{I_t}{T_{inf}},   (3)

R_{t+1} - R_t = \frac{I_t}{T_{recover}},   (4)

D_{t+1} - D_t = \frac{I_t}{T_{fatal}},   (5)

in which 1/T_{inf} = 1/T_{recover} + 1/T_{fatal}, and the basic reproductive number can be obtained from R_0 = T_{inf} / T_{trans}. A typical SEIRD model as described above starts from a population that is mostly susceptible with a small fraction of infectious people. When R_0 > 1, each infected individual will, on average, infect more than one susceptible individual in its lifetime. Each susceptible individual eventually passes through the exposed and infectious states to the recovered or deceased states. A schematic diagram of the SEIRD model is shown in Figure 1.

Figure 1: A schematic diagram of the SEIRD model.

Generally in compartmental models, T_trans is a constant. However, in a real-world setting, transmission can be slowed (i.e., T_trans can be increased) through the deployment of non-pharmaceutical lock-down interventions. Note that there is no direct reduction in the infected population, as there is no cure or vaccination available. We only consider lock-down interventions that increase the transmission time of the virus according to their strength.

Compartmental models and their variants are commonly used in disease state forecasting and prediction. For concreteness, in this work, the state populations and the numerical values of the transmission parameters are based on the available data for the city of Mumbai (Group 2020). However, it is worth noting that both the intervention model and the planning algorithm we propose apply to most, if not all, of the SEIR model variants.

Intervention Modelling

As there is no cure or vaccine available for COVID-19 to date, models of pharmaceutical interventions are not applicable. To manage the rapid disease spread, different lock-down policies can be considered to limit individual contacts. For example, Ferguson et al. (2020) considered different levels of lock-down for the British population, such as case isolation at home, voluntary home quarantine, social distancing of those over 70 years of age, social distancing of the entire population, and closure of schools and universities, all of which have different cost and effectiveness.

These lock-down policies should enjoy several desiderata for real-world deployment. First, each type of intervention needs to last for a minimum duration d. Second, since a lock-down has economic cost, government bodies would expect a trade-off between the total budget B spent on planning and policy deployment and the public-health gains. To model such interventions, we consider lock-downs as a series of action choices: the decision maker can plan different policies in different time periods subject to the limitations mentioned above.

During the lock-down period in India, a change in the estimated transmission time has been observed as an effect of the interventions (Group 2020). This corresponds to T_trans in the SEIRD model we propose, and the effectiveness e varies across regions. We therefore model each action as extending the transmission time to a different degree at a different cost per day. Such a sequence of actions forms an intervention vector a of length T, i.e., a planning schedule with a total cost.
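To make the intervention model concrete, the following minimal sketch (not the authors' code; all parameter values, function names, and the fixed action cost are illustrative assumptions) simulates the discrete-time SEIRD updates in equations 1–5 with an action-dependent transmission time, using the linear cost-to-effectiveness mapping introduced later in equation 9.

```python
import numpy as np

def t_trans(cost, e, T_inf, R0):
    """Transmission time under a lock-down of normalized per-day cost c(a) in [0, 1].
    Uses the linear mapping T_trans(a, e) = (T_inf / R0) * (1 + e * c(a)) from equation 9."""
    return (T_inf / R0) * (1.0 + e * cost)

def seird_step(state, cost, p):
    """One discrete-time SEIRD update (equations 1-5); `state` holds population fractions."""
    S, E, I, R, D = state
    T_inf = 1.0 / (1.0 / p["T_recover"] + 1.0 / p["T_fatal"])  # 1/T_inf = 1/T_recover + 1/T_fatal
    Tt = t_trans(cost, p["e"], T_inf, p["R0"])
    new_exposed    = S * I / Tt            # S -> E
    new_infectious = E / p["T_inc"]        # E -> I
    new_recovered  = I / p["T_recover"]    # I -> R
    new_deceased   = I / p["T_fatal"]      # I -> D
    return (S - new_exposed,
            E + new_exposed - new_infectious,
            I + new_infectious - new_recovered - new_deceased,
            R + new_recovered,
            D + new_deceased)

# Illustrative rollout: 200 days under a fixed medium-strength lock-down (placeholder parameters).
params = {"e": 0.5, "T_inc": 5.0, "T_recover": 10.0, "T_fatal": 30.0, "R0": 2.5}
state = (0.999, 0.0, 0.001, 0.0, 0.0)
trajectory = [state]
for _ in range(200):
    state = seird_step(state, cost=0.5, p=params)
    trajectory.append(state)
infections = np.array([s[2] for s in trajectory])  # infection curve I(t) over the horizon
```

A planning schedule would replace the fixed `cost=0.5` with a per-day cost sequence l(a_t) chosen by the decision maker.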
Multi-Objective Functions

In the public health domain, governments and decision makers may want to achieve different objectives when deploying a policy. One direct objective could be eliminating the disease, which can be achieved by suppressing contacts so that patients recover at a rate greater than the rate of new infections; this is equivalent to minimizing the area under the infection curve. However, this is not achievable in many regions, including cities in many developing countries, due to the huge economic cost of such strict lock-downs. Thus, we focus on economically sustainable interventions that do not reduce R_0 below 1. In epidemic theory, this means the disease cannot be eliminated within a reasonable time no matter how the government plans the lock-down in these regions: every susceptible individual will eventually pass through the recovered or deceased state. In other words, although the infection curve will change, the area under the curve will remain the same.

To evaluate the effectiveness of lock-down policies under these circumstances, we use indirect objectives that are vital and achievable for sustainable interventions. For example, as there is limited hospital capacity, a patient's quality of life will likely be better when the infected population does not exceed that capacity and they can receive proper treatment. Alternatively, we may want to delay the infection to a point at which we have better system preparedness, medicines, resources, etc. for handling the disease. These different desired objectives can be described as a family of objective functions that are variants of the Quality Adjusted Life Year (QALY) score in our model, as elaborated below.

QALY is a popular, established metric for quantifying the effectiveness of health interventions, and it is often used in the public health literature (Salomon et al. 2012). It measures the effectiveness of an intervention by combining the quantity and quality of health improvement. Specifically, a person's life quality at any given time is mapped to [0, 1], with quality 1 corresponding to perfect health and 0 corresponding to death; different disease conditions lie in between depending on severity. QALY accumulates these measurements over time as its final score.

In this work, we change the time scale from years to days to adapt to the dynamics of the disease we are facing. We mainly focus on two objectives, burden and delay, defined as:

O_{Burden} = \sum_t \left( (I(t) - \lambda)\,\mathbb{1}_{I(t) > \lambda} - \delta_{Burden}\, l(a_t) \right),   (6)

O_{Delay} = \sum_t \left( t\, I(t) - \delta_{Delay}\, l(a_t) \right),   (7)

where t refers to the timestep; a_t, I(t), and l(a_t) refer to the action, the infected population fraction, and the cost of the action at time t, respectively; δ and λ refer to the economy-health weight and the hospital capacity; and 1 is the indicator function. Here, we focus on optimizing a linear combination of these two objective functions, written as

O_{mix}(w) = w\, \bar{O}_{Burden} + (1 - w)\, \bar{O}_{Delay},   (8)

where the weight w ranges from 0 to 1 and \bar{O} is O normalized by the absolute value of the no-intervention objective, i.e., we divide O by its absolute value in the absence of interventions.
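As a concrete reading of equations 6–8, the following sketch (illustrative only; the function names, signatures, and the choice of normalization trajectory are our assumptions, not the paper's code) evaluates the two objectives and their normalized mixture for an infection curve I(t), such as the one produced by the rollout above, and a per-day lock-down cost sequence l(a_t).

```python
import numpy as np

def burden_objective(I, costs, capacity, delta_burden):
    """O_Burden (equation 6): per-day excess of infections over hospital capacity,
    minus the economy-health-weighted lock-down cost."""
    I, costs = np.asarray(I), np.asarray(costs)
    excess = np.where(I > capacity, I - capacity, 0.0)
    return float(np.sum(excess - delta_burden * costs))

def delay_objective(I, costs, delta_delay):
    """O_Delay (equation 7): time-weighted infections minus the weighted lock-down cost."""
    I, costs = np.asarray(I), np.asarray(costs)
    t = np.arange(len(I))
    return float(np.sum(t * I - delta_delay * costs))

def mixed_objective(I, costs, I_free, w, capacity, delta_burden, delta_delay):
    """O_mix (equation 8): each objective is normalized by the absolute value it takes on
    the no-intervention trajectory I_free (which incurs zero lock-down cost)."""
    zero = np.zeros(len(I_free))
    o_b = burden_objective(I, costs, capacity, delta_burden)
    o_d = delay_objective(I, costs, delta_delay)
    o_b /= abs(burden_objective(I_free, zero, capacity, delta_burden))
    o_d /= abs(delay_objective(I_free, zero, delta_delay))
    return w * o_b + (1.0 - w) * o_d
```

Here `I_free` would be the same SEIRD rollout run with zero intervention cost, matching the normalization described after equation 8.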
Formulation Using MDP

Our lock-down control problem can be modeled as a Markov Decision Process (MDP) (Yang, Sun, and Narasimhan 2019). Over the last two decades, reinforcement learning (Sutton et al. 1998) has provided an effective framework for solving MDPs in both theory and application. This is especially true when the system dynamics are complicated or unknown, or when the state dimensionality is too high for classical optimal control methods. In addition, the environment parameters we estimate from real-world hospital data involve uncertainty that cannot be ignored, so the output policy needs to be robust to such uncertainty. We therefore consider a parameter-wise robust reinforcement learning model to solve the MDP. The MDP can be written as the tuple

\langle S, A, P, R \rangle

with state space S, action space A, transition distribution P(s' | s, a) for s, s' ∈ S and a ∈ A, vector reward r(s) ∈ R, and preference weight w ∈ R^n.

The states we consider are the fractions of the population present in the S, E, I, R, and D compartments at the given time. Furthermore, we consider several discrete actions at each time step, corresponding to lock-downs of different strengths and costs. It is natural to assume the strength to be monotonically increasing with the cost, as otherwise the action choice would be dominated by actions with less cost but more effectiveness. For simplicity, we adopt a linear mapping for both cost and effectiveness:

T_{trans}(a, e) = \frac{T_{inf}}{R_0}\, (1 + e\, c(a)),   (9)

with c : A → [0, 1], in which e is the lock-down effectiveness coefficient and R_0 the basic reproduction number when there are no lock-down interventions. Both of these are estimated with data from the city of Mumbai, India. T_trans and T_inf are the transmission and infection time periods.

For the remaining elements of the tuple, the transition distribution P(s' | s, a) is described by the disease transmission equations 1 to 5, and the total accumulated reward is exactly the objective function O in equations 6 and 7. The next section describes how the individual reward signals are distributed across states.

Reinforcement Learning Approach

Multiple Objectives: We have defined the state, action, transition probabilities, and total reward in the MDP section. The only missing piece for a complete reinforcement learning framework is to design the reward signal at every timestep. We have designed a framework that works not only for the two example objectives we focus on in this work, but for most variants of QALY.

Most variants of QALY, including our examples, are related to time and to the population in certain health states. We propose a function we call the QALY value, V(x, t), which is a function of the population x in a certain health state and the time t. We focus only on I, the infectious state, in these experiments; however, the QALY value function can be generalized to a vector form that includes multiple states. For controlling the hospital capacity, the function can be formulated either as a constant penalty for x exceeding the capacity or simply as a reward for x below the threshold, since the area under the infection curve is a constant, as elaborated in the section Multi-Objective Functions. We formalize this as:

V_{Burden}(x, t) = 1 \quad \text{for } x < \lambda.   (10)

As for delay, we formalize this function as

V_{Delay}(x, t) = \frac{t}{T}.   (11)

Given the QALY value V(x, t) of the objective function, the reward signal at any given time t can be calculated as r(t) = \int_0^{I(t)} V(x, t)\, dx. We can thus apply the reinforcement learning approach. One benefit of the proposed approach is that the QALY value function of the mixed objective can be easily calculated as:

V_{mix}(w) = \frac{w\, V_{Burden}}{O_{Burden}(\text{no action})} + \frac{(1 - w)\, V_{Delay}}{O_{Delay}(\text{no action})}.   (12)
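The per-step reward construction above can be read as in the following minimal sketch. It reflects our own assumptions (the midpoint discretization of the integral, the explicit horizon argument, and the omission of the lock-down cost term, which would be subtracted separately), not the paper's implementation.

```python
import numpy as np

def v_burden(x, t, capacity):
    """QALY value for the burden objective (equation 10): reward mass only below capacity."""
    return 1.0 if x < capacity else 0.0

def v_delay(x, t, horizon):
    """QALY value for the delay objective (equation 11): later infections earn more reward."""
    return t / horizon

def v_mix(x, t, w, capacity, horizon, o_burden_free, o_delay_free):
    """QALY value for the mixed objective (equation 12), normalized by the no-action objectives."""
    return (w * v_burden(x, t, capacity) / o_burden_free
            + (1.0 - w) * v_delay(x, t, horizon) / o_delay_free)

def step_reward(I_t, t, value_fn, n_grid=200):
    """Per-step reward r(t) = integral of V(x, t) over x in [0, I(t)],
    approximated with a midpoint Riemann sum over the infected fraction."""
    if I_t <= 0.0:
        return 0.0
    dx = I_t / n_grid
    xs = (np.arange(n_grid) + 0.5) * dx
    return float(sum(value_fn(x, t) for x in xs) * dx)
```

For the burden value function this integral evaluates to min(I(t), λ), i.e., the "reward for x below the threshold" variant, which is equivalent up to a constant to penalizing the excess in equation 6 because the area under the infection curve is fixed; for the delay value function it evaluates to t·I(t)/T, matching the time-weighted term in equation 7.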
Uncertainty: Another important aspect of the lock-down application, beyond the multi-objective function, is the uncertainty in the parameters (e, T_inf, T_inc), which are related to the infection curve directly or indirectly. We experiment with three approaches to analyze the effect of uncertainty in a reinforcement learning setup:

(1) Fixed RL (FRL): Train the RL agent using only the mean of the uncertain parameters.
(2) Distributed RL (DRL): Train the RL agent using samples of the uncertain parameters drawn from the estimated range.
(3) Adversarial RL (ARL): Inspired by (Pinto et al. 2017), train the RL agent together with an adversarial RL agent that maliciously picks the worst possible parameter set for the RL agent during training. Note that the worst-case parameters are not trivial to find, as they change with the policy. The action of the adversarial RL agent is set to be the discrete uncertain parameters of the disease model.
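The three schemes differ only in how environment parameters are chosen for each training episode. The sketch below is an illustration under our own assumptions: the environment and training-loop API, the candidate grid, and the direct policy-evaluation shortcut for the adversary are placeholders, whereas the paper trains a second RL agent to play the adversary.

```python
import random

def episode_parameters(mode, param_ranges, candidate_grid=None, policy_return=None):
    """Choose SEIRD parameters for one training episode.
    param_ranges: dict mapping parameter name -> (low, high) from the uncertainty estimation.
    candidate_grid: list of discrete parameter dicts available to the adversary (ARL only).
    policy_return: callable giving the current policy's return on a parameter dict (ARL only)."""
    if mode == "FRL":   # fixed: midpoint of every estimated range
        return {k: (lo + hi) / 2.0 for k, (lo, hi) in param_ranges.items()}
    if mode == "DRL":   # distributed: uniform sample from every estimated range
        return {k: random.uniform(lo, hi) for k, (lo, hi) in param_ranges.items()}
    if mode == "ARL":   # adversarial: pick the candidate on which the current policy does worst
        return min(candidate_grid, key=policy_return)
    raise ValueError(f"unknown mode: {mode}")

# Illustrative training loop (placeholder env/agent API):
# for episode in range(num_episodes):
#     params = episode_parameters("DRL", param_ranges)
#     state = env.reset(params)
#     ...roll out the protagonist policy and update it on the collected transitions...
```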
rameters based on the estimates given by public health ex- From Table 4, we observe that, as expected, as the weight perts and those cited in literature. We process the data with on Burden increases, the Burden objective becomes larger smoothing techniques to reduce the effect of bulk data en- for all methods in general. Similar behaviour is observed for try. Then, we search over the parameter space for parameter Delay as well. When applying this method for policy guid- sets that have a small aggregated RMSE loss between pre- ance, we can tune w to achieve the required objectives for dicted numbers and actual numbers using the Hyperopt li- both Burden and Delay. brary (Bergstra, Yamins, and Cox 2013). The parameter set giving the least loss value is taken to be the best-fit parame- ter set for the purposes of this experiment. Conclusions and Future Work We found that there are diverse parameter sets that have We implemented reinforcement learning on the lock-down loss close to the best-fit parameter set. Thus, we picked all policy optimization problem for COVID-19 while consider- parameter sets that have a loss within a certain range of the ing important real-world aspects like robustness and multi- best loss (within 10%). Among all picked parameter sets, objective optimization. Robustness can be achieved by intro- we find the range of values taken by individual parameters. ducing an adversarial agent for parameter discovery, but at These ranges for individual parameters give us a measure the cost of sacrificing some performance on average. For the of uncertainty for these parameters. We assume a uniform multi-objective mixture, we study the trade-off between con- distribution over these ranges as our parameter distribution. trolling hospital capacity and delaying the infection spread. We proposed a reward distribution framework for the rein- Analysis and Results forcement learning agent to shift from one objective to an- Robustness: Robustness of policy to uncertainty in parame- other in the lock-down problem. One point to note is that our ters is an important aspect. Over the estimated uniform dis- epidemiological model (SEIRD) is a homogeneous model tribution range, we find the worst-case parameters for dif- and is being used to optimize the policy keeping the trade-off ferent methods using a fine grid-search. Then, we measure between economy and health for the community as a whole. Model w Burden Delay Mixed References Random -1.074 0.675 0.675 FRL -1.372 1.478 1.478 Ball, F. G.; Knock, E. S.; and O’Neill, P. D. 2015. Stochas- 0.0 tic epidemic models featuring contact tracing with delays. DRL -1.155 1.711 1.711 ARL -1.442 1.727 1.727 Mathematical biosciences 266: 23–35. Random -1.230 0.843 0.318 Bannur, N.; Maheshwari, H.; Jain, S.; Shetty, S.; Merugu, FRL -1.017 1.154 0.611 S.; and Raval, A. 2021. Adaptive COVID-19 Forecasting 0.25 DRL -1.086 1.673 0.983 via Bayesian Optimization. In Proceedings of the ACM ARL -0.993 1.103 0.579 India Joint International Conference on Data Science and Random -1.073 0.641 -0.216 Management of Data, CoDS-COMAD ’21. New York, NY, FRL -0.622 1.207 0.293 USA: Association for Computing Machinery. doi:10.1145/ 0.5 DRL -0.738 1.196 0.229 3430984.3431047. URL https://doi.org/10.1145/3430984. ARL -0.912 1.208 0.148 3431047. Random -1.019 0.670 -0.597 FRL -0.550 1.090 -0.140 Bergstra, J.; Yamins, D.; and Cox, D. 2013. 
Analysis and Results

Robustness: Robustness of the policy to uncertainty in the parameters is an important consideration. Over the estimated uniform distribution range, we find the worst-case parameters for each method using a fine grid search. We then measure the performance of each method on its corresponding worst-case parameters, and we also compute the corresponding average performance over the parameter distribution. The results are tabulated in Tables 2 and 3. As these tables show, ARL helps the reinforcement learning agent discover risky parameters and thus performs best in its worst-case scenario. In the average case, however, ARL performs worse than the best method (FRL and DRL, respectively). This demonstrates the trade-off between performance and robustness in our lock-down problem: at the cost of average performance, we can obtain better worst-case performance.

Table 2: Model performance on the Burden objective in equation 6.
Model    Worst    Mean     Std
Random   -3.625   -2.030   0.500
FRL      -2.385   -1.234   0.372
DRL      -2.445   -1.313   0.467
ARL      -2.226   -1.279   0.385

Table 3: Model performance on the Delay objective in equation 7.
Model    Worst      Mean      Std
Random   -759.522   2.110     186.413
FRL      -29.486    163.403   39.201
DRL      -100.538   235.391   110.176
ARL      9.933      189.307   50.813

Different objectives: We use different weights between the Burden and Delay objectives and compare the results to the cases where we focus on Delay or Burden individually, in Table 4. The objective function we use is (1 − w) · Delay + w · Burden for different values of w; the aim is to maximize the normalized objective for both Burden and Delay.

From Table 4, we observe that, as expected, as the weight on Burden increases, the Burden objective generally becomes larger for all methods. Similar behaviour is observed for Delay as well. When applying this method for policy guidance, we can tune w to achieve the required objectives for both Burden and Delay.

Table 4: Model performance for mixed objectives. The scores are calculated based on equation 12.
w      Model    Burden   Delay   Mixed
0.0    Random   -1.074   0.675   0.675
       FRL      -1.372   1.478   1.478
       DRL      -1.155   1.711   1.711
       ARL      -1.442   1.727   1.727
0.25   Random   -1.230   0.843   0.318
       FRL      -1.017   1.154   0.611
       DRL      -1.086   1.673   0.983
       ARL      -0.993   1.103   0.579
0.5    Random   -1.073   0.641   -0.216
       FRL      -0.622   1.207   0.293
       DRL      -0.738   1.196   0.229
       ARL      -0.912   1.208   0.148
0.75   Random   -1.019   0.670   -0.597
       FRL      -0.550   1.090   -0.140
       DRL      -0.818   1.159   -0.324
       ARL      -0.703   0.932   -0.294
1.0    Random   -1.193   0.587   -1.193
       FRL      -0.734   1.400   -0.734
       DRL      -0.620   1.014   -0.620
       ARL      -0.671   1.001   -0.671

Conclusions and Future Work

We applied reinforcement learning to the lock-down policy optimization problem for COVID-19 while considering important real-world aspects such as robustness and multi-objective optimization. Robustness can be achieved by introducing an adversarial agent for parameter discovery, but at the cost of sacrificing some average-case performance. For the multi-objective mixture, we study the trade-off between controlling hospital capacity and delaying the infection spread, and we propose a reward distribution framework that lets the reinforcement learning agent shift from one objective to another in the lock-down problem. One point to note is that our epidemiological model (SEIRD) is homogeneous and is used to optimize the policy by balancing the trade-off between economy and health for the community as a whole. The model does not discriminate between two infected individuals based on their economic contribution, nor is it capable of doing so; this helps ensure that the generated lock-down policy is as fair as possible.

A future direction of this work is to gather more data on both the cost and the effectiveness of real-world lock-down policies at the community scale, so that a more complex model can better estimate real-world scenarios. For example, transmission times are known not to be homogeneous, and several super-spreader events have been identified across different spreading routes. Collecting data on such cases and modifying the model to allow different transmission times for different modes of spread would give a more holistic view of the entire scenario. Another important extension would be estimating the reporting rate from other data sources and normalizing the reported numbers to estimate parameters that are closer to the real world.

Acknowledgements

This study is made possible by the generous support of the American People through the United States Agency for International Development (USAID) and the Army Research Office (ARO). The work described in this article was implemented under the TRACETB Project, managed by WIAI under the terms of Cooperative Agreement Number 72038620CA00006, and by Teamcore, CRCS, Harvard University under Multidisciplinary University Research Initiative grant number W911NF1810208. The contents of this manuscript are the sole responsibility of the authors and do not necessarily reflect the views of USAID, ARO or the United States Government.

References

Ball, F. G.; Knock, E. S.; and O'Neill, P. D. 2015. Stochastic epidemic models featuring contact tracing with delays. Mathematical Biosciences 266: 23–35.

Bannur, N.; Maheshwari, H.; Jain, S.; Shetty, S.; Merugu, S.; and Raval, A. 2021. Adaptive COVID-19 Forecasting via Bayesian Optimization. In Proceedings of the ACM India Joint International Conference on Data Science and Management of Data, CoDS-COMAD '21. New York, NY, USA: Association for Computing Machinery. doi:10.1145/3430984.3431047. URL https://doi.org/10.1145/3430984.3431047.

Bergstra, J.; Yamins, D.; and Cox, D. 2013. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In International Conference on Machine Learning, 115–123.

Bernoulli, D.; and Blower, S. 2004. An attempt at a new analysis of the mortality caused by smallpox and of the advantages of inoculation to prevent it. Reviews in Medical Virology 14(5): 275–288.

Carpin, S.; Chow, Y.-L.; and Pavone, M. 2016. Risk aversion in finite Markov Decision Processes using total cost criteria and average value at risk. In 2016 IEEE International Conference on Robotics and Automation (ICRA), 335–342. IEEE.

Censor, Y. 1977. Pareto optimality in multiobjective problems. Applied Mathematics and Optimization 4(1): 41–59.

Chow, Y.; Ghavamzadeh, M.; Janson, L.; and Pavone, M. 2017. Risk-constrained reinforcement learning with percentile risk criteria. The Journal of Machine Learning Research 18(1): 6070–6120.

Christiano, P.; Shah, Z.; Mordatch, I.; Schneider, J.; Blackwell, T.; Tobin, J.; Abbeel, P.; and Zaremba, W. 2016. Transfer from simulation to real world through learning deep inverse dynamics model. arXiv preprint arXiv:1610.03518.

Deb, K. 2014. Multi-objective optimization. In Search Methodologies, 403–449. Springer.

Ferguson, N.; Laydon, D.; Nedjati-Gilani, G.; Imai, N.; Ainslie, K.; Baguelin, M.; Bhatia, S.; Boonyasiri, A.; Cucunubá, Z.; Cuomo-Dannenburg, G.; et al. 2020. Report 9: Impact of non-pharmaceutical interventions (NPIs) to reduce COVID19 mortality and healthcare demand. Imperial College London 10: 77482.

Ganesh, A.; Massoulié, L.; and Towsley, D. 2005. The effect of network topology on the spread of epidemics. In Proceedings IEEE 24th Annual Joint Conference of the IEEE Computer and Communications Societies, volume 2, 1455–1466. IEEE.

Group, C.-. I. O. D. O. 2020. COVID19-India API. Accessed on yyyy-mm-dd from https://api.covid19india.org/.

Hoffmann, J.; and Caramanis, C. 2018. The Cost of Uncertainty in Curing Epidemics. Proceedings of the ACM on Measurement and Analysis of Computing Systems 2(2): 31.

Mannor, S.; Simester, D.; Sun, P.; and Tsitsiklis, J. N. 2004. Bias and variance in value function estimation. In Proceedings of the Twenty-First International Conference on Machine Learning, 72.

Mihatsch, O.; and Neuneier, R. 2002. Risk-sensitive reinforcement learning. Machine Learning 49(2-3): 267–290.

Nilim, A.; and El Ghaoui, L. 2004. Robustness in Markov decision problems with uncertain transition matrices. In Advances in Neural Information Processing Systems, 839–846.

Nilim, A.; and El Ghaoui, L. 2005. Robust control of Markov decision processes with uncertain transition matrices. Operations Research 53(5): 780–798.

Pan, X.; You, Y.; Wang, Z.; and Lu, C. 2017. Virtual to real reinforcement learning for autonomous driving. arXiv preprint arXiv:1704.03952.

Pattanaik, A.; Tang, Z.; Liu, S.; Bommannan, G.; and Chowdhary, G. 2017. Robust deep reinforcement learning with adversarial attacks. arXiv preprint arXiv:1712.03632.

Pinto, L.; Davidson, J.; and Gupta, A. 2017. Supervision via competition: Robot adversaries for learning tasks. In 2017 IEEE International Conference on Robotics and Automation (ICRA), 1601–1608. IEEE.

Pinto, L.; Davidson, J.; Sukthankar, R.; and Gupta, A. 2017. Robust adversarial reinforcement learning. arXiv preprint arXiv:1703.02702.

Roijers, D. M.; Vamplew, P.; Whiteson, S.; and Dazeley, R. 2013. A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research 48: 67–113.

Salomon, J. A.; Vos, T.; Hogan, D. R.; Gagnon, M.; Naghavi, M.; Mokdad, A.; Begum, N.; Shah, R.; Karyana, M.; Kosen, S.; et al. 2012. Common values in assessing health outcomes from disease and injury: disability weights measurement study for the Global Burden of Disease Study 2010. The Lancet 380(9859): 2129–2143.

Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529(7587): 484–489.

Sun, C.; and Hsieh, Y.-H. 2010. Global analysis of an SEIR model with varying population size and vaccination. Applied Mathematical Modelling 34(10): 2685–2697.

Sutton, R. S.; et al. 1998. Introduction to Reinforcement Learning, volume 135. MIT Press, Cambridge.

Van Moffaert, K.; and Nowé, A. 2014. Multi-objective reinforcement learning using sets of Pareto dominating policies. The Journal of Machine Learning Research 15(1): 3483–3512.

Wang, N. 2005. Modeling and analysis of massive social networks. Ph.D. thesis, UMD.

Weitz, J. S.; and Dushoff, J. 2015. Modeling post-death transmission of Ebola: challenges for inference and opportunities for control. Scientific Reports 5: 8751.

White III, C. C.; and Eldeib, H. K. 1994. Markov decision processes with imprecise transition probabilities. Operations Research 42(4): 739–749.

Yang, R.; Sun, X.; and Narasimhan, K. 2019. A generalized algorithm for multi-objective reinforcement learning and policy adaptation. In Advances in Neural Information Processing Systems, 14636–14647.

Zhang, Y.; and Prakash, B. A. 2015. Data-aware vaccine allocation over large networks. ACM Transactions on Knowledge Discovery from Data (TKDD) 10(2): 20.