Robust Lock-Down Optimization for COVID-19 Policy Guidance

Ankit Bhardwaj2*, Han Ching Ou1*, Haipeng Chen1, Shahin Jabbari1, Milind Tambe1, Rahul Panicker2, Alpan Raval2
1 Harvard University, hou@g.harvard.edu, {hpchen, jabbari, tambe}@seas.harvard.edu
2 WadhwaniAI, {bhardwaj, rahul, alpan}@wadhwaniai.org
* Equal Contribution

AAAI Fall 2020 Symposium on AI for Social Good. Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

As the COVID-19 outbreak continues to pose a serious worldwide threat, numerous governments have chosen to establish lock-downs in order to reduce disease transmission. However, imposing the strictest possible lock-down at all times has dire economic consequences, especially in areas with widespread poverty. In fact, many countries and regions have started charting paths to ease lock-down measures. Thus, planning efficient ways to tighten and relax lock-downs is a crucial and urgent problem. We develop a reinforcement learning based approach that is (1) robust to a range of parameter settings, and (2) optimizes multiple objectives related to different aspects of public health and the economy, such as hospital capacity and delay of the disease. The absence of a vaccine or a cure for COVID to date implies that the infected population cannot be reduced through pharmaceutical interventions. However, non-pharmaceutical interventions (lock-downs) can slow disease spread and keep it manageable. This work focuses on how to manage the disease spread without severe economic consequences.

Introduction

While governments are responding to the spread of COVID-19 by imposing lock-downs of varying intensity to reduce human-human contact, the situation cannot be maintained indefinitely. Each day of lock-down brings severe economic loss affecting the livelihood of billions. Thus, it is imperative to use the available intervention resources (lock-downs, test kits, ventilators, etc.) in an efficient manner. This work aims to find optimal lock-down policies based on epidemiological models and reinforcement learning.

Reinforcement learning has shown promising results on sequential decision-making tasks like Go (Silver et al. 2016) and autonomous driving (Pan et al. 2017). On these tasks, training from real-world data directly is too expensive due to the costly data collection process, so learning the agent from simulations is necessary. However, simulations do not reflect the real world exactly, because of uncertainties introduced while fitting the simulation model (Christiano et al. 2016). A policy that is unaware of such uncertainties can be dangerous, which is especially true for our task.

In addition to insufficient data and uncertainty, it is hard for our problem to specify a single objective that one wants to achieve. It is likely that many objectives need to be met for the task to be considered successful. For example, we may want to delay the peak of infections while making sure that our hospitals are not overburdened and our economy is not affected too severely. The decision maker in this case looks at the problem from several perspectives, leading to many possible objectives on which the model will be evaluated. Thus, in this work, we incorporate multi-objective functions.

The main contributions of this work can be summarized as follows:
• We formulate the problem of lock-down implementation as a Markov Decision Process (MDP). To solve this MDP, we propose a Reinforcement Learning (RL) approach that optimizes the trade-off between health objectives and economic cost.
• We tackle the uncertainty in environment parameters that might arise from noise in the data and the estimation process by considering different robust approaches.
• We analyse different robust approaches, including uniform sampling and adversarial sampling during the training phase. We find a trade-off between average-case and worst-case performance for RL agents with different degrees of risk aversion.
• We design different health objectives that might be of interest to decision-makers and measure our performance along these different objectives simultaneously.

With this work, we aim to address the challenging task of planning temporal resource allocation for lock-downs. The models that we use for modelling the spread of COVID are the SEIR class of epidemiological models.

Previous Work

Since as early as the 17th century, when Bernoulli proposed the first mathematical epidemic model for smallpox (Bernoulli and Blower 2004), there have been numerous efforts in the modeling and control of epidemics. One important class of these models is compartmental models. These models, as their name suggests, divide the population into different health states (compartments) and model transitions of populations between these health states. The underlying assumption is that these compartments have homogeneously mixed populations. The Susceptible-Exposed-Infected-Recovered (SEIR) family of models are compartmental models with dynamics described by ordinary differential equations. Recently, there have been advances in fitting SEIR models with machine learning techniques (Bannur et al. 2021). In this work, we use a Susceptible-Exposed-Infected-Recovered-Deceased (SEIRD) model (detailed description in subsection Epidemic Model) to model the COVID-19 data. However, the technique we propose is applicable to any of the SEIR family of models.

Apart from epidemic modeling, the problem of optimizing cure and control for preventing the spread of disease is also of interest. However, most works in the computer science literature assume an idealistic model, such as every contact being known, no uncertainty in the disease parameters, or a strong cure/isolation that guarantees the recovery of individuals (Ball, Knock, and O'Neill 2015; Sun and Hsieh 2010; Wang 2005; Zhang and Prakash 2015; Ganesh, Massoulié, and Towsley 2005). None of these assumptions hold for most real-world diseases, such as the newly arisen COVID-19 pandemic, which has no cure as of the writing of this paper. Even when most settings are ideal, a small uncertainty can have serious implications on outcomes if not handled properly. For example, the impact of curing uncertainty under perfect observation is analyzed by Hoffmann and Caramanis (2018), who provide non-constructive, algorithm-independent bounds. We aim to address the challenging setting in which there are uncertainties in most of the parameters of the model.

Robust control is a branch of control theory with a long history. In particular, robustness toward parameter uncertainty addresses the performance drop that occurs when a model is transferred to its real-world application (Mannor et al. 2004). Numerous works (Nilim and El Ghaoui 2005, 2004; White III and Eldeib 1994) have tried to tackle such uncertainty under the robust MDP framework with different assumptions. In recent years, reinforcement learning has demonstrated promising results on a variety of MDP problems (Silver et al. 2016; Pan et al. 2017). For applications with high safety requirements, it is natural to combine robustness with reinforcement learning (Mihatsch and Neuneier 2002; Carpin, Chow, and Pavone 2016; Chow et al. 2017). Among these works, using an adversarial agent to adjust the environment and systematically discover potential risk has shown promising results in many real-world tasks (Pinto, Davidson, and Gupta 2017; Pattanaik et al. 2017). A recent algorithm using an adversarial framework is robust adversarial reinforcement learning (RARL) (Pinto et al. 2017), in which two agents are trained: one protagonist and one adversary providing attacks on input states and dynamics. In our work, similar to RARL, we use an adversarial agent to systematically search for risky environmental parameters for the policy.

Another important consideration for lock-down policy makers might be to include different desirable objectives in their decision-making. Multi-objective optimization has tremendous practical importance in many real-life applications (Deb 2014). Linear combinations of Pareto optimalities are often considered and solved when the system is easy to describe (Censor 1977). Multi-objective reinforcement learning (Roijers et al. 2013; Van Moffaert and Nowé 2014), however, is a relatively new research area that has been actively studied only in recent years. In this work, we design a Quality Adjusted Life Year (QALY) value function to calculate a suitable reward signal for any mixture of two QALY-variant objectives. Such a function can also be used to gauge the difficulty of optimizing the two objectives simultaneously.

Modelling

Epidemic Model

For modelling COVID-19, we adopt a discrete-time SEIRD model (Weitz and Dushoff 2015). The SEIRD class of models is part of the compartmental models mentioned above. An individual can be in one of the following health states: S (a healthy individual susceptible to the disease), E (the individual has been exposed and has latent disease), I (the individual is infected), R (the individual is recovering and is no longer infectious to others), or D (the individual is deceased). Table 1 summarizes the symbols we use throughout this paper.

Table 1: The notations used across this paper.
Health States
  S — Susceptible fraction of the population
  E — Exposed fraction of the population
  I — Infected fraction of the population
  R — The fraction of the population that is either completely recovered or is undergoing recovery and is no longer infectious
  D — Deceased fraction of the population
Transmission
  t — Current time (day)
  R0 — Basic reproductive number
  Tinc — The incubation time
  Tinf — The duration for which individuals are infectious
  Trecover — Time for individuals to recover or quarantine in hospital
  Tfatal — Time for a fatally infected individual to die
Intervention
  a — Action (intervention strength)
  l(a) — Cost of the action per day
  e — Lock-down effectiveness coefficient
  d — Minimum duration of the intervention
  Ttrans — Transmission delay after lock-down a is deployed
Objectives
  λ — Hospital capacity
  δ — Economic-health cost weight
The discrete-time dynamics equations for our epidemic model are:

S_{t+1} - S_t = -\frac{S_t I_t}{T_{trans}(a, e)},   (1)

E_{t+1} - E_t = \frac{S_t I_t}{T_{trans}(a, e)} - \frac{E_t}{T_{inc}},   (2)

I_{t+1} - I_t = \frac{E_t}{T_{inc}} - \frac{I_t}{T_{inf}},   (3)

R_{t+1} - R_t = \frac{I_t}{T_{recover}},   (4)

D_{t+1} - D_t = \frac{I_t}{T_{fatal}},   (5)

in which 1/T_{inf} = 1/T_{recover} + 1/T_{fatal}, and the basic reproductive number can be obtained from R_0 = T_{inf} / T_{trans}. A typical SEIRD model as described above starts from a population that is mostly susceptible with a small fraction of infectious people. When R_0 > 1, each infected individual will, on average, infect more than one susceptible individual in its lifetime. Each susceptible individual eventually passes through the exposed and infectious states to the recovered or deceased states. A schematic diagram of the SEIRD model is shown in Figure 1.

Figure 1: A schematic diagram of the SEIRD model.

Generally in compartmental models, T_trans is a constant. However, in a real-world setting, transmission can be slowed (i.e., T_trans can be increased) through the deployment of non-pharmaceutical lock-down interventions. Note that there is no direct reduction in the infected population, as there is no cure or vaccination available. We only consider lock-down interventions that increase the transmission time of the virus according to their strength.

Compartmental models and their variants are commonly used in disease state forecasting and prediction. For concreteness, in this work, the state populations and the numerical values of the transmission parameters are based on the available data for the city of Mumbai (Group 2020). However, it is worth noting that both the intervention model and the planning algorithm we propose apply to most, if not all, of the SEIR model variants.

Intervention Modelling

As there is no cure or vaccine available for COVID-19 to date, models of pharmaceutical interventions are not applicable. To manage the rapid disease spread, different lock-down policies can be considered to limit individual contacts. For example, Ferguson et al. (2020) considered different levels of lock-down for the British population, such as case isolation at home, voluntary home quarantine, social distancing of those over 70 years of age, social distancing of the entire population, and closure of schools and universities, all of which have different cost and effectiveness.

These lock-down policies should enjoy several desiderata for real-world deployment. First, each type of intervention needs to last for a minimum duration d. Second, since a lock-down has economic cost, government bodies would expect a trade-off between the total budget B spent on planning and policy deployment and the public-health gains. To model such interventions, we consider lock-downs as a series of action choices: the decision maker can plan different policies in different time periods subject to the limitations mentioned above.

During the lock-down period in India, a change in the estimated transmission time has been observed as an effect of the interventions (Group 2020). This corresponds to T_trans in the SEIRD model we propose, and the effectiveness e varies across regions. We therefore model each action as extending the transmission time to a different degree at a different cost per day. Such a sequence of actions forms an intervention vector a of length T, i.e., a planning schedule with a total cost.
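To make the intervention model concrete, the following minimal sketch (not the authors' code; all parameter values, function names, and the fixed action cost are illustrative assumptions) simulates the discrete-time SEIRD updates in equations 1–5 with an action-dependent transmission time, using the linear cost-to-effectiveness mapping introduced later in equation 9.

```python
import numpy as np

def t_trans(cost, e, T_inf, R0):
    """Transmission time under a lock-down of normalized per-day cost c(a) in [0, 1].
    Uses the linear mapping T_trans(a, e) = (T_inf / R0) * (1 + e * c(a)) from equation 9."""
    return (T_inf / R0) * (1.0 + e * cost)

def seird_step(state, cost, p):
    """One discrete-time SEIRD update (equations 1-5); `state` holds population fractions."""
    S, E, I, R, D = state
    T_inf = 1.0 / (1.0 / p["T_recover"] + 1.0 / p["T_fatal"])  # 1/T_inf = 1/T_recover + 1/T_fatal
    Tt = t_trans(cost, p["e"], T_inf, p["R0"])
    new_exposed    = S * I / Tt            # S -> E
    new_infectious = E / p["T_inc"]        # E -> I
    new_recovered  = I / p["T_recover"]    # I -> R
    new_deceased   = I / p["T_fatal"]      # I -> D
    return (S - new_exposed,
            E + new_exposed - new_infectious,
            I + new_infectious - new_recovered - new_deceased,
            R + new_recovered,
            D + new_deceased)

# Illustrative rollout: 200 days under a fixed medium-strength lock-down (placeholder parameters).
params = {"e": 0.5, "T_inc": 5.0, "T_recover": 10.0, "T_fatal": 30.0, "R0": 2.5}
state = (0.999, 0.0, 0.001, 0.0, 0.0)
trajectory = [state]
for _ in range(200):
    state = seird_step(state, cost=0.5, p=params)
    trajectory.append(state)
infections = np.array([s[2] for s in trajectory])  # infection curve I(t) over the horizon
```

A planning schedule would replace the fixed `cost=0.5` with a per-day cost sequence l(a_t) chosen by the decision maker.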
Multi-Objective Functions

In the public health domain, governments and decision makers may want to achieve different objectives when deploying a policy. One direct objective could be eliminating the disease, which can be achieved by suppressing contacts so that patients recover at a rate greater than the rate of new infections; this is equivalent to minimizing the area under the infection curve. However, this is not achievable in many regions, including cities in many developing countries, due to the huge economic cost of such strict lock-downs. Thus, we focus on economically sustainable interventions that do not reduce R_0 below 1. In epidemic theory, this means the disease cannot be eliminated within a reasonable time no matter how the government plans the lock-down in these regions: every susceptible individual will eventually pass through the recovered or deceased state. In other words, although the infection curve will change, the area under the curve will remain the same.

To evaluate the effectiveness of lock-down policies under these circumstances, we use indirect objectives that are vital and achievable for sustainable interventions. For example, as there is limited hospital capacity, a patient's quality of life will likely be better when the infected population does not exceed that capacity and they can receive proper treatment. Alternatively, we may want to delay the infection to a point at which we have better system preparedness, medicines, resources, etc. for handling the disease. These different desired objectives can be described as a family of objective functions that are variants of the Quality Adjusted Life Year (QALY) score in our model, as elaborated below.

QALY is a popular, established metric for quantifying the effectiveness of health interventions, and it is often used in the public health literature (Salomon et al. 2012). It measures the effectiveness of an intervention by combining the quantity and quality of health improvement. Specifically, a person's life quality at any given time is mapped to [0, 1], with quality 1 corresponding to perfect health and 0 corresponding to death; different disease conditions lie in between depending on severity. QALY accumulates these measurements over time as its final score.

In this work, we change the time scale from years to days to adapt to the dynamics of the disease we are facing. We mainly focus on two objectives, burden and delay, defined as:

O_{Burden} = \sum_t \left( (I(t) - \lambda)\,\mathbb{1}_{I(t) > \lambda} - \delta_{Burden}\, l(a_t) \right),   (6)

O_{Delay} = \sum_t \left( t\, I(t) - \delta_{Delay}\, l(a_t) \right),   (7)

where t refers to the timestep; a_t, I(t), and l(a_t) refer to the action, the infected population fraction, and the cost of the action at time t, respectively; δ and λ refer to the economy-health weight and the hospital capacity; and 1 is the indicator function. Here, we focus on optimizing a linear combination of these two objective functions, written as

O_{mix}(w) = w\, \bar{O}_{Burden} + (1 - w)\, \bar{O}_{Delay},   (8)

where the weight w ranges from 0 to 1 and \bar{O} is O normalized by the absolute value of the no-intervention objective, i.e., we divide O by its absolute value in the absence of interventions.
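As a concrete reading of equations 6–8, the following sketch (illustrative only; the function names, signatures, and the choice of normalization trajectory are our assumptions, not the paper's code) evaluates the two objectives and their normalized mixture for an infection curve I(t), such as the one produced by the rollout above, and a per-day lock-down cost sequence l(a_t).

```python
import numpy as np

def burden_objective(I, costs, capacity, delta_burden):
    """O_Burden (equation 6): per-day excess of infections over hospital capacity,
    minus the economy-health-weighted lock-down cost."""
    I, costs = np.asarray(I), np.asarray(costs)
    excess = np.where(I > capacity, I - capacity, 0.0)
    return float(np.sum(excess - delta_burden * costs))

def delay_objective(I, costs, delta_delay):
    """O_Delay (equation 7): time-weighted infections minus the weighted lock-down cost."""
    I, costs = np.asarray(I), np.asarray(costs)
    t = np.arange(len(I))
    return float(np.sum(t * I - delta_delay * costs))

def mixed_objective(I, costs, I_free, w, capacity, delta_burden, delta_delay):
    """O_mix (equation 8): each objective is normalized by the absolute value it takes on
    the no-intervention trajectory I_free (which incurs zero lock-down cost)."""
    zero = np.zeros(len(I_free))
    o_b = burden_objective(I, costs, capacity, delta_burden)
    o_d = delay_objective(I, costs, delta_delay)
    o_b /= abs(burden_objective(I_free, zero, capacity, delta_burden))
    o_d /= abs(delay_objective(I_free, zero, delta_delay))
    return w * o_b + (1.0 - w) * o_d
```

Here `I_free` would be the same SEIRD rollout run with zero intervention cost, matching the normalization described after equation 8.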
Formulation Using MDP

Our lock-down control problem can be modeled as a Markov Decision Process (MDP) (Yang, Sun, and Narasimhan 2019). Over the last two decades, reinforcement learning (Sutton et al. 1998) has provided an effective framework for solving MDPs in both theory and application. This is especially true when the system dynamics are complicated or unknown, or when the state dimensionality is too high for classical optimal control methods. In addition, the environment parameters we estimate from real-world hospital data involve uncertainty that cannot be ignored, so the output policy needs to be robust to such uncertainty. We therefore consider a parameter-wise robust reinforcement learning model to solve the MDP. The MDP can be written as the tuple

\langle S, A, P, R \rangle

with state space S, action space A, transition distribution P(s' | s, a) for s, s' ∈ S and a ∈ A, vector reward r(s) ∈ R, and preference weight w ∈ R^n.

The states we consider are the fractions of the population present in the S, E, I, R, and D compartments at the given time. Furthermore, we consider several discrete actions at each time step, corresponding to lock-downs of different strengths and costs. It is natural to assume the strength to be monotonically increasing with the cost, as otherwise the action choice would be dominated by actions with less cost but more effectiveness. For simplicity, we adopt a linear mapping for both cost and effectiveness:

T_{trans}(a, e) = \frac{T_{inf}}{R_0}\, (1 + e\, c(a)),   (9)

with c : A → [0, 1], in which e is the lock-down effectiveness coefficient and R_0 the basic reproduction number when there are no lock-down interventions. Both of these are estimated with data from the city of Mumbai, India. T_trans and T_inf are the transmission and infection time periods.

For the remaining elements of the tuple, the transition distribution P(s' | s, a) is described by the disease transmission equations 1 to 5, and the total accumulated reward is exactly the objective function O in equations 6 and 7. The next section describes how the individual reward signals are distributed across states.

Reinforcement Learning Approach

Multiple Objectives: We have defined the state, action, transition probabilities, and total reward in the MDP section. The only missing piece for a complete reinforcement learning framework is to design the reward signal at every timestep. We have designed a framework that works not only for the two example objectives we focus on in this work, but for most variants of QALY.

Most variants of QALY, including our examples, are related to time and to the population in certain health states. We propose a function we call the QALY value, V(x, t), which is a function of the population x in a certain health state and the time t. We focus only on I, the infectious state, in these experiments; however, the QALY value function can be generalized to a vector form that includes multiple states. For controlling the hospital capacity, the function can be formulated either as a constant penalty for x exceeding the capacity or simply as a reward for x below the threshold, since the area under the infection curve is a constant, as elaborated in the section Multi-Objective Functions. We formalize this as:

V_{Burden}(x, t) = 1 \quad \text{for } x < \lambda.   (10)

As for delay, we formalize this function as

V_{Delay}(x, t) = \frac{t}{T}.   (11)

Given the QALY value V(x, t) of the objective function, the reward signal at any given time t can be calculated as r(t) = \int_0^{I(t)} V(x, t)\, dx. We can thus apply the reinforcement learning approach. One benefit of the proposed approach is that the QALY value function of the mixed objective can be easily calculated as:

V_{mix}(w) = \frac{w\, V_{Burden}}{O_{Burden}(\text{no action})} + \frac{(1 - w)\, V_{Delay}}{O_{Delay}(\text{no action})}.   (12)
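The per-step reward construction above can be read as in the following minimal sketch. It reflects our own assumptions (the midpoint discretization of the integral, the explicit horizon argument, and the omission of the lock-down cost term, which would be subtracted separately), not the paper's implementation.

```python
import numpy as np

def v_burden(x, t, capacity):
    """QALY value for the burden objective (equation 10): reward mass only below capacity."""
    return 1.0 if x < capacity else 0.0

def v_delay(x, t, horizon):
    """QALY value for the delay objective (equation 11): later infections earn more reward."""
    return t / horizon

def v_mix(x, t, w, capacity, horizon, o_burden_free, o_delay_free):
    """QALY value for the mixed objective (equation 12), normalized by the no-action objectives."""
    return (w * v_burden(x, t, capacity) / o_burden_free
            + (1.0 - w) * v_delay(x, t, horizon) / o_delay_free)

def step_reward(I_t, t, value_fn, n_grid=200):
    """Per-step reward r(t) = integral of V(x, t) over x in [0, I(t)],
    approximated with a midpoint Riemann sum over the infected fraction."""
    if I_t <= 0.0:
        return 0.0
    dx = I_t / n_grid
    xs = (np.arange(n_grid) + 0.5) * dx
    return float(sum(value_fn(x, t) for x in xs) * dx)
```

For the burden value function this integral evaluates to min(I(t), λ), i.e., the "reward for x below the threshold" variant, which is equivalent up to a constant to penalizing the excess in equation 6 because the area under the infection curve is fixed; for the delay value function it evaluates to t·I(t)/T, matching the time-weighted term in equation 7.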
Uncertainty: Another important aspect of the lock-down application, beyond the multi-objective function, is the uncertainty in the parameters (e, T_inf, T_inc), which are related to the infection curve directly or indirectly. We experiment with three approaches to analyze the effect of uncertainty in a reinforcement learning setup:

(1) Fixed RL (FRL): Train the RL agent using only the mean of the uncertain parameters.
(2) Distributed RL (DRL): Train the RL agent using samples of the uncertain parameters drawn from the estimated range.
(3) Adversarial RL (ARL): Inspired by (Pinto et al. 2017), train the RL agent together with an adversarial RL agent that maliciously picks the worst possible parameter set for the RL agent during training. Note that the worst-case parameters are not trivial to find, as they change with the policy. The action of the adversarial RL agent is set to be the discrete uncertain parameters of the disease model.
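The three schemes differ only in how environment parameters are chosen for each training episode. The sketch below is an illustration under our own assumptions: the environment and training-loop API, the candidate grid, and the direct policy-evaluation shortcut for the adversary are placeholders, whereas the paper trains a second RL agent to play the adversary.

```python
import random

def episode_parameters(mode, param_ranges, candidate_grid=None, policy_return=None):
    """Choose SEIRD parameters for one training episode.
    param_ranges: dict mapping parameter name -> (low, high) from the uncertainty estimation.
    candidate_grid: list of discrete parameter dicts available to the adversary (ARL only).
    policy_return: callable giving the current policy's return on a parameter dict (ARL only)."""
    if mode == "FRL":   # fixed: midpoint of every estimated range
        return {k: (lo + hi) / 2.0 for k, (lo, hi) in param_ranges.items()}
    if mode == "DRL":   # distributed: uniform sample from every estimated range
        return {k: random.uniform(lo, hi) for k, (lo, hi) in param_ranges.items()}
    if mode == "ARL":   # adversarial: pick the candidate on which the current policy does worst
        return min(candidate_grid, key=policy_return)
    raise ValueError(f"unknown mode: {mode}")

# Illustrative training loop (placeholder env/agent API):
# for episode in range(num_episodes):
#     params = episode_parameters("DRL", param_ranges)
#     state = env.reset(params)
#     ...roll out the protagonist policy and update it on the collected transitions...
```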
rameters based on the estimates given by public health ex- From Table 4, we observe that, as expected, as the weight perts and those cited in literature. We process the data with on Burden increases, the Burden objective becomes larger smoothing techniques to reduce the effect of bulk data en- for all methods in general. Similar behaviour is observed for try. Then, we search over the parameter space for parameter Delay as well. When applying this method for policy guid- sets that have a small aggregated RMSE loss between pre- ance, we can tune w to achieve the required objectives for dicted numbers and actual numbers using the Hyperopt li- both Burden and Delay. brary (Bergstra, Yamins, and Cox 2013). The parameter set giving the least loss value is taken to be the best-fit parame- ter set for the purposes of this experiment. Conclusions and Future Work We found that there are diverse parameter sets that have We implemented reinforcement learning on the lock-down loss close to the best-fit parameter set. Thus, we picked all policy optimization problem for COVID-19 while consider- parameter sets that have a loss within a certain range of the ing important real-world aspects like robustness and multi- best loss (within 10%). Among all picked parameter sets, objective optimization. Robustness can be achieved by intro- we find the range of values taken by individual parameters. ducing an adversarial agent for parameter discovery, but at These ranges for individual parameters give us a measure the cost of sacrificing some performance on average. For the of uncertainty for these parameters. We assume a uniform multi-objective mixture, we study the trade-off between con- distribution over these ranges as our parameter distribution. trolling hospital capacity and delaying the infection spread. We proposed a reward distribution framework for the rein- Analysis and Results forcement learning agent to shift from one objective to an- Robustness: Robustness of policy to uncertainty in parame- other in the lock-down problem. One point to note is that our ters is an important aspect. Over the estimated uniform dis- epidemiological model (SEIRD) is a homogeneous model tribution range, we find the worst-case parameters for dif- and is being used to optimize the policy keeping the trade-off ferent methods using a fine grid-search. Then, we measure between economy and health for the community as a whole. Model w Burden Delay Mixed References Random -1.074 0.675 0.675 FRL -1.372 1.478 1.478 Ball, F. G.; Knock, E. S.; and O’Neill, P. D. 2015. Stochas- 0.0 tic epidemic models featuring contact tracing with delays. DRL -1.155 1.711 1.711 ARL -1.442 1.727 1.727 Mathematical biosciences 266: 23–35. Random -1.230 0.843 0.318 Bannur, N.; Maheshwari, H.; Jain, S.; Shetty, S.; Merugu, FRL -1.017 1.154 0.611 S.; and Raval, A. 2021. Adaptive COVID-19 Forecasting 0.25 DRL -1.086 1.673 0.983 via Bayesian Optimization. In Proceedings of the ACM ARL -0.993 1.103 0.579 India Joint International Conference on Data Science and Random -1.073 0.641 -0.216 Management of Data, CoDS-COMAD ’21. New York, NY, FRL -0.622 1.207 0.293 USA: Association for Computing Machinery. doi:10.1145/ 0.5 DRL -0.738 1.196 0.229 3430984.3431047. URL https://doi.org/10.1145/3430984. ARL -0.912 1.208 0.148 3431047. Random -1.019 0.670 -0.597 FRL -0.550 1.090 -0.140 Bergstra, J.; Yamins, D.; and Cox, D. 2013. 
Analysis and Results

Robustness: Robustness of the policy to uncertainty in the parameters is an important consideration. Over the estimated uniform distribution range, we find the worst-case parameters for each method using a fine grid search. We then measure the performance of each method on its corresponding worst-case parameters, and we also compute the corresponding average performance over the parameter distribution. The results are tabulated in Tables 2 and 3. As these tables show, ARL helps the reinforcement learning agent discover risky parameters and thus performs best in its worst-case scenario. In the average case, however, ARL performs worse than the best method (FRL and DRL, respectively). This demonstrates the trade-off between performance and robustness in our lock-down problem: at the cost of average performance, we can obtain better worst-case performance.

Table 2: Model performance on the Burden objective in equation 6.
Model    Worst    Mean     Std
Random   -3.625   -2.030   0.500
FRL      -2.385   -1.234   0.372
DRL      -2.445   -1.313   0.467
ARL      -2.226   -1.279   0.385

Table 3: Model performance on the Delay objective in equation 7.
Model    Worst      Mean      Std
Random   -759.522   2.110     186.413
FRL      -29.486    163.403   39.201
DRL      -100.538   235.391   110.176
ARL      9.933      189.307   50.813

Different objectives: We use different weights between the Burden and Delay objectives and compare the results to the cases where we focus on Delay or Burden individually, in Table 4. The objective function we use is (1 − w) · Delay + w · Burden for different values of w; the aim is to maximize the normalized objective for both Burden and Delay.

From Table 4, we observe that, as expected, as the weight on Burden increases, the Burden objective generally becomes larger for all methods. Similar behaviour is observed for Delay as well. When applying this method for policy guidance, we can tune w to achieve the required objectives for both Burden and Delay.

Table 4: Model performance for mixed objectives. The scores are calculated based on equation 12.
w      Model    Burden   Delay   Mixed
0.0    Random   -1.074   0.675   0.675
       FRL      -1.372   1.478   1.478
       DRL      -1.155   1.711   1.711
       ARL      -1.442   1.727   1.727
0.25   Random   -1.230   0.843   0.318
       FRL      -1.017   1.154   0.611
       DRL      -1.086   1.673   0.983
       ARL      -0.993   1.103   0.579
0.5    Random   -1.073   0.641   -0.216
       FRL      -0.622   1.207   0.293
       DRL      -0.738   1.196   0.229
       ARL      -0.912   1.208   0.148
0.75   Random   -1.019   0.670   -0.597
       FRL      -0.550   1.090   -0.140
       DRL      -0.818   1.159   -0.324
       ARL      -0.703   0.932   -0.294
1.0    Random   -1.193   0.587   -1.193
       FRL      -0.734   1.400   -0.734
       DRL      -0.620   1.014   -0.620
       ARL      -0.671   1.001   -0.671

Conclusions and Future Work

We applied reinforcement learning to the lock-down policy optimization problem for COVID-19 while considering important real-world aspects such as robustness and multi-objective optimization. Robustness can be achieved by introducing an adversarial agent for parameter discovery, but at the cost of sacrificing some average-case performance. For the multi-objective mixture, we study the trade-off between controlling hospital capacity and delaying the infection spread, and we propose a reward distribution framework that lets the reinforcement learning agent shift from one objective to another in the lock-down problem. One point to note is that our epidemiological model (SEIRD) is homogeneous and is used to optimize the policy by balancing the trade-off between economy and health for the community as a whole. The model does not discriminate between two infected individuals based on their economic contribution, nor is it capable of doing so; this helps ensure that the generated lock-down policy is as fair as possible.

A future direction of this work is to gather more data on both the cost and the effectiveness of real-world lock-down policies at the community scale, so that a more complex model can better estimate real-world scenarios. For example, transmission times are known not to be homogeneous, and several super-spreader events have been identified across different spreading routes. Collecting data on such cases and modifying the model to allow different transmission times for different modes of spread would give a more holistic view of the entire scenario. Another important extension would be estimating the reporting rate from other data sources and normalizing the reported numbers to estimate parameters that are closer to the real world.

Acknowledgements

This study is made possible by the generous support of the American People through the United States Agency for International Development (USAID) and the Army Research Office (ARO). The work described in this article was implemented under the TRACETB Project, managed by WIAI under the terms of Cooperative Agreement Number 72038620CA00006, and by Teamcore, CRCS, Harvard University under Multidisciplinary University Research Initiative grant number W911NF1810208. The contents of this manuscript are the sole responsibility of the authors and do not necessarily reflect the views of USAID, ARO or the United States Government.

References

Ball, F. G.; Knock, E. S.; and O'Neill, P. D. 2015. Stochastic epidemic models featuring contact tracing with delays. Mathematical Biosciences 266: 23–35.

Bannur, N.; Maheshwari, H.; Jain, S.; Shetty, S.; Merugu, S.; and Raval, A. 2021. Adaptive COVID-19 Forecasting via Bayesian Optimization. In Proceedings of the ACM India Joint International Conference on Data Science and Management of Data, CoDS-COMAD '21. New York, NY, USA: Association for Computing Machinery. doi:10.1145/3430984.3431047. URL https://doi.org/10.1145/3430984.3431047.

Bergstra, J.; Yamins, D.; and Cox, D. 2013. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In International Conference on Machine Learning, 115–123.

Bernoulli, D.; and Blower, S. 2004. An attempt at a new analysis of the mortality caused by smallpox and of the advantages of inoculation to prevent it. Reviews in Medical Virology 14(5): 275–288.

Carpin, S.; Chow, Y.-L.; and Pavone, M. 2016. Risk aversion in finite Markov Decision Processes using total cost criteria and average value at risk. In 2016 IEEE International Conference on Robotics and Automation (ICRA), 335–342. IEEE.

Censor, Y. 1977. Pareto optimality in multiobjective problems. Applied Mathematics and Optimization 4(1): 41–59.

Chow, Y.; Ghavamzadeh, M.; Janson, L.; and Pavone, M. 2017. Risk-constrained reinforcement learning with percentile risk criteria. The Journal of Machine Learning Research 18(1): 6070–6120.

Christiano, P.; Shah, Z.; Mordatch, I.; Schneider, J.; Blackwell, T.; Tobin, J.; Abbeel, P.; and Zaremba, W. 2016. Transfer from simulation to real world through learning deep inverse dynamics model. arXiv preprint arXiv:1610.03518.

Deb, K. 2014. Multi-objective optimization. In Search Methodologies, 403–449. Springer.

Ferguson, N.; Laydon, D.; Nedjati-Gilani, G.; Imai, N.; Ainslie, K.; Baguelin, M.; Bhatia, S.; Boonyasiri, A.; Cucunubá, Z.; Cuomo-Dannenburg, G.; et al. 2020. Report 9: Impact of non-pharmaceutical interventions (NPIs) to reduce COVID19 mortality and healthcare demand. Imperial College London 10: 77482.

Ganesh, A.; Massoulié, L.; and Towsley, D. 2005. The effect of network topology on the spread of epidemics. In Proceedings IEEE 24th Annual Joint Conference of the IEEE Computer and Communications Societies, volume 2, 1455–1466. IEEE.

Group, C.-. I. O. D. O. 2020. COVID19-India API. Accessed on yyyy-mm-dd from https://api.covid19india.org/.

Hoffmann, J.; and Caramanis, C. 2018. The Cost of Uncertainty in Curing Epidemics. Proceedings of the ACM on Measurement and Analysis of Computing Systems 2(2): 31.

Mannor, S.; Simester, D.; Sun, P.; and Tsitsiklis, J. N. 2004. Bias and variance in value function estimation. In Proceedings of the Twenty-First International Conference on Machine Learning, 72.

Mihatsch, O.; and Neuneier, R. 2002. Risk-sensitive reinforcement learning. Machine Learning 49(2-3): 267–290.

Nilim, A.; and El Ghaoui, L. 2004. Robustness in Markov decision problems with uncertain transition matrices. In Advances in Neural Information Processing Systems, 839–846.

Nilim, A.; and El Ghaoui, L. 2005. Robust control of Markov decision processes with uncertain transition matrices. Operations Research 53(5): 780–798.

Pan, X.; You, Y.; Wang, Z.; and Lu, C. 2017. Virtual to real reinforcement learning for autonomous driving. arXiv preprint arXiv:1704.03952.

Pattanaik, A.; Tang, Z.; Liu, S.; Bommannan, G.; and Chowdhary, G. 2017. Robust deep reinforcement learning with adversarial attacks. arXiv preprint arXiv:1712.03632.

Pinto, L.; Davidson, J.; and Gupta, A. 2017. Supervision via competition: Robot adversaries for learning tasks. In 2017 IEEE International Conference on Robotics and Automation (ICRA), 1601–1608. IEEE.

Pinto, L.; Davidson, J.; Sukthankar, R.; and Gupta, A. 2017. Robust adversarial reinforcement learning. arXiv preprint arXiv:1703.02702.

Roijers, D. M.; Vamplew, P.; Whiteson, S.; and Dazeley, R. 2013. A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research 48: 67–113.

Salomon, J. A.; Vos, T.; Hogan, D. R.; Gagnon, M.; Naghavi, M.; Mokdad, A.; Begum, N.; Shah, R.; Karyana, M.; Kosen, S.; et al. 2012. Common values in assessing health outcomes from disease and injury: disability weights measurement study for the Global Burden of Disease Study 2010. The Lancet 380(9859): 2129–2143.

Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529(7587): 484–489.

Sun, C.; and Hsieh, Y.-H. 2010. Global analysis of an SEIR model with varying population size and vaccination. Applied Mathematical Modelling 34(10): 2685–2697.

Sutton, R. S.; et al. 1998. Introduction to Reinforcement Learning, volume 135. MIT Press, Cambridge.

Van Moffaert, K.; and Nowé, A. 2014. Multi-objective reinforcement learning using sets of Pareto dominating policies. The Journal of Machine Learning Research 15(1): 3483–3512.

Wang, N. 2005. Modeling and analysis of massive social networks. Ph.D. thesis, UMD.

Weitz, J. S.; and Dushoff, J. 2015. Modeling post-death transmission of Ebola: challenges for inference and opportunities for control. Scientific Reports 5: 8751.

White III, C. C.; and Eldeib, H. K. 1994. Markov decision processes with imprecise transition probabilities. Operations Research 42(4): 739–749.

Yang, R.; Sun, X.; and Narasimhan, K. 2019. A generalized algorithm for multi-objective reinforcement learning and policy adaptation. In Advances in Neural Information Processing Systems, 14636–14647.

Zhang, Y.; and Prakash, B. A. 2015. Data-aware vaccine allocation over large networks. ACM Transactions on Knowledge Discovery from Data (TKDD) 10(2): 20.