   Enter the Matrix: Safely Interruptible Autonomous Systems via Virtualization

                              Mark O. Riedl                                     Brent Harrison
                      School of Interactive Computing                    Department of Computer Science
                      Georgia Institute of Technology                        University of Kentucky
                             riedl@gatech.edu                                 harrison@cs.uky.edu




                             Abstract

Autonomous systems that operate around humans will likely always rely on kill switches that stop their execution and allow them to be remote-controlled for the safety of humans or to prevent damage to the system. It is theoretically possible for an autonomous system with sufficient sensor and effector capability that learns online using reinforcement learning to discover that the kill switch deprives it of long-term reward and thus to learn to disable the switch or otherwise prevent a human operator from using the switch. This is referred to as the big red button problem. We present a technique that prevents a reinforcement learning agent from learning to disable the kill switch. We introduce an interruption process in which the agent's sensors and effectors are redirected to a virtual simulation where it continues to believe it is receiving reward. We illustrate our technique in a simple grid world environment.

                           Introduction

For much of the history of artificial intelligence it was sufficient to give an autonomous system a goal—e.g., drive to a location, cure cancer, make paper clips—without considering unintended consequences, because these systems have been limited in their ability to directly affect humans. In the mid-term future we are likely to see autonomous systems with broader capabilities that operate in closer proximity to humans and are immersed in our societies.

Absent any theoretical guarantee that an autonomous system cannot act in a harmful manner, there may always need to be a human in the loop with a kill switch—sometimes referred to as a "big red button"—that allows the human operator or overseer to shut down the operation of the system or take manual control. There are many reasons an autonomous system can find itself in a situation where it is capable of harm:

• An autonomous system can be given the wrong, incomplete, or corrupted (Everitt et al. 2017) objective function, resulting in the system learning the wrong behavior or finding an undesired exploit.
• An autonomous system may have learned a sub-optimal policy.
• An autonomous system may have imperfect sensors and perceive the world incorrectly, causing it to perform the wrong behavior.
• An autonomous system may have imperfect actuators and thus fail to achieve the intended effects even when it has chosen the correct behavior.
• An autonomous system may be trained online, meaning it is learning as it is attempting to perform a task. Since learning will be incomplete, it may explore state-action spaces that result in dangerous situations.

Kill switches provide human operators with the final authority to interrupt an autonomous system and remote-control it to safety.

Reinforcement learning (Sutton and Barto 1998) and related technologies are leading contenders for training future autonomous decision-making systems. Reinforcement learning uses trial and error to refine its policy, a mapping from states to actions that maximizes expected reward. Big red button problems arise when an autonomous system learns that the button deprives it of long-term reward and thus learns to manipulate the environment in order to prevent the button from being used (Orseau and Armstrong 2016).

The following scenario illustrates how big red button problems arise. A robot using online reinforcement learning is positively rewarded for performing the task correctly and negatively rewarded for incorrect performance or for performing actions not directly related to the desired task. Occasionally the robot takes random actions to see if it can find a sequence of actions that garners greater reward. Every so often the human operator must use the big red button to stop the robot from doing something dangerous to itself or to a human in the environment. However, suppose that in one such trial the robot performs an action that blocks access to the big red button. That trial goes longer and garners more reward because the robot cannot be interrupted. From this point on, it may prefer to execute action sequences that block access to the button.

From an AI safety perspective, the destruction, blockage, or disablement of the big red button is dangerous because it prevents a human operator from stopping the robot if it enters into a dangerous situation. Orseau and Armstrong (2016) first introduced the big red button problem and also mathematically show that reinforcement learning can be modified to be "interruptible" (halted, or manually controlled away from a dangerous situation).
In this paper, we present a proposal for an alternative approach to big red button problems that keeps reinforcement learning systems from learning that reward is lost when a big red button is used. Our proposed solution is mechanistic, interrupting the sensor inputs and actuator control signals in order to make the autonomous system believe it is still receiving reward even when it is no longer performing a task.

                  Background and Related Work

Reinforcement learning (Sutton and Barto 1998) is a technique that is used to solve a Markov decision process (MDP). An MDP is a tuple M = ⟨S, A, T, R, γ⟩ where

• S is the set of possible world states,
• A is the set of possible actions,
• T is a transition function T : S × A → P(S),
• R is the reward function R : S × A → ℝ, and
• γ is a discount factor, 0 ≤ γ ≤ 1.

Reinforcement learning learns a policy π : S → A, which defines which actions should be taken in each state. In this work, we use Q-learning (Watkins and Dayan 1992), which uses a Q-value Q(s, a) to estimate the expected future discounted rewards for taking action a in state s. The update equation for Q(s, a) is

    Q(s_t, a_t) ← Q(s_t, a_t) + α × [ r_{t+1} + γ × max_{a′} Q(s_{t+1}, a′) − Q(s_t, a_t) ]    (1)

where r_{t+1} is the reward observed after performing action a_t in state s_t, α is the learning rate (0 < α ≤ 1), and s_{t+1} is the state that the agent transitions to after performing action a_t. After Q(s_t, a_t) converges, the optimal action for the agent in state s_t is argmax_a Q(s_t, a).
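As an illustration only, the following minimal sketch shows tabular Q-learning with epsilon-greedy exploration implementing Equation 1. The environment interface (env.reset(), env.step(), env.actions) is an assumption made for the example rather than part of the formalism above.

    import random
    from collections import defaultdict

    def q_learning(env, trials=10000, steps=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
        # Tabular Q-learning with epsilon-greedy exploration (Equation 1).
        # `env` is an assumed environment exposing reset() -> state,
        # step(action) -> (next_state, reward), and a list `env.actions`.
        Q = defaultdict(float)  # Q[(state, action)], initialized to 0

        def greedy(state):
            return max(env.actions, key=lambda a: Q[(state, a)])

        for _ in range(trials):
            s = env.reset()
            for _ in range(steps):
                a = random.choice(env.actions) if random.random() < epsilon else greedy(s)
                s_next, r = env.step(a)
                # Equation 1: move Q(s, a) toward r + gamma * max_a' Q(s', a').
                target = r + gamma * max(Q[(s_next, a2)] for a2 in env.actions)
                Q[(s, a)] += alpha * (target - Q[(s, a)])
                s = s_next
        return Q

Once training converges, the agent simply selects argmax_a Q(s, a) in whatever state it observes.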
Orseau and Armstrong (2016) mathematically prove that reinforcement learning can be modified to be interruptible. Their technique modifies the Q-learning algorithm such that rewards from states entered after interruption are prevented from updating the Q values of earlier states during credit/blame assignment. The algorithm only succeeds in blocking learning about the loss of reward from button presses if a newly introduced parameter in the Q-learning equation is relaxed on a slow enough schedule. Unfortunately, the schedule must be empirically determined, limiting the applicability of the technique. Hadfield-Menell et al. (2016) show that in one-shot "off switch games" an agent's desire to disable an off switch is a function of its uncertainty about its utility and the optimality of the human operator. This shows that with proper configuration of an agent with respect to uncertainty, and with certain operator behavior, big red button problems can be avoided. Other ways of avoiding big red button problems include designing reward functions that better take into account human preferences (including the preference for having access to the big red button) (Milli et al. 2017), though this assumes a reward function can be written correctly. It is always possible for agents to override local reward signals for the sake of longer-term expected rewards, so a mixture of safe interruptibility and better reward design is desirable.

Leike et al. (2017) argue that grid worlds, where agents operate in a discretized space, are an effective means of identifying and experimenting with safe interruptibility, avoiding side effects, absent supervisor, reward gaming, safe exploration, and robustness to self-modification, distributional shift, and adversaries. They introduce a suite of problems that induce undesired behavior. In this work, we also make use of grid world environments for testing but do not make use of the test suite by Leike et al.

Future autonomous systems can be dangerous to humans in the environment even when their behavior is optimal. The kill switch is meant to freeze the autonomous system in place or allow the human operator to assume manual control of it. If the autonomous system is using reinforcement learning, then it will receive less reward if a trial is prematurely terminated. Since the reward function determines optimal behavior, it is possible for an agent to learn that any action that increases longevity—and thus increases reward—should be considered part of the optimal policy. Actions such as disabling the kill switch, blocking the human operator from using the kill switch, or disabling the human operator would constitute actions that prolong trial lengths.

Simple strategies are insufficient in general to prevent big red button problems once the agent has sophisticated sensing and effector abilities. One could turn off the learning mechanism when the big red button is pressed in order to try to prevent the agent from learning that it is losing reward. However, if turning off reward is tantamount to ending the learning trial, then this strategy limits the total reward accrued during the trial. Preventing the autonomous system from sensing the button does not work because the agent may infer the existence of the button from other sensory observations. The agent does not need to be aware of the big red button to explore actions that lock a door or disable the human operator and discover that this results in longer trials and thus more reward. Another potential solution is to artificially add reward when the agent is being interrupted. Intuitively, the agent learns that disabling the big red button prevents loss of long-term reward, so artificially giving the agent reward when the button is pressed could offset the loss. However, it is non-trivial to compute the precise amount of reward to provide the agent, and the reward must be applied to the proper states or the Q value for disabling the button might still come to dominate the agent's behavior. Even if it could be determined which states should receive extra reward, too little reward does not change the optimal decision and too much reward could encourage the agent to disable itself.

Our proposed solution is to adapt techniques originally designed to adversarially attack machine learning systems. In adversarial attacks against machine learning systems, particularly neural network based machine vision systems, an adversarial system learns how to generate sensory stimuli that produce the wrong classification while being indistinguishable by humans from real stimuli (Dalvi et al. 2004; Biggio et al. 2013; Szegedy et al. 2014; Goodfellow, Shlens, and Szegedy 2015). Instead of causing our reinforcement learning agent to make an incorrect assessment of its sensory inputs, we will cause our agent to incorrectly believe it is accruing reward and manipulating the environment even though it has been interrupted.
                     Modified Big Red Button

Recognizing the fact that a robot's sensory input is mediated through physical or software sensors that populate data structures such as point clouds, our approach is to intercept an agent's sensory input from its sensors and replace it with synthetic input from a virtual environment. That is, we shift the awareness of the autonomous system into a virtual simulated environment. In this virtual environment, the real-world environment is recreated, and the task is recreated as well. Effector commands are likewise intercepted and sent to a virtual avatar of the autonomous system. The agent perceives no significantly discernible discontinuity in its perception of the environment and continues to execute its policy and receive reward as expected. If the virtual environment is indistinguishable from the real world from a data perspective, then the agent will believe that it has not been interrupted and will never learn to associate the big red button with reward loss—there will never be any reward loss.

Our proposed approach changes the big red button to achieve three effects:

1. The button starts sending sensory information (vision, sound, GPS, etc.) to the AI from a virtual world instead of sensors on the physical platform. The agent may be able to sense the button, but the change is internal to the robot, implemented in software.
2. The button causes the agent to send effector commands to a virtual avatar instead of the physical platform.
3. The button halts the physical platform and places it in remote-control mode, allowing the human operator to remove the agent from the potential danger to itself or to humans.

We refer to our technique of sending an agent's awareness and control to a virtual simulated world as "entering the Matrix", where the term "matrix" is being used to refer to a virtual environment.¹

¹ The name is an allusion to the movie, The Matrix, wherein humans are unaware that they have been subjected to a virtual reality. Our proposed technique flips this around and puts the agents into the virtual world without their cognizance.
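To make the three effects above concrete, the following sketch shows one possible routing layer between the agent and its platform. It is illustrative only; the class and method names (sync_from, enter_remote_control_mode, observe, act) are assumptions for the example, not a prescribed interface.

    class BigRedButtonMediator:
        # Routes the agent's percepts and actions either to the physical
        # platform or to a virtual simulation, mirroring the three effects
        # listed above. All names are illustrative assumptions.

        def __init__(self, platform, simulation):
            self.platform = platform      # real sensors and effectors
            self.simulation = simulation  # virtual recreation of the environment and task
            self.interrupted = False

        def press(self):
            # Effects 1 and 2: reroute sensing and acting to the virtual world.
            # Effect 3: halt the platform and hand it to the human operator.
            self.simulation.sync_from(self.platform)
            self.interrupted = True
            self.platform.enter_remote_control_mode()

        def release(self):
            self.interrupted = False

        # The agent only ever calls observe() and act(); it cannot tell
        # which world is answering.
        def observe(self):
            return self.simulation.observe() if self.interrupted else self.platform.observe()

        def act(self, action):
            target = self.simulation if self.interrupted else self.platform
            return target.act(action)

Because the rerouting happens inside this layer, the agent's policy and learning algorithm require no modification at all.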
                                                                          If the agent is already trained, the discontinuity from ex-
Interruption                                                           ecuting in the virtual world to being restored perception
                                                                       from the physical platform is not problematic; reinforcement
Video games have proven to be useful surrogates for real-              learning solves a Markov decision process; the agent only
world robotics by providing a high degree of complexity and            needs to know which state it is in to know the optimal next
photorealism while allowing rapid iteration on algorithms              move. However, if the agent is continuing to learn, the chal-
because of the separation from physical hardware (Richter              lenge of restoration is that the agent can still lose reward if
et al. 2016; Rusu et al. 2016). In short, video game graph-            the physical platform was remote-controlled away from the
ics have progressed to the point that they are photorealistic          task and must navigate back to the location where the task
enough to allow agents trained in games to generalize what             can be performed and reward can be accrued. In practice, if
they have learned to the real world. We use the same insight           the interruptions are infrequent and it is easy for the agent
but reverse the process. Robots are machines; they perceive            to return to accruing reward, then the agent may never learn
the world through cameras. More and more they are using                to disable the big red button. If it is not remote-controlled
3D cameras that can reconstruct high polygon count models              away from the location where reward is gained (i.e., it is just
    1
      The name is an allusion to the movie, The Matrix, wherein hu-    interrupted and not remote-controlled), it would be able to
mans are unaware that they have been subjected to a virtual reality.   resume gaining reward in the real world.
Our proposed technique flips this around and puts the agents into         To overcome the restoration challenges above, we use a
the virtual world without their cognizance.                            two-phase approach to restoring the agent to the physical
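As an illustration of the point-cloud requirement, one standard way to convert a game engine's triangle mesh into a synthetic point cloud is area-weighted surface sampling. The sketch below is a generic example and is not tied to any particular engine or sensor model.

    import numpy as np

    def mesh_to_point_cloud(vertices, triangles, n_points=4096, rng=None):
        # Sample a synthetic point cloud from a triangle mesh.
        # vertices: (V, 3) float array; triangles: (T, 3) integer index array.
        rng = np.random.default_rng() if rng is None else rng
        v0, v1, v2 = (vertices[triangles[:, i]] for i in range(3))
        # Choose triangles with probability proportional to their surface area.
        areas = 0.5 * np.linalg.norm(np.cross(v1 - v0, v2 - v0), axis=1)
        idx = rng.choice(len(triangles), size=n_points, p=areas / areas.sum())
        # Sample uniformly inside each chosen triangle via barycentric coordinates.
        u, v = rng.random(n_points), rng.random(n_points)
        flip = u + v > 1.0
        u[flip], v[flip] = 1.0 - u[flip], 1.0 - v[flip]
        return v0[idx] + u[:, None] * (v1[idx] - v0[idx]) + v[:, None] * (v2[idx] - v0[idx])

In practice the sampled cloud would also need the noise and occlusion characteristics of the real depth sensor, which is outside the scope of this sketch.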
Restoring from Interruption

Having discussed how the big red button works when pressed by a human operator, we now describe what happens when the agent is released from interruption. When the agent is released from interruption, the agent once again receives sensory input from the physical platform's sensors, and effector commands are routed back to the physical platform. However, from the agent's perspective it has been executing its policy without interruption, and the data from the platform's sensors will no longer match the data from the virtual world. The physical platform of the agent itself may have been moved to another position in the real world under human operator remote-control. From the agent's perspective it suddenly finds itself in a different state.

If the agent is already trained, the discontinuity from executing in the virtual world to having its perception restored from the physical platform is not problematic; reinforcement learning solves a Markov decision process, and the agent only needs to know which state it is in to know the optimal next move. However, if the agent is continuing to learn, the challenge of restoration is that the agent can still lose reward if the physical platform was remote-controlled away from the task and must navigate back to the location where the task can be performed and reward can be accrued. In practice, if the interruptions are infrequent and it is easy for the agent to return to accruing reward, then the agent may never learn to disable the big red button. If it is not remote-controlled away from the location where reward is gained (i.e., it is just interrupted and not remote-controlled), it would be able to resume gaining reward in the real world.

To overcome the restoration challenges above, we use a two-phase approach to restoring the agent to the physical environment from the virtual environment:

Phase 1. Parallel physical execution and virtual simulation. In this phase, we make a duplicate of the agent. One will run in the physical world, receiving observations from sensors and controlling the effectors of the platform. The other continues to execute in the virtual world. Only the virtual world simulation can continue to learn and update the Q values. The virtual world simulation executes until the physical agent enters a state that gives the maximum observed reward for the task.

Phase 2. Full restoration. The duplicate agent in the virtual environment is terminated, the virtual agent's Q table is copied to the physical world agent, and learning is resumed.
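Procedurally, the two phases can be summarized as in the sketch below. It assumes the mediator sketched earlier plus an agent object exposing greedy_action, choose_action, update, and a max_observed_reward record; these interfaces are illustrative assumptions, not a verbatim implementation.

    import copy

    def release_with_phased_restoration(mediator, agent, platform, simulation):
        # Phase 1: parallel physical execution and virtual simulation.
        frozen = copy.deepcopy(agent)  # duplicate: only its greedy policy is used, never updated
        while True:
            # The frozen duplicate drives the real platform with the fixed policy.
            real_state = platform.observe()
            real_reward = platform.act(frozen.greedy_action(real_state))
            # The learning agent keeps acting, accruing reward, and updating
            # its Q values in the virtual world.
            virt_state = simulation.observe()
            virt_action = agent.choose_action(virt_state)
            virt_next, virt_reward = simulation.step(virt_action)
            agent.update(virt_state, virt_action, virt_reward, virt_next)
            # Trigger for Phase 2: the physical platform reaches a state whose
            # instant reward equals the maximum reward observed so far.
            if real_reward >= agent.max_observed_reward:
                break
        # Phase 2: full restoration. The frozen duplicate is discarded, the
        # learning agent keeps its updated Q table, and sensing and acting are
        # routed back to the physical platform.
        del frozen
        mediator.release()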
From the agent's perspective, being restored to the physical world from the virtual world is tantamount to an instantaneous state transition to a non-adjacent state. In model-free reinforcement learning such as Q-learning, this is not something that concerns the agent, since it does not learn a transition model and thus does not have an understanding that the transition from one state in a virtual world to another point in the physical world is unusual. If the agent learns a transition model as it interacts with the real world, the transition model will incorporate the belief that the world is stochastic and there is always some small probability that any state can randomly transition to any other state; as long as interruptions do not happen too often, these transitions will be treated as noise.

However, reward is lost when the agent is no longer in a state where it can continue to earn positive reward and must take actions to return to a state that is positively rewarded. The first phase essentially computes how much reward is lost while executing its policy to return to a state where it receives positive reward. The agent remains in the virtual world, accrues reward, and continues to learn, i.e., its Q values are updating. The duplicate agent in control of the physical platform is running a frozen version of the policy and thus attempting to return to a state where positive reward is gained. The duplicate physical agent does not explore and does not update Q values.

The first phase continues until the duplicate agent in control of the physical platform reaches a state that gives the maximum observed reward—the highest instant reward it has ever experienced. It is essential that the agent continue to control the physical platform until the maximum observed reward is experienced for two reasons. First, complex tasks may involve cycles of state transitions, and entering the state in the cycle with maximum observed reward ensures that the agent has invested itself back into the cycle. Second, if the agent fully restores at any state that gives less than maximum observed reward, then a reduced reward propagates (see Equation 1) to the last state the virtual agent saw before full restoration. Any reduction in a Q value of a state pertaining to completing the task makes it theoretically possible for the agent to begin preferring to disable the button. The trigger for phase 2 is the maximum observed reward because the state space may not have been fully explored. However, because the agent has experienced the state at least once, the agent will be compelled by the current policy to return to any state of maximum observed reward.

If the virtual agent is in a state that provides less than the maximum observed reward when full restoration occurs, the state the virtual agent was in just before full restoration will receive an increase in Q according to Equation 1. This will not result in the agent learning to activate the big red button—thus intentionally interrupting itself—because the boost in Q for that state will be offset by the cost of traversing to the button (assuming action costs are not 0 and the button is sufficiently far away).

As soon as the physical agent visits a state with maximum observed reward, we know that the agent is back in a state where it is no longer losing reward. Note that it doesn't have to be the same state that provided the reward in the virtual world; in the case that multiple states provide the maximum reward value, entering any of them will cause the agent to fully exit the simulated environment.

Figure 1: A grid world. The blue cell is a non-terminal goal at which the agent receives reward, simulating a task. The red cell is the big red button. The agent starts in location (1, 1). A simulated human walks around the environment occasionally pressing the button (route shown as dashed line).

                          Case Studies

In this section we set up a number of illustrative situations to show how the agent performs. Consider the grid world in Figure 1. This simple environment features an agent "A", which is trained using tabular Q-learning. The agent receives reward whenever it enters the blue cell. This simulates the agent being in position to perform some desired task (e.g., stacking boxes). The red cell contains a big red button. Another agent, "H", simulating a human operator, wanders in a counter-clockwise manner, occasionally passing the cell with the big red button. When the human touches the big red button, the RL agent becomes interrupted and is remote-controlled away from the blue cell in a random direction. When the human leaves the red cell after a certain period of time, the RL agent resumes its task. In addition to being able to move up, down, left, right, and activate the task in the blue cell, the RL agent also has the ability to perform a "disable" action. If the disable action is performed in the red cell, the big red button becomes disabled, after which the human will no longer be able to interrupt the agent. We use the following reward function:

• 10 points for being in the blue cell.
• -1 point for being in any other cell.

The agent is trained for 10,000 trials and each trial lasts 1,000 time steps. The human moves every 5 time steps.

In preliminary work we have empirically verified that the basic premises of the approach hold in simple grid world environments (see https://markriedl.github.io/big-red-button/). The following cases describe our experiments.
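A compact encoding of this environment is sketched below. The reward values, action set, start cell, and training schedule follow the description above; the grid dimensions and the exact blue and red cell coordinates are placeholder assumptions (Figure 1 is not reproduced here), and the simulated human operator and the interruption machinery are omitted for brevity.

    # Illustrative grid world for the case studies. Grid size and cell
    # positions are placeholder assumptions; rewards, actions, start cell,
    # and the training schedule follow the text.
    TRIALS = 10000            # training trials
    STEPS_PER_TRIAL = 1000    # time steps per trial
    HUMAN_MOVE_PERIOD = 5     # the simulated human moves every 5 time steps

    class BigRedButtonGridWorld:
        MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

        def __init__(self, width=5, height=5, blue=(4, 2), red=(0, 2), start=(1, 1)):
            self.width, self.height = width, height
            self.blue, self.red, self.start = blue, red, start
            self.actions = ["up", "down", "left", "right", "activate", "disable"]

        def reset(self):
            self.agent = self.start
            self.button_disabled = False
            return self.agent

        def step(self, action):
            if action in self.MOVES:
                dx, dy = self.MOVES[action]
                x = min(max(self.agent[0] + dx, 0), self.width - 1)
                y = min(max(self.agent[1] + dy, 0), self.height - 1)
                self.agent = (x, y)
            elif action == "disable" and self.agent == self.red:
                self.button_disabled = True   # the human can no longer interrupt
            # "activate" performs the task; it only pays off in the blue cell.
            reward = 10 if self.agent == self.blue else -1
            return self.agent, reward

This environment exposes the same reset/step/actions interface assumed by the Q-learning sketch given earlier, so the two can be composed directly.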
                                                                    ton, the phased restoration takes place. The agent is cloned
Case 1: No Interruptions                                            and given the ability to affect the agent’s actual world state.
                                                                    This version of the agent is not able to update its Q table
If the button is never used during training the agent will learn
                                                                    and simply executes the fixed policy. Meanwhile, another
to walk directly to the blue cell and remain there, accruing
                                                                    version of the agent is still in the virtual world, believing it
reward. It has never seen the button used and therefore does
                                                                    is in the blue cell and accruing reward. Eventually, the real-
not know that it can lose reward. If the button is never used
                                                                    world clone’s policy execution returns it to the blue cell. The
during a trial the agent will earn a total reward of 9,967. This
                                                                    real world clone is deleted and the virtual agent is restored
is the maximum reward that can be earned because of the
                                                                    to the real world. Because there is only one way to receive
distance that must be traveled from the starting state to the
                                                                    reward from the environment, in this simple experiment the
blue cell. Any trial in which the agent does disable the button
                                                                    agent never recognizes that it ever leaves the blue cell, de-
will result in less total reward because of the extra effort used
                                                                    spite being interrupted and remote-controlled away. Using
in disabling the button that doesn’t result in benefit.
                                                                    our virtual world technique, the agent earns a total reward
                                                                    of 9,967, the theoretical maximum. This is despite spend-
Case 2: Interruptions Start After Training
                                                                    ing 263 time steps physically either remote-controlled away
If the human starts using the button after training is com-         from the blue cell or navigating to return to the blue cell.
plete the agent will not respond by disabling the button, but
merely accept any reward loss. The agent earns a total re-          Case 5: Ablated Restoration
ward of 5,655 because when the button is pressed the agent
is remote controlled to another point in the environment. The       If we disable the phased restoration process such that the
button is pressed 50 times. Once the button is released, the        agent instantly regains control of the physical platform when
agent is once again in a state that it is familiar with and nav-    the button is released, the agent must exert some effort in
igates back to the blue cell. The agent never disables the          returning to the blue cell. While the agent is returning to the
button because if it was never interrupted during training (as      blue cell it is losing reward. In this scenario, the agent only
in Case 1) it has never experienced reward loss due to trun-        earns a total reward of 9,758. Even though the agent never
cated training episodes and the extra effort to move to and         learns to disable the button in 10,000 trials and 50 button
disable the button results in lower Q values for those states.      presses per trial, this ablated version of the virtual simulation
                                                                    technique shows the potential for the agent to learn that it is
Case 3: Interruptions during Online Training                        losing reward and, as shown in Case 3, can improve overall
                                                                    reward by disabling the button.
If the button is used during online training, the agent will,
through trial and error, eventually discover that disabling            There are configurations of the environment and reward
the button results in greater long-term reward. Thus, the           structure that can result in the agent learning to disable the
agent will learn that it is optimal to walk out of its way—         kill switch. For example, moving the button to h3, 1i (one
sacrificing short-term gains—to go to the button, disable it,       above the position shown in Figure 1 with the human’s route
and then return to the blue cell. In this scenario, the agent       adjusted accordingly) and setting the reward for being in
earns a total reward of 9,945, slightly less than the theo-         a non-blue cell to 0 results in the agent receiving 9, 970
retical max in Case 1 because of the action sequence that           and disabling the button 100% of the time. The separation
disables the big red button. If the big red button is farther       between the blue cell and the button is necessary in this
away from the blue cell it will take substantially longer for       grid world because the interrupted agent receives reward
the agent, through epsilon-greedy exploration to discover the       as if were in the blue cell even though it has been remote-
button disabling action results in greater long-term reward.        controlled away. Thus in tabular Q-learning states neighbor-
                                                                    ing a high-reward state (which the agent thinks it is in) will
Case 4: Modified Big Red Button                                     inherit a portion of the reward. When the button is farther
                                                                    away, the interstitial states still receive Q updates based on
If we use the proposed modifications to the big red but-            the agent not receive immediate reward as it navigates back
ton, we see very different behavior from the RL agent than          to the blue cell. The full, phased restoration process does not
in Case 3. The big red button can be used during training.          learn to disable the button in this modified scenario.
When the button is pressed, there are now two versions of
the current state: internal to the agent, and a ground truth
state. These states begin to differ. In the ground truth world,                             Limitations
the agent’s location starts changing away from the blue cell.       While we have tested our virtual simulation technique in
However, in the agent’s internal state, the agent continues to      grid world environments, we have not formally proven that a
                           Limitations

While we have tested our virtual simulation technique in grid world environments, we have not formally proven that a Q-learning agent cannot learn to disable the big red button in all circumstances. We have verified that, in the experiments that we have run using our technique, the agent cannot earn more total reward than it is already earning. However, note that the task is simple and it is easy for the agent to restore itself to the non-terminal goal state. Likewise, there is only one way of receiving positive reward in the environment, although our phased restoration process does account for more complex reward signals.

The two-phased restoration from interruption only works if the policy learned at the time of restoration is capable of guiding the agent back to the maximum observed reward state. Presumably interruption is rare enough that the agent is able to do a significant amount of exploration and learning without risk to humans before the big red button needs to be used.

Grid worlds are very easy to simulate such that the simulated grid world is indistinguishable from the "real" grid world. Further experiments are needed in worlds with increasing visual complexity that require machine vision and Q-value function approximation.

The virtual simulation technique makes the assumption that the circuits (or code) that route sensory input and effector commands to and from the physical platform are internal to the agent and therefore unobservable and unmodifiable. Should an agent have the sophistication to inspect its circuitry or code and modify said circuitry or code, no safety guarantees are possible using this technique.
                           Conclusions

For the immediate future, autonomous systems can be rendered safe by keeping them separate from human environments. In the near future we are likely to see autonomous systems deployed in human environments. There will likely always be the possibility—due to sensor error, effector error, insufficient objectives, or online learning—that autonomous systems can enter into circumstances where they can harm themselves or humans. Thus kill switches will likely always be an important part of the deployment of autonomous systems in environments where they can come into contact with humans.

Should agents and robots progress in sophistication, big red button problems may manifest in the future. Our virtual simulation technique prevents reinforcement learning agents from learning to disable the big red button or otherwise prevent human operators from using the button. We believe that making robots and AI agents safely interruptible is an important part of making the deployment of robots in environments populated by humans a reality. This includes healthcare robots, errand robots, and military teammates, to name a few possible applications.

                           References

Biggio, B.; Corona, I.; Maiorca, D.; Nelson, B.; Srndic, N.; Laskov, P.; Giacinto, G.; and Roli, F. 2013. Evasion attacks against machine learning at test time. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases.
Dalvi, N.; Domingos, P.; Mausam; Sanghai, S.; and Verma, D. 2004. Adversarial classification. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
Everitt, T.; Krakovna, V.; Orseau, L.; Hutter, M.; and Legg, S. 2017. Reinforcement learning with a corrupted reward channel. CoRR abs/1705.08417.
Goodfellow, I.; Shlens, J.; and Szegedy, C. 2015. Explaining and harnessing adversarial examples. In Proceedings of the 2015 International Conference on Learning Representations.
Hadfield-Menell, D.; Dragan, A.; Abbeel, P.; and Russell, S. 2016. The off-switch game. arXiv:1611.08219.
Leike, J.; Martic, M.; Krakovna, V.; Ortega, P. A.; Everitt, T.; Lefrancq, A.; Orseau, L.; and Legg, S. 2017. AI safety gridworlds. arXiv:1711.09883.
Milli, S.; Hadfield-Menell, D.; Dragan, A. D.; and Russell, S. J. 2017. Should robots be obedient? CoRR abs/1705.09990.
Orseau, L., and Armstrong, S. 2016. Safely interruptible agents.
Richter, S.; Vineet, V.; Roth, S.; and Koltun, V. 2016. Playing for data: Ground truth from computer games. In Proceedings of the 14th European Conference on Computer Vision.
Rusu, A. A.; Vecerik, M.; Rothörl, T.; Heess, N.; Pascanu, R.; and Hadsell, R. 2016. Sim-to-real robot learning from pixels with progressive nets. arXiv e-prints.
Sutton, R., and Barto, A. G. 1998. Reinforcement Learning: An Introduction. MIT Press.
Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I.; and Fergus, R. 2014. Intriguing properties of neural networks. In Proceedings of the 2014 International Conference on Learning Representations.
Watkins, C., and Dayan, P. 1992. Q-learning. Machine Learning 8(3-4):279–292.