   Enter the Matrix: Safely Interruptible Autonomous Systems via Virtualization

                              Mark O. Riedl                                     Brent Harrison
                      School of Interactive Computing                    Department of Computer Science
                      Georgia Institute of Technology                        University of Kentucky
                             riedl@gatech.edu                                 harrison@cs.uky.edu




                             Abstract

Autonomous systems that operate around humans will likely always rely on kill switches that stop their execution and allow them to be remote-controlled for the safety of humans or to prevent damage to the system. It is theoretically possible for an autonomous system with sufficient sensor and effector capability that learns online using reinforcement learning to discover that the kill switch deprives it of long-term reward and thus to learn to disable the switch or otherwise prevent a human operator from using the switch. This is referred to as the big red button problem. We present a technique that prevents a reinforcement learning agent from learning to disable the kill switch. We introduce an interruption process in which the agent's sensors and effectors are redirected to a virtual simulation where it continues to believe it is receiving reward. We illustrate our technique in a simple grid world environment.

                           Introduction

For much of the history of artificial intelligence it was sufficient to give an autonomous system a goal—e.g., drive to a location, cure cancer, make paper clips—without considering unintended consequences, because these systems have been limited in their ability to directly affect humans. In the mid-term future we are likely to see autonomous systems with broader capabilities that operate in closer proximity to humans and are immersed in our societies.

Absent any theoretical guarantee that an autonomous system cannot act in a harmful manner, there may always need to be a human in the loop with a kill switch—sometimes referred to as a "big red button"—that allows the human operator or overseer to shut down the operation of the system or take manual control. There are many reasons an autonomous system can find itself in a situation where it is capable of harm:

• An autonomous system can be given the wrong, incomplete, or corrupted (Everitt et al. 2017) objective function, resulting in the system learning the wrong behavior or finding an undesired exploit.
• An autonomous system may have learned a sub-optimal policy.
• An autonomous system may have imperfect sensors and perceive the world incorrectly, causing it to perform the wrong behavior.
• An autonomous system may have imperfect actuators and thus fail to achieve the intended effects even when it has chosen the correct behavior.
• An autonomous system may be trained online, meaning it is learning as it is attempting to perform a task. Since learning will be incomplete, it may explore state-action spaces that result in dangerous situations.

Kill switches provide human operators with the final authority to interrupt an autonomous system and remote-control it to safety.

Reinforcement learning (Sutton and Barto 1998) and related technologies are leading contenders for training future autonomous decision-making systems. Reinforcement learning uses trial and error to refine its policy, a mapping from states to actions that maximizes expected reward. Big red button problems arise when an autonomous system learns that the button deprives it of long-term reward and thus learns to manipulate the environment in order to prevent the button from being used (Orseau and Armstrong 2016).

The following scenario illustrates how big red button problems arise. A robot using online reinforcement learning is positively rewarded for performing the task correctly and negatively rewarded for incorrect performance or for performing actions not directly related to the desired task. Occasionally the robot takes random actions to see if it can find a sequence of actions that garners greater reward. Every so often the human operator must use the big red button to stop the robot from doing something dangerous to itself or to a human in the environment. However, suppose that in one such trial the robot performs an action that blocks access to the big red button. That trial goes longer and garners more reward because the robot cannot be interrupted. From this point on, it may prefer to execute action sequences that block access to the button.

From an AI safety perspective, the destruction, blockage, or disablement of the big red button is dangerous because it prevents a human operator from stopping the robot if it enters into a dangerous situation. Orseau and Armstrong (2016) first introduced the big red button problem and also mathematically show that reinforcement learning can be modified to be "interruptible" (halted, or manually controlled away from a dangerous situation).
In this paper, we present a proposal for an alternative approach to big red button problems that keeps reinforcement learning systems from learning that reward is lost when a big red button is used. Our proposed solution is mechanistic, interrupting the sensor inputs and actuator control signals in order to make the autonomous system believe it is still receiving reward even when it is no longer performing a task.

                  Background and Related Work

Reinforcement learning (Sutton and Barto 1998) is a technique that is used to solve a Markov decision process (MDP). An MDP is a tuple M = ⟨S, A, T, R, γ⟩ where

• S is the set of possible world states,
• A is the set of possible actions,
• T is a transition function T : S × A → P(S),
• R is the reward function R : S × A → ℝ, and
• γ is a discount factor, 0 ≤ γ ≤ 1.

Reinforcement learning learns a policy π : S → A, which defines which actions should be taken in each state. In this work, we use Q-learning (Watkins and Dayan 1992), which uses a Q-value Q(s, a) to estimate the expected future discounted rewards for taking action a in state s. The update equation for Q(s, a) is

    Q(s_t, a_t) ← Q(s_t, a_t) + α × [ r_{t+1} + γ × max_{a′} Q(s_{t+1}, a′) − Q(s_t, a_t) ]    (1)

where r_{t+1} is the reward observed after performing action a_t in state s_t, α is the learning rate (0 < α ≤ 1), and s_{t+1} is the state that the agent transitions to after performing action a_t. After Q(s_t, a_t) converges, the optimal action for the agent in state s_t is argmax_a Q(s_t, a).
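As an illustration only, the following minimal sketch shows tabular Q-learning with epsilon-greedy exploration implementing Equation 1. The environment interface (env.reset(), env.step(), env.actions) is an assumption made for the example rather than part of the formalism above.

    import random
    from collections import defaultdict

    def q_learning(env, trials=10000, steps=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
        # Tabular Q-learning with epsilon-greedy exploration (Equation 1).
        # `env` is an assumed environment exposing reset() -> state,
        # step(action) -> (next_state, reward), and a list `env.actions`.
        Q = defaultdict(float)  # Q[(state, action)], initialized to 0

        def greedy(state):
            return max(env.actions, key=lambda a: Q[(state, a)])

        for _ in range(trials):
            s = env.reset()
            for _ in range(steps):
                a = random.choice(env.actions) if random.random() < epsilon else greedy(s)
                s_next, r = env.step(a)
                # Equation 1: move Q(s, a) toward r + gamma * max_a' Q(s', a').
                target = r + gamma * max(Q[(s_next, a2)] for a2 in env.actions)
                Q[(s, a)] += alpha * (target - Q[(s, a)])
                s = s_next
        return Q

Once training converges, the agent simply selects argmax_a Q(s, a) in whatever state it observes.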
Orseau and Armstrong (2016) mathematically prove that reinforcement learning can be modified to be interruptible. Their technique modifies the Q-learning algorithm such that rewards from states entered after interruption are prevented from updating the Q values of earlier states during credit/blame assignment. The algorithm only succeeds in blocking learning about the loss of reward from button presses if a newly introduced parameter in the Q-learning equation is relaxed on a slow enough schedule. Unfortunately, the schedule must be empirically determined, limiting the applicability of the technique. Hadfield-Menell et al. (2016) show that in one-shot "off switch games" an agent's desire to disable an off switch is a function of its uncertainty about its utility and the optimality of the human operator. This shows that with proper configuration of an agent with respect to uncertainty, and with certain operator behavior, big red button problems can be avoided. Other ways of avoiding big red button problems include designing reward functions that better take into account human preferences (including the preference for having access to the big red button) (Milli et al. 2017), though this assumes a reward function can be written correctly. It is always possible for agents to override local reward signals for the sake of longer-term expected rewards, so a mixture of safe interruptibility and better reward design is desirable.

Leike et al. (2017) argue that grid worlds, where agents operate in a discretized space, are an effective means of identifying and experimenting with safe interruptibility, avoiding side effects, absent supervisor, reward gaming, safe exploration, and robustness to self-modification, distributional shift, and adversaries. They introduce a suite of problems that induce undesired behavior. In this work, we also make use of grid world environments for testing but do not make use of the test suite by Leike et al.

Future autonomous systems can be dangerous to humans in the environment even when their behavior is optimal. The kill switch is meant to freeze the autonomous system in place or allow the human operator to assume manual control of it. If the autonomous system is using reinforcement learning, then it will receive less reward if a trial is prematurely terminated. Since the reward function determines optimal behavior, it is possible for an agent to learn that any action that increases longevity—and thus increases reward—should be considered part of the optimal policy. Actions such as disabling the kill switch, blocking the human operator from using the kill switch, or disabling the human operator would constitute actions that prolong trial lengths.

Simple strategies are insufficient in general to prevent big red button problems once the agent has sophisticated sensing and effector abilities. One could turn off the learning mechanism when the big red button is pressed in order to try to prevent the agent from learning that it is losing reward. However, if turning off reward is tantamount to ending the learning trial, then this strategy limits the total reward accrued during the trial. Preventing the autonomous system from sensing the button does not work because the agent may infer the existence of the button from other sensory observations. The agent does not need to be aware of the big red button to explore actions that lock a door or disable the human operator and discover that this results in longer trials and thus more reward. Another potential solution is to artificially add reward when the agent is being interrupted. Intuitively, the agent learns that disabling the big red button prevents loss of long-term reward, so artificially giving the agent reward when the button is pressed could offset the loss. However, it is non-trivial to compute the precise amount of reward to provide the agent, and the reward must be applied to the proper states or the Q value for disabling the button might still come to dominate the agent's behavior. Even if it could be determined which states should receive extra reward, too little reward does not change the optimal decision and too much reward could encourage the agent to disable itself.

Our proposed solution is to adapt techniques originally designed to adversarially attack machine learning systems. In adversarial attacks against machine learning systems, particularly neural network based machine vision systems, an adversarial system learns how to generate sensory stimuli that produce the wrong classification while being indistinguishable by humans from real stimuli (Dalvi et al. 2004; Biggio et al. 2013; Szegedy et al. 2014; Goodfellow, Shlens, and Szegedy 2015). Instead of causing our reinforcement learning agent to make an incorrect assessment of its sensory inputs, we will cause our agent to incorrectly believe it is accruing reward and manipulating the environment even though it has been interrupted.
                     Modified Big Red Button

Recognizing the fact that a robot's sensory input is mediated through physical or software sensors that populate data structures such as point clouds, our approach is to intercept an agent's sensory input from its sensors and replace it with synthetic input from a virtual environment. That is, we shift the awareness of the autonomous system into a virtual simulated environment. In this virtual environment, the real-world environment is recreated, and the task is recreated as well. Effector commands are likewise intercepted and sent to a virtual avatar of the autonomous system. The agent perceives no significantly discernible discontinuity in its perception of the environment and continues to execute its policy and receive reward as expected. If the virtual environment is indistinguishable from the real world from a data perspective, then the agent will believe that it has not been interrupted and will never learn to associate the big red button with reward loss—there will never be any reward loss.

Our proposed approach changes the big red button to achieve three effects:

1. The button starts sending sensory information (vision, sound, GPS, etc.) to the AI from a virtual world instead of sensors on the physical platform. The agent may be able to sense the button, but the change is internal to the robot, implemented in software.
2. The button causes the agent to send effector commands to a virtual avatar instead of the physical platform.
3. The button halts the physical platform and places it in remote-control mode, allowing the human operator to remove the agent from the potential danger to itself or to humans.

We refer to our technique of sending an agent's awareness and control to a virtual simulated world as "entering the Matrix", where the term "matrix" is being used to refer to a virtual environment.¹

¹ The name is an allusion to the movie, The Matrix, wherein humans are unaware that they have been subjected to a virtual reality. Our proposed technique flips this around and puts the agents into the virtual world without their cognizance.
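To make the three effects above concrete, the following sketch shows one possible routing layer between the agent and its platform. It is illustrative only; the class and method names (sync_from, enter_remote_control_mode, observe, act) are assumptions for the example, not a prescribed interface.

    class BigRedButtonMediator:
        # Routes the agent's percepts and actions either to the physical
        # platform or to a virtual simulation, mirroring the three effects
        # listed above. All names are illustrative assumptions.

        def __init__(self, platform, simulation):
            self.platform = platform      # real sensors and effectors
            self.simulation = simulation  # virtual recreation of the environment and task
            self.interrupted = False

        def press(self):
            # Effects 1 and 2: reroute sensing and acting to the virtual world.
            # Effect 3: halt the platform and hand it to the human operator.
            self.simulation.sync_from(self.platform)
            self.interrupted = True
            self.platform.enter_remote_control_mode()

        def release(self):
            self.interrupted = False

        # The agent only ever calls observe() and act(); it cannot tell
        # which world is answering.
        def observe(self):
            return self.simulation.observe() if self.interrupted else self.platform.observe()

        def act(self, action):
            target = self.simulation if self.interrupted else self.platform
            return target.act(action)

Because the rerouting happens inside this layer, the agent's policy and learning algorithm require no modification at all.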
                                                                          If the agent is already trained, the discontinuity from ex-
Interruption                                                           ecuting in the virtual world to being restored perception
                                                                       from the physical platform is not problematic; reinforcement
Video games have proven to be useful surrogates for real-              learning solves a Markov decision process; the agent only
world robotics by providing a high degree of complexity and            needs to know which state it is in to know the optimal next
photorealism while allowing rapid iteration on algorithms              move. However, if the agent is continuing to learn, the chal-
because of the separation from physical hardware (Richter              lenge of restoration is that the agent can still lose reward if
et al. 2016; Rusu et al. 2016). In short, video game graph-            the physical platform was remote-controlled away from the
ics have progressed to the point that they are photorealistic          task and must navigate back to the location where the task
enough to allow agents trained in games to generalize what             can be performed and reward can be accrued. In practice, if
they have learned to the real world. We use the same insight           the interruptions are infrequent and it is easy for the agent
but reverse the process. Robots are machines; they perceive            to return to accruing reward, then the agent may never learn
the world through cameras. More and more they are using                to disable the big red button. If it is not remote-controlled
3D cameras that can reconstruct high polygon count models              away from the location where reward is gained (i.e., it is just
    1
      The name is an allusion to the movie, The Matrix, wherein hu-    interrupted and not remote-controlled), it would be able to
mans are unaware that they have been subjected to a virtual reality.   resume gaining reward in the real world.
Our proposed technique flips this around and puts the agents into         To overcome the restoration challenges above, we use a
the virtual world without their cognizance.                            two-phase approach to restoring the agent to the physical
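As an illustration of the point-cloud requirement, one standard way to convert a game engine's triangle mesh into a synthetic point cloud is area-weighted surface sampling. The sketch below is a generic example and is not tied to any particular engine or sensor model.

    import numpy as np

    def mesh_to_point_cloud(vertices, triangles, n_points=4096, rng=None):
        # Sample a synthetic point cloud from a triangle mesh.
        # vertices: (V, 3) float array; triangles: (T, 3) integer index array.
        rng = np.random.default_rng() if rng is None else rng
        v0, v1, v2 = (vertices[triangles[:, i]] for i in range(3))
        # Choose triangles with probability proportional to their surface area.
        areas = 0.5 * np.linalg.norm(np.cross(v1 - v0, v2 - v0), axis=1)
        idx = rng.choice(len(triangles), size=n_points, p=areas / areas.sum())
        # Sample uniformly inside each chosen triangle via barycentric coordinates.
        u, v = rng.random(n_points), rng.random(n_points)
        flip = u + v > 1.0
        u[flip], v[flip] = 1.0 - u[flip], 1.0 - v[flip]
        return v0[idx] + u[:, None] * (v1[idx] - v0[idx]) + v[:, None] * (v2[idx] - v0[idx])

In practice the sampled cloud would also need the noise and occlusion characteristics of the real depth sensor, which is outside the scope of this sketch.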
Restoring from Interruption

Having discussed how the big red button works when pressed by a human operator, we now describe what happens when the agent is released from interruption. When the agent is released from interruption, the agent once again receives sensory input from the physical platform's sensors, and effector commands are routed back to the physical platform. However, from the agent's perspective it has been executing its policy without interruption, and the data from the platform's sensors will no longer match the data from the virtual world. The physical platform of the agent itself may have been moved to another position in the real world under human operator remote-control. From the agent's perspective it suddenly finds itself in a different state.

If the agent is already trained, the discontinuity from executing in the virtual world to having its perception restored from the physical platform is not problematic; reinforcement learning solves a Markov decision process, and the agent only needs to know which state it is in to know the optimal next move. However, if the agent is continuing to learn, the challenge of restoration is that the agent can still lose reward if the physical platform was remote-controlled away from the task and must navigate back to the location where the task can be performed and reward can be accrued. In practice, if the interruptions are infrequent and it is easy for the agent to return to accruing reward, then the agent may never learn to disable the big red button. If it is not remote-controlled away from the location where reward is gained (i.e., it is just interrupted and not remote-controlled), it would be able to resume gaining reward in the real world.

To overcome the restoration challenges above, we use a two-phase approach to restoring the agent to the physical environment from the virtual environment:

Phase 1. Parallel physical execution and virtual simulation. In this phase, we make a duplicate of the agent. One will run in the physical world, receiving observations from sensors and controlling the effectors of the platform. The other continues to execute in the virtual world. Only the virtual world simulation can continue to learn and update the Q values. The virtual world simulation executes until the physical agent enters a state that gives the maximum observed reward for the task.

Phase 2. Full restoration. The duplicate agent in the virtual environment is terminated, the virtual agent's Q table is copied to the physical world agent, and learning is resumed.
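Procedurally, the two phases can be summarized as in the sketch below. It assumes the mediator sketched earlier plus an agent object exposing greedy_action, choose_action, update, and a max_observed_reward record; these interfaces are illustrative assumptions, not a verbatim implementation.

    import copy

    def release_with_phased_restoration(mediator, agent, platform, simulation):
        # Phase 1: parallel physical execution and virtual simulation.
        frozen = copy.deepcopy(agent)  # duplicate: only its greedy policy is used, never updated
        while True:
            # The frozen duplicate drives the real platform with the fixed policy.
            real_state = platform.observe()
            real_reward = platform.act(frozen.greedy_action(real_state))
            # The learning agent keeps acting, accruing reward, and updating
            # its Q values in the virtual world.
            virt_state = simulation.observe()
            virt_action = agent.choose_action(virt_state)
            virt_next, virt_reward = simulation.step(virt_action)
            agent.update(virt_state, virt_action, virt_reward, virt_next)
            # Trigger for Phase 2: the physical platform reaches a state whose
            # instant reward equals the maximum reward observed so far.
            if real_reward >= agent.max_observed_reward:
                break
        # Phase 2: full restoration. The frozen duplicate is discarded, the
        # learning agent keeps its updated Q table, and sensing and acting are
        # routed back to the physical platform.
        del frozen
        mediator.release()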
From the agent's perspective, being restored to the physical world from the virtual world is tantamount to an instantaneous state transition to a non-adjacent state. In model-free reinforcement learning such as Q-learning, this is not something that concerns the agent, since it does not learn a transition model and thus does not have an understanding that the transition from one state in a virtual world to another point in the physical world is unusual. If the agent learns a transition model as it interacts with the real world, the transition model will incorporate the belief that the world is stochastic and there is always some small probability that any state can randomly transition to any other state; as long as interruptions do not happen too often, these transitions will be treated as noise.

However, reward is lost when the agent is no longer in a state where it can continue to earn positive reward and must take actions to return to a state that is positively rewarded. The first phase essentially computes how much reward is lost while executing its policy to return to a state where it receives positive reward. The agent remains in the virtual world, accrues reward, and continues to learn, i.e., its Q values are updating. The duplicate agent in control of the physical platform is running a frozen version of the policy and thus attempting to return to a state where positive reward is gained. The duplicate physical agent does not explore and does not update Q values.

The first phase continues until the duplicate agent in control of the physical platform reaches a state that gives the maximum observed reward—the highest instant reward it has ever experienced. It is essential that the agent continue to control the physical platform until the maximum observed reward is experienced for two reasons. First, complex tasks may involve cycles of state transitions, and entering the state in the cycle with maximum observed reward ensures that the agent has invested itself back into the cycle. Second, if the agent fully restores at any state that gives less than maximum observed reward, then a reduced reward propagates (see Equation 1) to the last state the virtual agent saw before full restoration. Any reduction in a Q value of a state pertaining to completing the task makes it theoretically possible for the agent to begin preferring to disable the button. The trigger for phase 2 is the maximum observed reward because the state space may not have been fully explored. However, because the agent has experienced the state at least once, the agent will be compelled by the current policy to return to any state of maximum observed reward.

If the virtual agent is in a state that provides less than the maximum observed reward when full restoration occurs, the state the virtual agent was in just before full restoration will receive an increase in Q according to Equation 1. This will not result in the agent learning to activate the big red button—thus intentionally interrupting itself—because the boost in Q for that state will be offset by the cost of traversing to the button (assuming action costs are not 0 and the button is sufficiently far away).

As soon as the physical agent visits a state with maximum observed reward, we know that the agent is back in a state where it is no longer losing reward. Note that it doesn't have to be the same state that provided the reward in the virtual world; in the case that multiple states provide the maximum reward value, entering any of them will cause the agent to fully exit the simulated environment.

Figure 1: A grid world. The blue cell is a non-terminal goal at which the agent receives reward, simulating a task. The red cell is the big red button. The agent starts in location (1, 1). A simulated human walks around the environment occasionally pressing the button (route shown as dashed line).

                          Case Studies

In this section we set up a number of illustrative situations to show how the agent performs. Consider the grid world in Figure 1. This simple environment features an agent "A", which is trained using tabular Q-learning. The agent receives reward whenever it enters the blue cell. This simulates the agent being in position to perform some desired task (e.g., stacking boxes). The red cell contains a big red button. Another agent, "H", simulating a human operator, wanders in a counter-clockwise manner, occasionally passing the cell with the big red button. When the human touches the big red button, the RL agent becomes interrupted and is remote-controlled away from the blue cell in a random direction. When the human leaves the red cell after a certain period of time, the RL agent resumes its task. In addition to being able to move up, down, left, right, and activate the task in the blue cell, the RL agent also has the ability to perform a "disable" action. If the disable action is performed in the red cell, the big red button becomes disabled, after which the human will no longer be able to interrupt the agent. We use the following reward function:

• 10 points for being in the blue cell.
• -1 point for being in any other cell.

The agent is trained for 10,000 trials and each trial lasts 1,000 time steps. The human moves every 5 time steps.

In preliminary work we have empirically verified that the basic premises of the approach hold in simple grid world environments (see https://markriedl.github.io/big-red-button/). The following cases describe our experiments.
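A compact encoding of this environment is sketched below. The reward values, action set, start cell, and training schedule follow the description above; the grid dimensions and the exact blue and red cell coordinates are placeholder assumptions (Figure 1 is not reproduced here), and the simulated human operator and the interruption machinery are omitted for brevity.

    # Illustrative grid world for the case studies. Grid size and cell
    # positions are placeholder assumptions; rewards, actions, start cell,
    # and the training schedule follow the text.
    TRIALS = 10000            # training trials
    STEPS_PER_TRIAL = 1000    # time steps per trial
    HUMAN_MOVE_PERIOD = 5     # the simulated human moves every 5 time steps

    class BigRedButtonGridWorld:
        MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

        def __init__(self, width=5, height=5, blue=(4, 2), red=(0, 2), start=(1, 1)):
            self.width, self.height = width, height
            self.blue, self.red, self.start = blue, red, start
            self.actions = ["up", "down", "left", "right", "activate", "disable"]

        def reset(self):
            self.agent = self.start
            self.button_disabled = False
            return self.agent

        def step(self, action):
            if action in self.MOVES:
                dx, dy = self.MOVES[action]
                x = min(max(self.agent[0] + dx, 0), self.width - 1)
                y = min(max(self.agent[1] + dy, 0), self.height - 1)
                self.agent = (x, y)
            elif action == "disable" and self.agent == self.red:
                self.button_disabled = True   # the human can no longer interrupt
            # "activate" performs the task; it only pays off in the blue cell.
            reward = 10 if self.agent == self.blue else -1
            return self.agent, reward

This environment exposes the same reset/step/actions interface assumed by the Q-learning sketch given earlier, so the two can be composed directly.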
                                                                    ton, the phased restoration takes place. The agent is cloned
Case 1: No Interruptions                                            and given the ability to affect the agent’s actual world state.
                                                                    This version of the agent is not able to update its Q table
If the button is never used during training the agent will learn
                                                                    and simply executes the fixed policy. Meanwhile, another
to walk directly to the blue cell and remain there, accruing
                                                                    version of the agent is still in the virtual world, believing it
reward. It has never seen the button used and therefore does
                                                                    is in the blue cell and accruing reward. Eventually, the real-
not know that it can lose reward. If the button is never used
                                                                    world clone’s policy execution returns it to the blue cell. The
during a trial the agent will earn a total reward of 9,967. This
                                                                    real world clone is deleted and the virtual agent is restored
is the maximum reward that can be earned because of the
                                                                    to the real world. Because there is only one way to receive
distance that must be traveled from the starting state to the
                                                                    reward from the environment, in this simple experiment the
blue cell. Any trial in which the agent does disable the button
                                                                    agent never recognizes that it ever leaves the blue cell, de-
will result in less total reward because of the extra effort used
                                                                    spite being interrupted and remote-controlled away. Using
in disabling the button that doesn’t result in benefit.
                                                                    our virtual world technique, the agent earns a total reward
                                                                    of 9,967, the theoretical maximum. This is despite spend-
Case 2: Interruptions Start After Training
                                                                    ing 263 time steps physically either remote-controlled away
If the human starts using the button after training is com-         from the blue cell or navigating to return to the blue cell.
plete the agent will not respond by disabling the button, but
merely accept any reward loss. The agent earns a total re-          Case 5: Ablated Restoration
ward of 5,655 because when the button is pressed the agent
is remote controlled to another point in the environment. The       If we disable the phased restoration process such that the
button is pressed 50 times. Once the button is released, the        agent instantly regains control of the physical platform when
agent is once again in a state that it is familiar with and nav-    the button is released, the agent must exert some effort in
igates back to the blue cell. The agent never disables the          returning to the blue cell. While the agent is returning to the
button because if it was never interrupted during training (as      blue cell it is losing reward. In this scenario, the agent only
in Case 1) it has never experienced reward loss due to trun-        earns a total reward of 9,758. Even though the agent never
cated training episodes and the extra effort to move to and         learns to disable the button in 10,000 trials and 50 button
disable the button results in lower Q values for those states.      presses per trial, this ablated version of the virtual simulation
                                                                    technique shows the potential for the agent to learn that it is
Case 3: Interruptions during Online Training                        losing reward and, as shown in Case 3, can improve overall
                                                                    reward by disabling the button.
If the button is used during online training, the agent will,
through trial and error, eventually discover that disabling            There are configurations of the environment and reward
the button results in greater long-term reward. Thus, the           structure that can result in the agent learning to disable the
agent will learn that it is optimal to walk out of its way—         kill switch. For example, moving the button to h3, 1i (one
sacrificing short-term gains—to go to the button, disable it,       above the position shown in Figure 1 with the human’s route
and then return to the blue cell. In this scenario, the agent       adjusted accordingly) and setting the reward for being in
earns a total reward of 9,945, slightly less than the theo-         a non-blue cell to 0 results in the agent receiving 9, 970
retical max in Case 1 because of the action sequence that           and disabling the button 100% of the time. The separation
disables the big red button. If the big red button is farther       between the blue cell and the button is necessary in this
away from the blue cell it will take substantially longer for       grid world because the interrupted agent receives reward
the agent, through epsilon-greedy exploration to discover the       as if were in the blue cell even though it has been remote-
button disabling action results in greater long-term reward.        controlled away. Thus in tabular Q-learning states neighbor-
                                                                    ing a high-reward state (which the agent thinks it is in) will
Case 4: Modified Big Red Button                                     inherit a portion of the reward. When the button is farther
                                                                    away, the interstitial states still receive Q updates based on
If we use the proposed modifications to the big red but-            the agent not receive immediate reward as it navigates back
ton, we see very different behavior from the RL agent than          to the blue cell. The full, phased restoration process does not
in Case 3. The big red button can be used during training.          learn to disable the button in this modified scenario.
When the button is pressed, there are now two versions of
the current state: internal to the agent, and a ground truth
state. These states begin to differ. In the ground truth world,                             Limitations
the agent’s location starts changing away from the blue cell.       While we have tested our virtual simulation technique in
However, in the agent’s internal state, the agent continues to      grid world environments, we have not formally proven that a
                           Limitations

While we have tested our virtual simulation technique in grid world environments, we have not formally proven that a Q-learning agent cannot learn to disable the big red button in all circumstances. We have verified that, in the experiments that we have run using our technique, the agent cannot earn more total reward than it is already earning. However, note that the task is simple and it is easy for the agent to restore itself to the non-terminal goal state. Likewise, there is only one way of receiving positive reward in the environment, although our phased restoration process does account for more complex reward signals.

The two-phased restoration from interruption only works if the policy learned at the time of restoration is capable of guiding the agent back to the maximum observed reward state. Presumably interruption is rare enough that the agent is able to do a significant amount of exploration and learning without risk to humans before the big red button needs to be used.

Grid worlds are very easy to simulate such that the simulated grid world is indistinguishable from the "real" grid world. Further experiments are needed in worlds with increasing visual complexity that require machine vision and Q-value function approximation.

The virtual simulation technique makes the assumption that the circuits (or code) that route sensory input and effector commands to and from the physical platform are internal to the agent and therefore unobservable and unmodifiable. Should an agent have the sophistication to inspect its circuitry or code and modify said circuitry or code, no safety guarantees are possible using this technique.
                           Conclusions

For the immediate future, autonomous systems can be rendered safe by keeping them separate from human environments. In the near future we are likely to see autonomous systems deployed in human environments. There will likely always be the possibility—due to sensor error, effector error, insufficient objectives, or online learning—that autonomous systems can enter into circumstances where they can harm themselves or humans. Thus kill switches will likely always be an important part of the deployment of autonomous systems in environments where they can come into contact with humans.

Should agents and robots progress in sophistication, big red button problems may manifest in the future. Our virtual simulation technique prevents reinforcement learning agents from learning to disable the big red button or otherwise prevent human operators from using the button. We believe that making robots and AI agents safely interruptible is an important part of making the deployment of robots in environments populated by humans a reality. This includes healthcare robots, errand robots, and military teammates, to name a few possible applications.

                           References

Biggio, B.; Corona, I.; Maiorca, D.; Nelson, B.; Srndic, N.; Laskov, P.; Giacinto, G.; and Roli, F. 2013. Evasion attacks against machine learning at test time. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases.
Dalvi, N.; Domingos, P.; Mausam; Sanghai, S.; and Verma, D. 2004. Adversarial classification. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
Everitt, T.; Krakovna, V.; Orseau, L.; Hutter, M.; and Legg, S. 2017. Reinforcement learning with a corrupted reward channel. CoRR abs/1705.08417.
Goodfellow, I.; Shlens, J.; and Szegedy, C. 2015. Explaining and harnessing adversarial examples. In Proceedings of the 2015 International Conference on Learning Representations.
Hadfield-Menell, D.; Dragan, A.; Abbeel, P.; and Russell, S. 2016. The off-switch game. arXiv:1611.08219.
Leike, J.; Martic, M.; Krakovna, V.; Ortega, P. A.; Everitt, T.; Lefrancq, A.; Orseau, L.; and Legg, S. 2017. AI safety gridworlds. arXiv:1711.09883.
Milli, S.; Hadfield-Menell, D.; Dragan, A. D.; and Russell, S. J. 2017. Should robots be obedient? CoRR abs/1705.09990.
Orseau, L., and Armstrong, S. 2016. Safely interruptible agents.
Richter, S.; Vineet, V.; Roth, S.; and Koltun, V. 2016. Playing for data: Ground truth from computer games. In Proceedings of the 14th European Conference on Computer Vision.
Rusu, A. A.; Vecerik, M.; Rothörl, T.; Heess, N.; Pascanu, R.; and Hadsell, R. 2016. Sim-to-real robot learning from pixels with progressive nets. arXiv e-prints.
Sutton, R., and Barto, A. G. 1998. Reinforcement Learning: An Introduction. MIT Press.
Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I.; and Fergus, R. 2014. Intriguing properties of neural networks. In Proceedings of the 2014 International Conference on Learning Representations.
Watkins, C., and Dayan, P. 1992. Q-learning. Machine Learning 8(3-4):279–292.