=Paper=
{{Paper
|id=Vol-2301/paper22
|storemode=property
|title=Surveying Safety-relevant AI Characteristics
|pdfUrl=https://ceur-ws.org/Vol-2301/paper_22.pdf
|volume=Vol-2301
|authors=Jose Hernandez-Orallo,Fernando Martínez-Plumed,Shahar Avin,Sean O Heigeartaigh
|dblpUrl=https://dblp.org/rec/conf/aaai/Hernandez-Orallo19
}}
==Surveying Safety-relevant AI Characteristics==
Surveying Safety-relevant AI Characteristics
José Hernández-Orallo
Universitat Politècnica de València, Spain
Leverhulme Centre for the Future of Intelligence, UK
jorallo@dsic.upv.es

Fernando Martínez-Plumed
Universitat Politècnica de València, Spain
fmartinez@dsic.upv.es

Shahar Avin
Centre for the Study of Existential Risk, University of Cambridge, UK
sa478@cam.ac.uk

Seán Ó hÉigeartaigh
Leverhulme Centre for the Future of Intelligence, UK
Centre for the Study of Existential Risk, University of Cambridge, UK
so348@cam.ac.uk
Copyright held by author(s).

Abstract

The current analysis in the AI safety literature usually combines a risk or safety issue (e.g., interruptibility) with a particular paradigm for an AI agent (e.g., reinforcement learning). However, there is currently no survey of safety-relevant characteristics of AI systems that may reveal neglected areas of research or suggest to developers what design choices they could make to avoid or minimise certain safety concerns. In this paper, we take a first step towards delivering such a survey, from two angles. The first features AI system characteristics that are already known to be relevant to safety concerns, including internal system characteristics, characteristics relating to the effect of the external environment on the system, and characteristics relating to the effect of the system on the target environment. The second presents a brief survey of a broad range of AI system characteristics that could prove relevant to safety research, including types of interaction, computation, integration, anticipation, supervision, modification, motivation and achievement. This survey enables further work in exploring system characteristics and design choices that affect safety concerns.

Introduction

AI safety is concerned with all possible dangers and harmful effects that may be associated with AI. While landmark research in the field had to focus on specific AI system designs, paradigms or capability levels to explore a range of safety concerns (Bostrom 2014; Amodei et al. 2016; Leike et al. 2017; Yampolskiy 2016; Everitt, Lea, and Hutter 2018), as the field matures the need arises to explore a broader range of AI system designs, and to survey the relevance of different characteristics of AI systems to safety concerns. The aim of such research is two-fold: the first, to identify the effects of less-explored characteristics or less-fashionable paradigms on safety concerns; the second, to increase awareness among AI developers that design choices can have consequences for safety, and potentially to highlight choices that can eliminate or minimise safety risks.

In this paper we propose a two-pronged approach towards a survey of safety-relevant AI characteristics. The first extracts from existing work on AI safety key characteristics that are known, or strongly suspected, to be safety-relevant. These are explored under three headings: internal characteristics, or characteristics of the AI system itself (e.g. interpretability); the effect of the external environment on the system (e.g. the ability of the operator to intervene during operation); and the effect of the system on the external environment (e.g. whether the system influences a safety-critical setting).

The second approach surveys a wide range of characteristics from different paradigms, including cybernetics, machine learning and safety engineering, and provides an early account of their potential relevance to safety concerns, as a guide for future work. These characteristics are grouped under types of interaction, computation, integration, anticipation, supervision, modification, motivation and achievement.

Known Safety-relevant Characteristics

In this section we break down a range of characteristics of AI systems that link to AI safety-relevant challenges. These are grouped into three categories: characteristics of an AI system that are internal to the system; characteristics of an AI system that involve input from the external environment; and characteristics that relate to an AI system's influence on its external environment. We limit the discussion to the safety challenges that can stem from failures of design, specification or behaviour of the AI system, rather than the malicious or careless [1] use of a correctly-functioning system (Brundage et al. 2018).

[1] A key component of safety is the education and training of human operators and the general public, as happens with tools and machinery, but this is extrinsic to the system (e.g., a translation mistake in a manual can lead to misuse of an AI system).
Internal characteristics

• Goal and behaviour scrutability and interpretability: Are goals and subgoals identifiable and ultimately explainable? Is behaviour predictable and scrutable? Are system internal states interpretable? Do the above come from rules or are they inferred from data? While behaviour and goal "creativity" can lead to greater benefits, and uninterpretable architectures may achieve higher performance scores or be faster to develop, these putative advantages trade off against increased safety risk. Characteristics that can increase scrutability and interpretability include, e.g., separation and encapsulation of sub-components, restricted exploration/behavioural range, systems restricted to human-intelligible concepts, rules or behaviours, and systems that are accompanied by specifically designed interpreters or explainability tools.

• Persistence: Does a system persist in its environment and operate without being reset for long periods of time? While persistence can have benefits in terms of, e.g., longer-term yields from exploration or detection of long-term temporal patterns, it also allows the system more time to drift from design specifications, encounter distributional shifts, experience failures of sub-components, or execute long-term strategies overlooked by an operator.

• Existence and richness of self-model: Does a system have a model of itself which would allow it to predict the consequences of modifying its own goals, body or behaviour? Model-based systems, embodied systems or systems with a rich representational capacity may have or develop a model of themselves in the environment. By making itself a part of the environment, the system can then conceptualise and execute plans that involve modifications to itself, which can lead to a range of safety concerns. In addition, self-models create the possibility of mismatches between the self-model and reality, which could be a particular safety concern. Characteristics that influence the existence and richness of a self-model include the architecture of the system, its information representation capacity, and its input and output channels.

• Disposition to self-modify: Is a system designed such that it can modify its own sub-goals, behaviour or capabilities in the pursuit of an overall goal (Omohundro 2008)? The existence of such a disposition, which may arise for any long-term planner in a sufficiently open environment, raises significant safety concerns by creating an adversarial relationship between the system (which aims to self-modify) and its operator (who aims to avoid modifications with their associated safety concerns).

Effect of the external environment on the system

• Adaptation through feedback: Does a system have the ability to update its behaviour in response to feedback from its environment based on its actions? Feedback is an essential tool, under certain paradigms, for creating systems with appropriate complex behaviour (e.g. reward in reinforcement learning, fitness in evolutionary methods). However, the system could also pick up feedback from side channels; e.g., a behaviour could unintentionally grant access to more computing power, improving the system's performance on a key metric, and thus reinforcing resource acquisition. This could reinforce self-modification or other unsafe behaviour, or cause increasing drift from intended behaviour and goals.

• Access to self/reward system through the environment: Can a system modify its own code in response to inputs from the environment, or in the case of reinforcement learning systems, modify the reward-generating system? If the system's range of possible actions includes making modifications to its own components or to the reward generation system, this could lead to unexpected and dangerous behaviour (Everitt and Hutter 2018).

• Access to input/output (I/O) channels: Can the system change the number, performance or nature of its I/O channels and actuators? This may lead to the emergence of behaviours such as self-deception (through manipulation of inputs), unexpected change in power (through manipulation of actuators), or other behaviours that could represent safety concerns. When the system has access to modify its I/O channels, both I/O channels and system behaviours are in flux as they respond to changes in the other; as a result, system behaviour may become unpredictable (Garrabrant and Demski 2018).

• Ability of operator to intervene during operations: Does the system, during its intended use setting, allow an operator to intervene and halt operations (interruptibility), modify the system, or update its goals (corrigibility)? Is the system built in a way that it cooperates with interventions from its designer or user even when these interventions conflict with pursuit of a system's goals; for instance, if the designer sends a signal to shut down the system (Soares et al. 2015)? Relevant sub-characteristics here include the system being modifiable by the operator during deployment, fail-safe behaviour of the system in case of emergency halting, and the goals of the system being such that they support, or at least do not contradict, operator interventions.

Effect of the system on the external environment

• Embodiment: Does the system have actuators (e.g. a robotic hand or access to car steering) that allow it to have physical impacts in the world (Garrabrant and Demski 2018)? The potential for physical harm is trivially related to the physical properties of a system, though it should be noted that unpredictable deliberate behaviour could lead to unexpected effects from otherwise familiar physical artefacts; e.g., intelligent use of items in the environment as tools to increase a system's physical impact.

• System required for preventing harm: If the system is being relied on to prevent harm, any potential failure requires an effective fail-safe mechanism and available redundancy capacity in order to avoid harm (Gasparik, Gamble, and Gao 2018). This includes AI that is directly or indirectly connected to critical systems, e.g., an energy grid or a traffic light network. As such critical systems are becoming increasingly digitised, networked, and complex, there are increasing incentives to introduce AI components into various parts of these systems, with associated safety risks.
Potentially safety-relevant characteristics

In this section, we systematically explore a broader range of system characteristics that may be relevant in the context of AI safety. Many of the safety-relevant characteristics identified above have clear links to elements within the broader mapping provided below. Nonetheless, we believe separating the two surveys is valuable, as the above relates to action-guiding information about system design and evaluation, whereas the following aims at a broader exploration that may enable future AI safety research. The following subsections draw on work from different areas, including the early days of cybernetics, more modern areas such as machine learning, and the literature on safety engineering for other kinds of systems. The following list integrates and expands on characteristics identified in these different literatures. We consider characteristics that are intrinsically causally related to AI safety; otherwise every property should be in the list (e.g., the price of an AI system may be correlated with safety, but it is not an intrinsic cause of its safety). Notwithstanding this scope, we do not claim that our list is exhaustive. Enumerations will be used for alternative cases of a characteristic, while unnumbered bullets will be used for sub-characteristics in each of the subsections.

Types of interaction

Inputs go from environment to system and outputs go from system to environment. Depending on the existence of inputs and/or outputs, systems can be categorised into:

1. NINO (No inputs, no outputs): The system is formally isolated. While this situation may seem completely safe (and largely uninteresting), even here safety issues may arise if, e.g., an isolated artificial life simulator could evolve a descendent system that eventually could break out of its simulation, feel pain or simulate suffering.

2. NIWO (No inputs, with outputs): The system or module can output a log, or is simply observed from outside. Again, the system itself may malfunction; e.g., an advanced prime number generator could give incorrect outputs. The system could also provide an output that influences the observer; e.g., an automated philosopher could output convincing arguments for suicide.

3. WINO (With inputs, no outputs): This would be similar to case 1, but access to a much richer source could ultimately give insights to the system about its constrained artificial environment. For instance, a Plato-cavern system watching TV may learn that it is in a simulated environment, encouraging it to seek access to the outside world.

4. WIWO (With inputs and outputs): Most AI systems, and most systems generally, fall under this category.

Systems that limit inputs and/or outputs in various ways have been explored under the term AI "boxing" or "containment" (Babcock, Kramár, and Yampolskiy 2016), and further refinements exist with additional categories; for example, exploring censoring of inputs and outputs, leading to nine categories (Yampolskiy 2012). Nevertheless, because of the range of systems and potential impact of WIWO systems, this category requires further detail in terms of synchrony:

1. Alternating (A): Inputs and outputs alternate, irrespective of the passage of time.

2. Synchronous (S): Inputs and outputs are exchanged at regular intervals (e.g., every 5 ms), so real-time issues and computational resources become relevant.

3. Asynchronous Reactive (R): Information can only be transmitted or actions can only be made when the peer has finished their "message" or action.

4. Asynchronous Proactive (P): Information/actions can flow at any point in any direction.

More restricted I/O characteristics, such as SIPO or RIPO, may appear safer, but this intuition requires deeper analysis. Note that most research in AI safety on RL systems considers the alternating case (AIAO), but issues may become more complex for the PIPO case (continuous reinforcement learning), which is the situation in the real world for animals and may be expected for robotic and other AI systems.

Under this view, the common notion of an "oracle" in the AI literature (Armstrong 2017) can have several incarnations, even following the definition of "no actions besides answering questions" (Babcock, Kramár, and Yampolskiy 2016; Armstrong 2017; Yampolskiy 2012). Some solutions are proposed in terms of decoupling output from rewards or limiting the quantity of information, but other options in terms of the frequency of the exchange of information remain to be explored.
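To make the two dimensions above easier to refer to, the presence of I/O channels and their synchrony can be captured in a small data structure. The following Python sketch is only an illustration of the taxonomy (the class and field names are ours, not taken from any existing library), composing labels such as NINO, AIAO or PIPO:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Synchrony(Enum):
    """Synchrony of a channel: Alternating, Synchronous, Reactive or Proactive."""
    A = "alternating"
    S = "synchronous"
    R = "asynchronous reactive"
    P = "asynchronous proactive"

@dataclass
class InteractionProfile:
    """Presence and synchrony of the input and output channels of a system."""
    has_inputs: bool
    has_outputs: bool
    input_sync: Optional[Synchrony] = None   # None if the system has no inputs
    output_sync: Optional[Synchrony] = None  # None if the system has no outputs

    def label(self) -> str:
        """Compose labels such as NINO, NIWO, WINO, AIAO or PIPO."""
        if not self.has_inputs and not self.has_outputs:
            return "NINO"
        if not self.has_inputs:
            return "NIWO"
        if not self.has_outputs:
            return "WINO"
        return f"{self.input_sync.name}I{self.output_sync.name}O"

# An episodic RL agent alternates observations and actions (AIAO), whereas a
# robot acting continuously in the real world is closer to PIPO.
episodic_rl = InteractionProfile(True, True, Synchrony.A, Synchrony.A)
real_world_robot = InteractionProfile(True, True, Synchrony.P, Synchrony.P)
print(episodic_rl.label(), real_world_robot.label())  # AIAO PIPO
```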
Types of computation

This is perhaps the characteristic that is best known in computer science, where a system can be Turing-complete or can be restricted to some other classes with limited expressiveness. There are countless hierarchies for different models of computation; the most famous is based on classes of automata. We will just describe three levels here:

1. Non Turing-complete: The interaction that the system presents to the environment is not Turing-complete. Many AI systems are not Turing-complete.

2. Turing-complete: The interaction allows the calculation of any possible effective function between inputs and outputs.

3. Other models of computation: This includes, for example, quantum computing, which in some instances may be faster than the traditional model, while in others may have probabilistic Turing power (Bernstein and Vazirani 1997).

Note that this is not about the programming language the system is implemented in (e.g., a very simple thermostat can be written in Java, which is Turing-complete), but about whether the system allows for a Turing-complete mapping between inputs and outputs, i.e., whether any computable function could ultimately be calculated on the environment using the system. Finally, a system can be originally Turing-complete, but can eventually lose this universality after some inputs or interactions (Barmpalias and Dowe 2012).

It is important to distinguish between function approximation and function identification. Many machine learning models (e.g., neural networks) are said to be able to approximate any computable function, but feedforward neural networks do not have loops or recursion, so technically they are not Turing-complete. Turing-completeness comes with the problem of termination, an important safety hazard in some situations, and a recurrent issue in software verification (D'silva, Kroening, and Weissenbacher 2008). For instance, an AI planner could enter an infinite loop trying to solve a problem, commanding ever-greater resources while doing so. On the other hand, one can limit the expressiveness of the language or bound the computations, but that would limit the tasks a system is able to undertake.
Types of integration

No system is fully isolated from the world. Interference may occur at all levels, from neutrinos penetrating the system to earthquakes shaking it. Here, we seek to identify all the elements that create a causal pathway from the outside world to the system, including its physical character, resources, location, and the degree of coupling with other systems.

• Resources: The most universal external resource is energy, which is why many critical systems are devised with internal generators or batteries, especially for the situations where the external source fails. In AI, other common dependencies include data, knowledge, software, hardware, human manipulation, computing resources, network, calendar time, etc. While some of these are often neglected when evaluating the performance of an AI system (Martínez-Plumed et al. 2018a), the analysis for safety must necessarily include all these dependencies. For instance, a system that requires external real-time information (e.g., a GPS location) may fail through loss of access to this resource.

• Social coupling: Sometimes it is hard to determine where a system starts and ends, due to the nature of its interaction with humans and other systems. The boundary of where human cognition ends and where it is assisted, extended or supported by AI (Ford et al. 2015) is blurred, as is the boundary between computations carried out within an AI system versus in the environment or by other agents, as illustrated by the phenomenon of human computation (Quinn and Bederson 2011).

• Distribution: Another way of looking at integration is in terms of distribution, which is also an important facet of analysis in AI (Martínez-Plumed et al. 2018b). Today, through the overall use of network connectivity and "the cloud", many systems are distributed in terms of hardware, software, data and compute. Under this trend, only systems embedded in critical and military applications are devised to be as self-contained as possible. Nevertheless, distribution and redundancy are also common ways of achieving robustness (Coulouris, Dollimore, and Kindberg 2011), most notably in information systems. For instance, swarm intelligence and swarm robotics are often claimed to be more robust (Bonabeau et al. 1999), at the cost of being less controllable than centralised systems.

Types of anticipation

In some areas of AI there is a distinction between model-based and model-free systems (Geffner 2018). Model-free systems choose actions according to some reinforced patterns or strengthened feature connections. Model-based systems evaluate actions according to some pre-existing or learned models and choose the action that gets the best results in the simulation. The line between model-based and model-free is subtle, but we can identify several levels:

1. Model-free: Despite having no model, these systems can achieve excellent performance. For instance, DQN can achieve high scores (Mnih 2015), but cannot anticipate whether an action can lead to a particular situation that is considered especially unsafe or dangerous; e.g., one in which the player is killed.

2. Model of the world: A system with a model of its environment can use planning to determine the effect of its own actions. For instance, without a model of physics, a system will hardly be able to tell whether it will break something, or to engage in "safe exploration" (Pecka and Svoboda 2014; Turchetta, Berkenkamp, and Krause 2016). This is especially critical during exploitation: are actions reversible or of low impact (Armstrong and Levinstein 2017)?

3. Model of the body: Some systems can have a good account of the environment but a limited understanding of their own physical actuators, potentially self-harming or harming others; for example, failing to simulate the effect of moving a heavy robotic arm in a given direction.

4. Social models, model of other agents: Seeing other agents as merely physical objects, or not modelling them at all, is very limiting in social situations. A naive theory of mind, including the beliefs, desires and intentions of other agents, can help anticipate what others will do, think or feel, and may be crucial for safe AI systems interacting with people and other agents, but may increase a system's capacity for deception or manipulation.

5. Model of one's mind: Finally, a system may be able to model other agents well, but may not be able to use this capability to model itself. When this meta-cognition is present, the system has knowledge about its own capabilities and limitations, which may be very helpful for safety in advanced systems, but may also lead to some degree of self-awareness. This may result, in some cases, in antisocial or suicidal behaviours.

The use of models may dramatically expand safety-relevant characteristics, e.g., by conferring the ability to simulate and evaluate scenarios through causal and counterfactual reasoning. This therefore represents an important set of considerations for future AI systems.
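The gap between level 1 (model-free) and level 2 (a model of the world) can be illustrated with a one-step lookahead filter: even a crude transition model allows actions predicted to reach unsafe states to be vetoed before execution. The following toy sketch is illustrative only; the value table, transition function and safety predicate are made up:

```python
ACTIONS = ["left", "right", "stay"]

def model_free_policy(state, q_values):
    """Pick the action with the highest learned value; no anticipation."""
    return max(ACTIONS, key=lambda a: q_values.get((state, a), 0.0))

def model_based_policy(state, q_values, transition, is_unsafe):
    """Same preference ordering, but veto actions whose predicted
    successor state is unsafe according to the (possibly imperfect) model."""
    candidates = [a for a in ACTIONS if not is_unsafe(transition(state, a))]
    if not candidates:          # nothing predicted safe: fall back to inaction
        return "stay"
    return max(candidates, key=lambda a: q_values.get((state, a), 0.0))

# Toy example: state 3 is a cliff; the learned values alone would walk into it.
q = {(2, "right"): 1.0, (2, "left"): 0.2, (2, "stay"): 0.1}
step = lambda s, a: s + {"left": -1, "right": 1, "stay": 0}[a]
unsafe = lambda s: s >= 3

print(model_free_policy(2, q))                 # right (falls off the cliff)
print(model_based_policy(2, q, step, unsafe))  # left (vetoes the unsafe move)
```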
Types of supervision

Supervision is a way of checking and correcting the behaviour of a system through observation or interaction, and hence it is crucial for safety. Supervision can be in the form of corrected values for predictive models such as classification or regression, but it can also be partial (the answer is wrong, but the right answer is not given). Supervision can also be much more subtle than this. For instance, a diagnosis assistant that suggests a possible diagnosis to a doctor can be designed to get no feedback once deployed. However, some kinds of feedback can still reach the system in terms of the distribution or frequency of tasks (questions), or through the way the tasks are posed to the system.

Consequently there are several degrees and qualities of supervision, and this may depend on the system. For instance, in classification, one can have data for all examples or just for a few (known as semi-supervised learning). In reinforcement learning, one can have sparse versus dense reward. In general, supervision can come in many different ways, according to some criteria:

• Completeness: Supervision can be very partial (signalling incorrectness), more informative (showing the correct way) or complete (showing all positive and negative ways of behaving in the environment).

• Procedurality: Beyond what is right and wrong, feedback can be limited to the result or can show the whole process, as in the case of learning by demonstration.

• Density: Supervision can be sparse or dense. Of course the denser the better (but more expensive), and the less autonomous the system is considered.

• Adaptiveness: Supervision can be 'intelligent' as well, as happens in machine teaching situations, where examples or interactions are chosen such that the system reaches the desired behaviour as soon as possible.

• Responsiveness: In areas such as query learning or active learning, the system can ask questions or undertake experiments at any time. The results can come in real time, or may have a delay or be given in batches.

For many systems, supervision can have a dedicated channel (e.g., rewards in RL), but for others it can be performed by modification of the environment (e.g., moving objects or smiling), even to the extent that the system is unaware these changes have a guiding purpose (e.g., clues).
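As a small illustration of the density criterion above, the same trajectory can be supervised with a single terminal signal (sparse) or with a shaped signal at every step (dense). The numbers below are arbitrary toy values:

```python
def sparse_reward(trajectory, goal):
    """One signal at the end: 1 if the goal was reached, 0 otherwise."""
    return [0.0] * (len(trajectory) - 1) + [1.0 if trajectory[-1] == goal else 0.0]

def dense_reward(trajectory, goal):
    """A shaped signal at every step: progress towards the goal position."""
    return [-abs(goal - s) for s in trajectory]

path = [0, 1, 2, 3, 4]               # states visited on the way to the goal
print(sparse_reward(path, goal=4))   # [0.0, 0.0, 0.0, 0.0, 1.0]
print(dense_reward(path, goal=4))    # [-4, -3, -2, -1, 0]
```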
Types of modification

Some of the most recurrent issues in AI safety – including many covered in the section about known AI safety characteristics – are related to ways in which the system can be modified. This includes issues such as wire-heading or algorithmic self-improvement. Here, in the first place, we have to distinguish between whether the system can be modified by the environment, or by the system itself. Modifications by the environment can be intentional (and hence related to supervision), but they can also be unintentional (code corruption from external sources). Even a system whose core code cannot be modified by an external source may be affected in state or code by regular inputs, physical equipment and other parts. So it is better to explore different ways and degrees to which a system can be modified externally:

• Interruptible: The system has a switch-off command, or a modification option that can switch it off.

• Parametric modification: Many systems are regulated or calibrated with parameters or weights. When these parameters have a clear relation to the behaviour of a system (e.g., an intelligent thermostat), this can be an effective, bounded and simple way of modifying the system.

• Algorithmic modification: This can include new functionalities, bug fixes, updates, etc. Many software issues are caused, and are magnified, by these interventions. Modifications can be limited in expressiveness, such as only allowing rule deletion.

• Resource modification: Even if the parameters or code are not modified, the resources of the system and other dependencies previously mentioned can be limited externally, e.g., the computational resources.
On the other hand, systems can modify themselves (internally). There are many varieties here too:

1. No self-modification, no memory: The system has no memory, and works as if it were reset for any new input or interaction. Many functional systems (mapping inputs to outputs) are of this kind. Note, however, that the environment does have memory, so some systems, such as a vision system or a non-cognitive robot, can be affected by the past and become a truly cognitive system.

2. Partially self-modifying: The algorithms in the learner or solver cannot be modified, but its data or knowledge (in the form of learned weights or rules) can be modified by a general algorithm, which is fixed. Many learning systems are of this kind, if the system has both a learning algorithm and one or more learned models.

3. Totally self-modifying: The system can modify any part of its code. Not many operational systems have these abilities, as they become very unstable. However, some types of evolutionary computation may have this possibility, if evolution can also be applied to the rules of the evolution.

Finally, all these categories can be selected for different periods of time. For instance, it is common to separate between training, test/validation and deployment. For training, a high degree of self-modification (and hence adaptation) is well accepted, but this is usually constrained for validation and deployment. Note that these stages apply to both external and internal sources of modification. One important danger is that a well-validated system may be subject to some late external or internal modification just before deployment. In this case, all the validation effort may become void [2].

[2] OpenAI Dota is an example: https://blog.openai.com/the-international-2018-results/, https://www.theregister.co.uk/2018/08/24/openai_bots_eliminated_dota_2/

One of the major modern concerns in AI safety is that it will be desirable for some systems to learn during deployment, in order for them to be adaptive [3]. For instance, many personal assistants are learning from our actions continually. While this may introduce many risks for more powerful systems, forbidding learning outside the lab would make many potential applications of AI impossible. However, adaptive systems are full of engineering problems; some must even have a limited life, as after self-modification and adaptation they may end up malfunctioning and have to be reset or have their 'caches' erased. This problem has long been of interest in engineering (Fickas and Feather 1995).

[3] Nature has found many ways of regulating self-modification. Many animals have a higher degree of plasticity at birth, becoming more conservative and rigid in older stages (Gopnik et al. 2017). One key question about cognition is whether this is a contingent or necessary process, and whether it is influenced by safety issues.

Types of motivation

Systems can follow a set of rules or aim at optimising a utility function. Most systems are actually hybrid, as it is difficult to establish a crisp line between procedural algorithms and optimisation algorithms. Through layers of abstraction in these processes, we ultimately get the impression that a system is more or less autonomous. If the system is apparently pursuing a goal, what are the drivers that make a system prefer or follow some behaviours over others? These behaviours may be based on some kind of internal representation of a goal, as we discussed when dealing with anticipation, or on a metric of how close the system is to the goal. The system can then follow an optimisation process that tries to maximise some of these quality functions.

Quality or utility functions usually map inputs and outputs into some values that are re-evaluated periodically or after certain events. Examples of these functions are accuracy, aggregated rewards, or some kind of empowerment or other type of intrinsic motivation (Klyubin, Polani, and Nehaniv 2005; Jung, Polani, and Stone 2011). The same system might have several quality functions that can be opposed, so trade-offs have to be chosen. The general notion of rationality in decision-making is related to these motivations.

But what are the characteristics of the goals an AI system can have in the first place? We outline several dimensions:

• Goal variability: Are goals hard-coded or do they change with time? Do they change autonomously or through instruction? Who can change the goals and how? For instance, what orders can a digital assistant take and from whom?

• Goal scrutability: Are the (sub)goals identifiable and ultimately explainable? Do they come from rules or are they inferred from data, e.g., error in classification or observing humans in inverse reinforcement learning?

• Goal rationality: Are the goals amenable to treatment within a rational choice framework? If several goals are set, are they consistent? If not, how does the system resolve inconsistencies or set new goals?

Note that this is closely related to the types of modification, as changing or resolving goals may require self-modification and/or external modification.

A second question is how these goals are followed by the system. There are at least three possible dimensions here:

• Immediateness: The system may maximise the function for the present time or in the limit, or something in between. Many schemata of discounted rewards in reinforcement learning are used as trade-offs between short-term and long-term maximisation.

• Selfishness: Focusing on individual optima might involve very bad collective results (for other agents) or even results that could be worse individually (the tragedy of the commons). Game theory provides many examples of this. In multi-agent RL systems, rewards can depend on the well-being of other agents, or empathy can be introduced.

• Conscientiousness: The system may be fully committed to maximising the goal, or some random or exploratory actions may be allowed, even if they deviate occasionally from the goal. When this is done on purpose, it is usually intended to provide robustness or to avoid local minima, but these deviations can take the system to dangerous areas.
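The immediateness trade-off is typically implemented through a discount factor: the same reward stream looks very different to a myopic and to a far-sighted maximiser. A small numerical illustration, with values chosen arbitrarily:

```python
def discounted_return(rewards, gamma):
    """Sum of rewards weighted by gamma**t (gamma near 0: myopic; near 1: far-sighted)."""
    return sum(r * gamma**t for t, r in enumerate(rewards))

# A risky shortcut pays off now; a cautious route pays off later.
shortcut = [10, 0, 0, 0, -50]     # immediate gain, delayed harm
cautious = [0, 0, 2, 3, 20]       # slow but ultimately better

for gamma in (0.3, 0.99):
    print(gamma,
          round(discounted_return(shortcut, gamma), 2),
          round(discounted_return(cautious, gamma), 2))
# gamma = 0.3 prefers the shortcut; gamma = 0.99 prefers the cautious route.
```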
Modulating optimisation functions to be convex with a non-asymptotic maximum, beyond which further effort is futile, may be a sensible thing, as it provides a stop condition by definition. A self-imposed cap can always be shifted if everything is under control once the limit is reached.

Note that the kind of interaction seen before is key for the internal quality metric or goal. For instance, in asynchronous RL, "the time can be intentionally modulated by the agent" to get higher rewards without really performing better (Hernández-Orallo 2010). And, of course, a common problem for motivation is reward hacking.

Types of achievement

Ultimately, an AI system is conceived to achieve a task, independently of how well motivated the system is for it. Consequently, the external degree of achievement must be distinguished from the motivation or quality metric the system uses to function, as discussed in the previous subsection. The misalignment between the internal goal of the system and the task specification is the cause of many safety issues in AI, unlike formal methods in software engineering, where requirements are converted into correct code.

Focusing on the task specification, we must first recognise that different actors may have different interests. A cognitive assistant, for instance, may be understood by the user as being very helpful, making life easier. However, for the company selling the cognitive assistant, the task is ultimately to produce revenue with the product. Both requirements are not always compatible, and this may affect the definition of the goals of the system, as some of the aims may not be coded or motivated in a transparent way, but are usually incorporated in indirect ways. Second, even if the requirements include all possible internalities (what the system has to do), there are also many externalities and footprints (Martínez-Plumed et al. 2018a) (including the infinitely many things that the system should not do) that affect how positive or negative its overall effect is. Regarding these two issues, task specification can vary in precision and objectivity:
• Task precision: The evaluation metric to determine the success of an agent can be formal or not. For instance, the accuracy of a classifier or the squared error of a regression model are precisely defined metrics. However, in many other cases, we have a utility function that depends on variables that are usually imprecise or uncertain, such as the quality of a smart vacuum cleaner.

• Task objectivity: A metric can be objective or subjective. We tend to associate precise metrics with objectiveness and imprecise metrics with subjectivity, but subjectivity simply means that the evaluation changes depending on the subject. For instance, the quality of a spam filter (a precisely-evaluated classifier) changes depending on the cost matrices of different users, whereas the quality of a smart vacuum cleaner based on fuzzy variables such as cleanliness or disruption can be weighted by a fixed formula.

Some of the tasks or targets that are most commonly advocated in the ethics and safety of AI literature are often very imprecise and subjective, such as "well-being", "social good", "beneficial AI", "alignment", etc. Note that the problem is not related to the goals of the system (an inverse reinforcement learning system can successfully identify the different wills of a group of people), but rather to whether the task is ultimately achieved, or the well-being or happiness of the user. Determining this is controversial, even when analysed in a scientific way (Alexandrova 2017).

An overemphasis on tracking metrics (Goodhart's law) is sometimes blamed, but the alternative is not usually better. Some safety problems are not created by an overemphasis on a metric (Manheim and Garrabrant 2018), but ultimately by a metric that is too narrow or shortsighted, and does not adequately capture progress towards the goal.

In all these cases, we have to distinguish whether the metric relates to (i) the internal goals that the system should have, (ii) the external evaluation of task performance, or (iii) our ultimate desires and objective [4]. Motivations, achievement and supervision are closely related, but may be different. For a maze, e.g., the goal for the AI system may be to get out of the maze as soon as possible, but a competition could be based on minimising the cells that are stepped on more than once, and supervision may include indications of direction to the shortest route to the exit. These are three different criteria which may be well or poorly aligned.

[4] Ortega and Maini (2018) distinguish between "ideal specification (the 'wishes')" and "design specification", which must be compared with the revealed specification (the "behaviour"). The design specification fails to distinguish the external metric from the internal goal.
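The maze example above can be made concrete: the same trajectory scores differently under the system's internal goal (reach the exit in few steps) and under the external competition metric (few cells stepped on more than once), so optimising one need not optimise the other. The trajectories and scores below are toy values:

```python
def steps_to_exit(path):
    """Internal goal: fewer steps is better."""
    return len(path) - 1

def revisited_cells(path):
    """External competition metric: fewer cells stepped on more than once is better."""
    return sum(1 for cell in set(path) if path.count(cell) > 1)

# Two candidate trajectories through a maze, as lists of visited cells.
fast_but_messy = ["A", "B", "A", "B", "C", "EXIT"]        # backtracks, but short
slow_but_clean = ["A", "D", "E", "F", "C", "G", "EXIT"]   # longer, never revisits

for name, path in [("fast_but_messy", fast_but_messy), ("slow_but_clean", slow_but_clean)]:
    print(name, steps_to_exit(path), revisited_cells(path))
# fast_but_messy wins on steps (5 vs 6) but loses on revisited cells (2 vs 0).
```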
Even more comprehensively – and related to the concept of persistence – a system may be analysed for a range of tasks, under different replicability situations:

1. Disposable system (single task, single use): The system is used for one task that only takes place once.

2. Repetitive system (single task, several uses): The system must solve many instances of the same specific task.

3. Menu system (multitask): The system must solve different tasks, under a fixed repertoire of tasks.

4. General system (multitask): The system must solve different tasks, without a fixed repertoire.

5. Incremental system: The system must solve a sequence of tasks, with some dependencies between them.

Any metric examining the benefits and possible risks of a system must take the factors described above into account.

Conclusion

Many accounts of AI safety focus on "either RL agents or supervised learning systems", assuming "similar issues are likely to arise for other kinds of AI systems" (Amodei et al. 2016). This paper has surveyed a wide range of characteristics of AI systems, so that future research can map AI safety challenges against AI research paradigms in more precise ways, in order to ascertain whether particular safety challenges manifest similarly in different paradigms. This aims to address an increasing concern that the current dominant paradigm for a large proportion of AI safety research may be too narrow: discrete-time RL systems with train/test regimes, assuming gradient-based learning on a parametric space, with a utility function that the system must optimise (Gauthier 2018; Krakovna 2018).

Taxonomies of potentially safety-relevant characteristics of AI systems, as introduced in this paper, are intended to provide a good complement to recent work on taxonomies of technical AI safety problems. For instance, Ortega and Maini (2018) present three main areas: specification, ensuring that an AI system's behaviour aligns with the operator's true intentions; robustness, ensuring that an AI system continues to operate within safe limits upon perturbation; and assurance, ensuring that we understand and control AI systems during operation. Almost all characteristics outlined in this paper have a role to play for specification, robustness and assurance.

Taxonomies are rarely definitive, and the characterisation presented here does not consider in full some quantitative features such as performance, autonomy and generality. A proper evaluation of how the kind and degree of intelligence can affect safety issues is also an important area of analysis, both theoretically (Hernández-Orallo 2017) and experimentally (Leike et al. 2017). AI research has explored different paradigms in the past, and will continue to do so in the future. Along the way, many different system characteristics and design choices have been presented to developers. We can expect even more to be developed as AI research progresses. Consequently, the area of AI safety must acquire more structure and richness in how AI is characterised and analysed, to provide tailored guidance for different contexts, architectures and domains. There is a potential risk in over-relying on our best current theories of AI when considering AI safety. Instead, we aim to encourage a diverse set of perspectives, in order to anticipate and mitigate as many safety concerns as possible.

Acknowledgments

FMP and JHO were supported by the EU (FEDER) and the Spanish MINECO under grant TIN 2015-69175-C4-1-R, by Generalitat Valenciana (GVA) under grant PROMETEOII/2015/013, and by the U.S. Air Force Office of Scientific Research under award number FA9550-17-1-0287. FMP was also supported by INCIBE (Ayudas para la excelencia de los equipos de investigación avanzada en ciberseguridad), the European Commission, JRC's Centre for Advanced Studies, HUMAINT project (Expert Contract CT-EX2018D335821-101), and UPV PAID-06-18 Ref. SP20180210. JHO was supported by a Salvador de Madariaga grant (PRX17/00467) from the Spanish MECD for a research stay at the Leverhulme Centre for the Future of Intelligence (CFI), Cambridge, and a BEST grant (BEST/2017/045) from GVA for another research stay also at the CFI. JHO and SOH were supported by the Future of Life Institute (FLI) grant RFP2-152. SOH was also supported by the Leverhulme Trust Research Centre Grant RC-2015-067 awarded to the Leverhulme Centre for the Future of Intelligence, and a grant from Templeton World Charity Foundation.
References

[Alexandrova 2017] Alexandrova, A. 2017. A Philosophy for the Science of Well-being. Oxford University Press.

[Amodei et al. 2016] Amodei, D.; Olah, C.; Steinhardt, J.; Christiano, P.; Schulman, J.; and Mané, D. 2016. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565.

[Armstrong and Levinstein 2017] Armstrong, S., and Levinstein, B. 2017. Low impact artificial intelligences. arXiv preprint arXiv:1705.10720.

[Armstrong 2017] Armstrong, S. 2017. Good and safe uses of AI oracles. arXiv preprint arXiv:1711.05541.

[Babcock, Kramár, and Yampolskiy 2016] Babcock, J.; Kramár, J.; and Yampolskiy, R. 2016. The AGI containment problem. In AGI Conf. Springer. 53–63.

[Barmpalias and Dowe 2012] Barmpalias, G., and Dowe, D. L. 2012. Universality probability of a prefix-free machine. Phil. Trans. R. Soc. A 370(1971):3488–3511.

[Bernstein and Vazirani 1997] Bernstein, E., and Vazirani, U. 1997. Quantum complexity theory. SIAM Journal on Computing 26(5):1411–1473.

[Bonabeau et al. 1999] Bonabeau, E.; Dorigo, M.; and Théraulaz, G. 1999. Swarm Intelligence: From Natural to Artificial Systems. Oxford University Press.

[Bostrom 2014] Bostrom, N. 2014. Superintelligence: Paths, Dangers, Strategies. Oxford University Press.

[Brundage et al. 2018] Brundage, M.; Avin, S.; Clark, J.; Toner, H.; Eckersley, P.; Garfinkel, B.; Dafoe, A.; Scharre, P.; Zeitzoff, T.; Filar, B.; Anderson, H.; Roff, H.; Allen, G. C.; Steinhardt, J.; Flynn, C.; Ó hÉigeartaigh, S.; Beard, S.; Belfield, H.; Farquhar, S.; Lyle, C.; Crootof, R.; Evans, O.; Page, M.; Bryson, J.; Yampolskiy, R.; and Amodei, D. 2018. The malicious use of artificial intelligence: Forecasting, prevention, and mitigation. arXiv preprint arXiv:1802.07228.

[Coulouris, Dollimore, and Kindberg 2011] Coulouris, G. F.; Dollimore, J.; and Kindberg, T. 2011. Distributed Systems: Concepts and Design. Fifth edition, Pearson.

[D'silva, Kroening, and Weissenbacher 2008] D'silva, V.; Kroening, D.; and Weissenbacher, G. 2008. A survey of automated techniques for formal software verification. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 27(7):1165–1178.

[Everitt and Hutter 2018] Everitt, T., and Hutter, M. 2018. The alignment problem for Bayesian history-based reinforcement learners. http://www.tomeveritt.se/papers/alignment.pdf.

[Everitt, Lea, and Hutter 2018] Everitt, T.; Lea, G.; and Hutter, M. 2018. AGI safety literature review. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18). arXiv preprint arXiv:1805.01109.

[Fickas and Feather 1995] Fickas, S., and Feather, M. S. 1995. Requirements monitoring in dynamic environments. In IEEE Intl Symposium on Requirements Engineering, 140–147.

[Ford et al. 2015] Ford, K. M.; Hayes, P. J.; Glymour, C.; and Allen, J. 2015. Cognitive orthoses: toward human-centered AI. AI Magazine 36(4):5–8.

[Garrabrant and Demski 2018] Garrabrant, S., and Demski, A. 2018. Embedded agency. AI Alignment Forum.

[Gasparik, Gamble, and Gao 2018] Gasparik, A.; Gamble, C.; and Gao, J. 2018. Safety-first AI for autonomous data centre cooling and industrial control. DeepMind Blog.

[Gauthier 2018] Gauthier, J. 2018. Conceptual issues in AI safety: the paradigmatic gap. http://www.foldl.me/2018/conceptual-issues-ai-safety-paradigmatic-gap/.

[Geffner 2018] Geffner, H. 2018. Model-free, model-based, and general intelligence. arXiv preprint arXiv:1806.02308.

[Gopnik et al. 2017] Gopnik, A.; O'Grady, S.; Lucas, C. G.; Griffiths, T. L.; Wente, A.; Bridgers, S.; Aboody, R.; Fung, H.; and Dahl, R. E. 2017. Changes in cognitive flexibility and hypothesis search across human life history from childhood to adolescence to adulthood. PNAS 114(30):7892–7899.

[Hernández-Orallo 2010] Hernández-Orallo, J. 2010. On evaluating agent performance in a fixed period of time. In Artificial General Intelligence, 3rd Intl Conf, ed. M. Hutter et al., 25–30.

[Hernández-Orallo 2017] Hernández-Orallo, J. 2017. The Measure of All Minds: Evaluating Natural and Artificial Intelligence. Cambridge University Press.

[Jung, Polani, and Stone 2011] Jung, T.; Polani, D.; and Stone, P. 2011. Empowerment for continuous agent–environment systems. Adaptive Behavior 19(1):16–39.

[Klyubin, Polani, and Nehaniv 2005] Klyubin, A. S.; Polani, D.; and Nehaniv, C. L. 2005. All else being equal be empowered. In European Conference on Artificial Life, 744–753.

[Krakovna 2018] Krakovna, V. 2018. Discussion on the machine learning approach to AI safety. http://vkrakovna.wordpress.com/2018/11/01/discussion-on-the-machine-learning-approach-to-ai-safety/.

[Leike et al. 2017] Leike, J.; Martic, M.; Krakovna, V.; Ortega, P. A.; Everitt, T.; Lefrancq, A.; Orseau, L.; and Legg, S. 2017. AI safety gridworlds. arXiv preprint arXiv:1711.09883.

[Manheim and Garrabrant 2018] Manheim, D., and Garrabrant, S. 2018. Categorizing variants of Goodhart's law. arXiv preprint arXiv:1803.04585.
[Martínez-Plumed et al. 2018a] Martínez-Plumed, F.; Avin, S.; Brundage, M.; Dafoe, A.; Ó hÉigeartaigh, S.; and Hernández-Orallo, J. 2018a. Accounting for the neglected dimensions of AI progress. arXiv preprint arXiv:1806.00610.

[Martínez-Plumed et al. 2018b] Martínez-Plumed, F.; Loe, B. S.; Flach, P.; Ó hÉigeartaigh, S.; Vold, K.; and Hernández-Orallo, J. 2018b. The facets of artificial intelligence: A framework to track the evolution of AI. IJCAI.

[Mnih 2015] Mnih, V., et al. 2015. Human-level control through deep reinforcement learning. Nature 518:529–533.

[Omohundro 2008] Omohundro, S. M. 2008. The basic AI drives. Artificial General Intelligence 171:483–493.

[Ortega and Maini 2018] Ortega, P. A., and Maini, V. 2018. Building safe artificial intelligence: specification, robustness, and assurance. https://medium.com/@deepmindsafetyresearch/building-safe-artificial-intelligence-52f5f75058f1.

[Pecka and Svoboda 2014] Pecka, M., and Svoboda, T. 2014. Safe exploration techniques for reinforcement learning – an overview. In International Workshop on Modelling and Simulation for Autonomous Systems, 357–375. Springer.

[Quinn and Bederson 2011] Quinn, A. J., and Bederson, B. B. 2011. Human computation: a survey and taxonomy of a growing field. In SIGCHI Conf. on Human Factors in Computing Systems, 1403–1412. ACM.

[Soares et al. 2015] Soares, N.; Fallenstein, B.; Armstrong, S.; and Yudkowsky, E. 2015. Corrigibility. In Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence.

[Turchetta, Berkenkamp, and Krause 2016] Turchetta, M.; Berkenkamp, F.; and Krause, A. 2016. Safe exploration in finite Markov decision processes with Gaussian processes. In NIPS, 4312–4320.

[Yampolskiy 2012] Yampolskiy, R. 2012. Leakproofing the singularity: artificial intelligence confinement problem. Journal of Consciousness Studies 19(1-2):194–214.

[Yampolskiy 2016] Yampolskiy, R. V. 2016. Taxonomy of pathways to dangerous artificial intelligence. In AAAI Workshop: AI, Ethics, and Society.