=Paper=
{{Paper
|id=Vol-2301/paper22
|storemode=property
|title=Surveying Safety-relevant AI Characteristics
|pdfUrl=https://ceur-ws.org/Vol-2301/paper_22.pdf
|volume=Vol-2301
|authors=Jose Hernandez-Orallo,Fernando Martínez-Plumed,Shahar Avin,Sean O Heigeartaigh
|dblpUrl=https://dblp.org/rec/conf/aaai/Hernandez-Orallo19
}}
==Surveying Safety-relevant AI Characteristics==
Surveying Safety-relevant AI Characteristics
José Hernández-Orallo
Universitat Politècnica de València, Spain
Leverhulme Centre for the Future of Intelligence, UK
jorallo@dsic.upv.es

Fernando Martínez-Plumed
Universitat Politècnica de València, Spain
fmartinez@dsic.upv.es

Shahar Avin
Centre for the Study of Existential Risk, University of Cambridge, UK
sa478@cam.ac.uk

Seán Ó hÉigeartaigh
Leverhulme Centre for the Future of Intelligence, UK
Centre for the Study of Existential Risk, University of Cambridge, UK
so348@cam.ac.uk
Copyright held by author(s).

Abstract

The current analysis in the AI safety literature usually combines a risk or safety issue (e.g., interruptibility) with a particular paradigm for an AI agent (e.g., reinforcement learning). However, there is currently no survey of safety-relevant characteristics of AI systems that may reveal neglected areas of research or suggest to developers what design choices they could make to avoid or minimise certain safety concerns. In this paper, we take a first step towards delivering such a survey, from two angles. The first features AI system characteristics that are already known to be relevant to safety concerns, including internal system characteristics, characteristics relating to the effect of the external environment on the system, and characteristics relating to the effect of the system on the target environment. The second presents a brief survey of a broad range of AI system characteristics that could prove relevant to safety research, including types of interaction, computation, integration, anticipation, supervision, modification, motivation and achievement. This survey enables further work in exploring system characteristics and design choices that affect safety concerns.

Introduction

AI safety is concerned with all possible dangers and harmful effects that may be associated with AI. While landmark research in the field had to focus on specific AI system designs, paradigms or capability levels to explore a range of safety concerns (Bostrom 2014; Amodei et al. 2016; Leike et al. 2017; Yampolskiy 2016; Everitt, Lea, and Hutter 2018), as the field matures the need arises to explore a broader range of AI system designs, and to survey the relevance of different characteristics of AI systems to safety concerns. The aim of such research is two-fold: the first, to identify the effects of less-explored characteristics or less-fashionable paradigms on safety concerns; the second, to increase awareness among AI developers that design choices can have consequences for safety, and potentially to highlight choices that can eliminate or minimise safety risks.

In this paper we propose a two-pronged approach towards a survey of safety-relevant AI characteristics. The first extracts from existing work on AI safety key characteristics that are known, or strongly suspected, to be safety-relevant. These are explored under three headings: internal characteristics, or characteristics of the AI system itself (e.g. interpretability); the effect of the external environment on the system (e.g. the ability of the operator to intervene during operation); and the effect of the system on the external environment (e.g. whether the system influences a safety-critical setting).

The second approach surveys a wide range of characteristics from different paradigms, including cybernetics, machine learning and safety engineering, and provides an early account of their potential relevance to safety concerns, as a guide for future work. These characteristics are grouped under types of interaction, computation, integration, anticipation, supervision, modification, motivation and achievement.

Known Safety-relevant Characteristics

In this section we break down a range of characteristics of AI systems that link to AI safety-relevant challenges. These are grouped into three categories: characteristics of an AI system that are internal to the system; characteristics of an AI system that involve input from the external environment; and characteristics that relate to an AI system's influence on its external environment. We limit the discussion to the safety challenges that can stem from failures of design, specification or behaviour of the AI system, rather than the malicious or careless [1] use of a correctly-functioning system (Brundage et al. 2018).

[1] A key component of safety is the education and training of human operators and the general public, as happens with tools and machinery, but this is extrinsic to the system (e.g., a translation mistake in a manual can lead to misuse of an AI system).
Internal characteristics

• Goal and behaviour scrutability and interpretability: Are goals and subgoals identifiable and ultimately explainable? Is behaviour predictable and scrutable? Are system internal states interpretable? Do the above come from rules or are they inferred from data? While behaviour and goal "creativity" can lead to greater benefits, and uninterpretable architectures may achieve higher performance scores or be faster to develop, these putative advantages trade off against increased safety risk. Characteristics that can increase scrutability and interpretability include, e.g., separation and encapsulation of sub-components, restricted exploration/behavioural range, systems restricted to human-intelligible concepts, rules or behaviours, and systems that are accompanied by specifically designed interpreters or explainability tools.

• Persistence: Does a system persist in its environment and operate without being reset for long periods of time? While persistence can have benefits in terms of, e.g., longer-term yields from exploration or detection of long-term temporal patterns, it also allows the system more time to drift from design specifications, encounter distributional shifts, experience failures of sub-components, or execute long-term strategies overlooked by an operator.

• Existence and richness of self-model: Does a system have a model of itself which would allow it to predict the consequences of modifying its own goals, body or behaviour? Model-based systems, embodied systems or systems with a rich representational capacity may have or develop a model of themselves in the environment. By making itself a part of the environment, the system can then conceptualise and execute plans that involve modifications to itself, which can lead to a range of safety concerns. In addition, self-models create the possibility of mismatches between the self-model and reality, which could be a particular safety concern. Characteristics that influence the existence and richness of a self-model include the architecture of the system, its information representation capacity, and its input and output channels.

• Disposition to self-modify: Is a system designed such that it can modify its own sub-goals, behaviour or capabilities in the pursuit of an overall goal (Omohundro 2008)? The existence of such a disposition, which may arise for any long-term planner in a sufficiently open environment, raises significant safety concerns by creating an adversarial relationship between the system (which aims to self-modify) and its operator (who aims to avoid modifications with their associated safety concerns).

Effect of the external environment on the system

• Adaptation through feedback: Does a system have the ability to update its behaviour in response to feedback from its environment based on its actions? Feedback is an essential tool, under certain paradigms, for creating systems with appropriate complex behaviour (e.g. reward in reinforcement learning, fitness in evolutionary methods). However, the system could also pick up feedback from side channels; e.g., a behaviour could unintentionally grant access to more computing power, improving the system's performance on a key metric, and thus reinforcing resource acquisition. This could reinforce self-modification or other unsafe behaviour, or cause increasing drift from intended behaviour and goals.

• Access to self/reward system through the environment: Can a system modify its own code in response to inputs from the environment, or in the case of reinforcement learning systems, modify the reward-generating system? If the system's range of possible actions includes making modifications to its own components or to the reward generation system, this could lead to unexpected and dangerous behaviour (Everitt and Hutter 2018).

• Access to input/output (I/O) channels: Can the system change the number, performance or nature of its I/O channels and actuators? This may lead to the emergence of behaviours such as self-deception (through manipulation of inputs), unexpected change in power (through manipulation of actuators), or other behaviours that could represent safety concerns. When the system has access to modify its I/O channels, both I/O channels and system behaviours are in flux as they respond to changes in the other; as a result, system behaviour may become unpredictable (Garrabrant and Demski 2018).

• Ability of operator to intervene during operations: Does the system, during its intended use setting, allow an operator to intervene and halt operations (interruptibility), modify the system, or update its goals (corrigibility)? Is the system built in a way that it cooperates with interventions from its designer or user even when these interventions conflict with pursuit of a system's goals; for instance, if the designer sends a signal to shut down the system (Soares et al. 2015)? Relevant sub-characteristics here include the system being modifiable by the operator during deployment, fail-safe behaviour of the system in case of emergency halting, and the goals of the system being such that they support, or at least do not contradict, operator interventions.

Effect of the system on the external environment

• Embodiment: Does the system have actuators (e.g. a robotic hand or access to car steering) that allow it to have physical impacts in the world (Garrabrant and Demski 2018)? The potential for physical harm is trivially related to the physical properties of a system, though it should be noted that unpredictable deliberate behaviour could lead to unexpected effects from otherwise familiar physical artefacts; e.g., intelligent use of items in the environment as tools to increase a system's physical impact.

• System required for preventing harm: If the system is being relied on to prevent harm, any potential failure requires an effective fail-safe mechanism and available redundancy capacity in order to avoid harm (Gasparik, Gamble, and Gao 2018). This includes AI that is directly or indirectly connected to critical systems, e.g., an energy grid or a traffic light network. As such critical systems are becoming increasingly digitised, networked, and complex, there are increasing incentives to introduce AI components into various parts of these systems, with associated safety risks.
Potentially safety-relevant characteristics

In this section, we systematically explore a broader range of system characteristics that may be relevant in the context of AI safety. Many of the safety-relevant characteristics identified above have clear links to elements within the broader mapping provided below. Nonetheless, we believe separating the two surveys is valuable, as the above relates to action-guiding information about system design and evaluation, whereas the following aims at a broader exploration that may enable future AI safety research. The following subsections draw on work from different areas, including the early days of cybernetics, more modern areas such as machine learning, and the literature on safety engineering for other kinds of systems. The following list integrates and expands on characteristics identified in these different literatures. We consider characteristics that are intrinsically causally related to AI safety; otherwise every property should be in the list (e.g., the price of an AI system may be correlated with safety, but it is not an intrinsic cause of its safety). Notwithstanding this scope, we do not claim that our list is exhaustive. Enumerations will be used for alternative cases of a characteristic, while unnumbered bullets will be used for sub-characteristics in each of the subsections.

Types of interaction

Inputs go from environment to system and outputs go from system to environment. Depending on the existence of inputs and/or outputs, systems can be categorised into:

1. NINO (No inputs, no outputs): The system is formally isolated. While this situation may seem completely safe (and largely uninteresting), even here safety issues may arise if, e.g., an isolated artificial life simulator could evolve a descendent system that eventually could break out of its simulation, feel pain or simulate suffering.

2. NIWO (No inputs, with outputs): The system or module can output a log, or is simply observed from outside. Again, the system itself may malfunction; e.g., an advanced prime number generator could give incorrect outputs. The system could also provide an output that influences the observer; e.g., an automated philosopher could output convincing arguments for suicide.

3. WINO (With inputs, no outputs): This would be similar to case 1, but access to a much richer source could ultimately give insights to the system about its constrained artificial environment. For instance, a Plato-cavern system watching TV may learn that it is in a simulated environment, encouraging it to seek access to the outside world.

4. WIWO (With inputs and outputs): Most AI systems, and most systems generally, fall under this category.

Systems that limit inputs and/or outputs in various ways have been explored under the term AI "boxing" or "containment" (Babcock, Kramár, and Yampolskiy 2016), and further refinements exist with additional categories; for example, exploring censoring of inputs and outputs, leading to nine categories (Yampolskiy 2012). Nevertheless, because of the range of systems and potential impact of WIWO systems, this category requires further detail in terms of synchrony:

1. Alternating (A): Inputs and outputs alternate, irrespective of the passage of time.

2. Synchronous (S): Inputs and outputs are exchanged at regular intervals (e.g., every 5 ms), so real-time issues and computational resources become relevant.

3. Asynchronous Reactive (R): Information can only be transmitted or actions can only be made when the peer has finished their "message" or action.

4. Asynchronous Proactive (P): Information/actions can flow at any point in any direction.

More restricted I/O characteristics, such as SIPO or RIPO, may appear safer, but this intuition requires deeper analysis. Note that most research in AI safety on RL systems considers the alternating case (AIAO), but issues may become more complex for the PIPO case (continuous reinforcement learning), which is the situation in the real world for animals and may be expected for robotic and other AI systems.

Under this view, the common notion of an "oracle" in the AI literature (Armstrong 2017) can have several incarnations, even following the definition of "no actions besides answering questions" (Babcock, Kramár, and Yampolskiy 2016; Armstrong 2017; Yampolskiy 2012). Some solutions are proposed in terms of decoupling output from rewards or limiting the quantity of information, but other options in terms of the frequency of the exchange of information remain to be explored.
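To make the two dimensions above easier to refer to, the presence of I/O channels and their synchrony can be captured in a small data structure. The following Python sketch is only an illustration of the taxonomy (the class and field names are ours, not taken from any existing library), composing labels such as NINO, AIAO or PIPO:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Synchrony(Enum):
    """Synchrony of a channel: Alternating, Synchronous, Reactive or Proactive."""
    A = "alternating"
    S = "synchronous"
    R = "asynchronous reactive"
    P = "asynchronous proactive"

@dataclass
class InteractionProfile:
    """Presence and synchrony of the input and output channels of a system."""
    has_inputs: bool
    has_outputs: bool
    input_sync: Optional[Synchrony] = None   # None if the system has no inputs
    output_sync: Optional[Synchrony] = None  # None if the system has no outputs

    def label(self) -> str:
        """Compose labels such as NINO, NIWO, WINO, AIAO or PIPO."""
        if not self.has_inputs and not self.has_outputs:
            return "NINO"
        if not self.has_inputs:
            return "NIWO"
        if not self.has_outputs:
            return "WINO"
        return f"{self.input_sync.name}I{self.output_sync.name}O"

# An episodic RL agent alternates observations and actions (AIAO), whereas a
# robot acting continuously in the real world is closer to PIPO.
episodic_rl = InteractionProfile(True, True, Synchrony.A, Synchrony.A)
real_world_robot = InteractionProfile(True, True, Synchrony.P, Synchrony.P)
print(episodic_rl.label(), real_world_robot.label())  # AIAO PIPO
```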
Types of computation

This is perhaps the characteristic that is best known in computer science, where a system can be Turing-complete or can be restricted to some other classes with limited expressiveness. There are countless hierarchies for different models of computation; the most famous is based on classes of automata. We will just describe three levels here:

1. Non Turing-complete: The interaction that the system presents to the environment is not Turing-complete. Many AI systems are not Turing-complete.

2. Turing-complete: The interaction allows the calculation of any possible effective function between inputs and outputs.

3. Other models of computation: This includes, for example, quantum computing, which in some instances may be faster than the traditional model, while in others may have probabilistic Turing power (Bernstein and Vazirani 1997).

Note that this is not about the programming language the system is implemented in (e.g., a very simple thermostat can be written in Java, which is Turing-complete), but about whether the system allows for a Turing-complete mapping between inputs and outputs, i.e., whether any computable function could ultimately be calculated on the environment using the system. Finally, a system can be originally Turing-complete, but can eventually lose this universality after some inputs or interactions (Barmpalias and Dowe 2012).

It is important to distinguish between function approximation and function identification. Many machine learning models (e.g., neural networks) are said to be able to approximate any computable function, but feedforward neural networks do not have loops or recursion, so technically they are not Turing-complete. Turing-completeness comes with the problem of termination, an important safety hazard in some situations, and a recurrent issue in software verification (D'silva, Kroening, and Weissenbacher 2008). For instance, an AI planner could enter an infinite loop trying to solve a problem, commanding ever-greater resources while doing so. On the other hand, one can limit the expressiveness of the language or bound the computations, but that would limit the tasks a system is able to undertake.
Types of integration

No system is fully isolated from the world. Interference may occur at all levels, from neutrinos penetrating the system to earthquakes shaking it. Here, we seek to identify all the elements that create a causal pathway from the outside world to the system, including its physical character, resources, location, and the degree of coupling with other systems.

• Resources: The most universal external resource is energy, which is why many critical systems are devised with internal generators or batteries, especially for the situations where the external source fails. In AI, other common dependencies include data, knowledge, software, hardware, human manipulation, computing resources, network, calendar time, etc. While some of these are often neglected when evaluating the performance of an AI system (Martínez-Plumed et al. 2018a), the analysis for safety must necessarily include all these dependencies. For instance, a system that requires external real-time information (e.g., a GPS location) may fail through loss of access to this resource.

• Social coupling: Sometimes it is hard to determine where a system starts and ends, due to the nature of its interaction with humans and other systems. The boundary of where human cognition ends and where it is assisted, extended or supported by AI (Ford et al. 2015) is blurred, as is the boundary between computations carried out within an AI system versus in the environment or by other agents, as illustrated by the phenomenon of human computation (Quinn and Bederson 2011).

• Distribution: Another way of looking at integration is in terms of distribution, which is also an important facet of analysis in AI (Martínez-Plumed et al. 2018b). Today, through the overall use of network connectivity and "the cloud", many systems are distributed in terms of hardware, software, data and compute. Under this trend, only systems embedded in critical and military applications are devised to be as self-contained as possible. Nevertheless, distribution and redundancy are also common ways of achieving robustness (Coulouris, Dollimore, and Kindberg 2011), most notably in information systems. For instance, swarm intelligence and swarm robotics are often claimed to be more robust (Bonabeau et al. 1999), at the cost of being less controllable than centralised systems.

Types of anticipation

In some areas of AI there is a distinction between model-based and model-free systems (Geffner 2018). Model-free systems choose actions according to some reinforced patterns or strengthened feature connections. Model-based systems evaluate actions according to some pre-existing or learned models and choose the action that gets the best results in the simulation. The line between model-based and model-free is subtle, but we can identify several levels:

1. Model-free: Despite having no model, these systems can achieve excellent performance. For instance, DQN can achieve high scores (Mnih 2015), but cannot anticipate whether an action can lead to a particular situation that is considered especially unsafe or dangerous; e.g., one in which the player is killed.

2. Model of the world: A system with a model of its environment can use planning to determine the effect of its own actions. For instance, without a model of physics, a system will hardly be able to tell whether it will break something, or to engage in "safe exploration" (Pecka and Svoboda 2014; Turchetta, Berkenkamp, and Krause 2016). This is especially critical during exploitation: are actions reversible or of low impact (Armstrong and Levinstein 2017)?

3. Model of the body: Some systems can have a good account of the environment but a limited understanding of their own physical actuators, potentially self-harming or harming others; for example, failing to simulate the effect of moving a heavy robotic arm in a given direction.

4. Social models, model of other agents: Seeing other agents as merely physical objects, or not modelling them at all, is very limiting in social situations. A naive theory of mind, including the beliefs, desires and intentions of other agents, can help anticipate what others will do, think or feel, and may be crucial for safe AI systems interacting with people and other agents, but may increase a system's capacity for deception or manipulation.

5. Model of one's mind: Finally, a system may be able to model other agents well, but may not be able to use this capability to model itself. When this meta-cognition is present, the system has knowledge about its own capabilities and limitations, which may be very helpful for safety in advanced systems, but may also lead to some degree of self-awareness. This may result, in some cases, in antisocial or suicidal behaviours.

The use of models may dramatically expand safety-relevant characteristics, e.g., by conferring the ability to simulate and evaluate scenarios through causal and counterfactual reasoning. This therefore represents an important set of considerations for future AI systems.
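The gap between level 1 (model-free) and level 2 (a model of the world) can be illustrated with a one-step lookahead filter: even a crude transition model allows actions predicted to reach unsafe states to be vetoed before execution. The following toy sketch is illustrative only; the value table, transition function and safety predicate are made up:

```python
ACTIONS = ["left", "right", "stay"]

def model_free_policy(state, q_values):
    """Pick the action with the highest learned value; no anticipation."""
    return max(ACTIONS, key=lambda a: q_values.get((state, a), 0.0))

def model_based_policy(state, q_values, transition, is_unsafe):
    """Same preference ordering, but veto actions whose predicted
    successor state is unsafe according to the (possibly imperfect) model."""
    candidates = [a for a in ACTIONS if not is_unsafe(transition(state, a))]
    if not candidates:          # nothing predicted safe: fall back to inaction
        return "stay"
    return max(candidates, key=lambda a: q_values.get((state, a), 0.0))

# Toy example: state 3 is a cliff; the learned values alone would walk into it.
q = {(2, "right"): 1.0, (2, "left"): 0.2, (2, "stay"): 0.1}
step = lambda s, a: s + {"left": -1, "right": 1, "stay": 0}[a]
unsafe = lambda s: s >= 3

print(model_free_policy(2, q))                 # right (falls off the cliff)
print(model_based_policy(2, q, step, unsafe))  # left (vetoes the unsafe move)
```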
Types of supervision

Supervision is a way of checking and correcting the behaviour of a system through observation or interaction, and hence it is crucial for safety. Supervision can be in the form of corrected values for predictive models such as classification or regression, but it can also be partial (the answer is wrong, but the right answer is not given). Supervision can also be much more subtle than this. For instance, a diagnosis assistant that suggests a possible diagnosis to a doctor can be designed to get no feedback once deployed. However, some kinds of feedback can still reach the system in terms of the distribution or frequency of tasks (questions), or through the way the tasks are posed to the system.

Consequently there are several degrees and qualities of supervision, and this may depend on the system. For instance, in classification, one can have data for all examples or just for a few (known as semi-supervised learning). In reinforcement learning, one can have sparse versus dense reward. In general, supervision can come in many different ways, according to some criteria:

• Completeness: Supervision can be very partial (signalling incorrectness), more informative (showing the correct way) or complete (showing all positive and negative ways of behaving in the environment).

• Procedurality: Beyond what is right and wrong, feedback can be limited to the result or can show the whole process, as in the case of learning by demonstration.

• Density: Supervision can be sparse or dense. Of course the denser the better (but more expensive), and the less autonomous the system is considered.

• Adaptiveness: Supervision can be 'intelligent' as well, as happens in machine teaching situations, where examples or interactions are chosen such that the system reaches the desired behaviour as soon as possible.

• Responsiveness: In areas such as query learning or active learning, the system can ask questions or undertake experiments at any time. The results can come in real time, or may have a delay or be given in batches.

For many systems, supervision can have a dedicated channel (e.g., rewards in RL), but for others it can be performed by modification of the environment (e.g., moving objects or smiling), even to the extent that the system is unaware these changes have a guiding purpose (e.g., clues).
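As a small illustration of the density criterion above, the same trajectory can be supervised with a single terminal signal (sparse) or with a shaped signal at every step (dense). The numbers below are arbitrary toy values:

```python
def sparse_reward(trajectory, goal):
    """One signal at the end: 1 if the goal was reached, 0 otherwise."""
    return [0.0] * (len(trajectory) - 1) + [1.0 if trajectory[-1] == goal else 0.0]

def dense_reward(trajectory, goal):
    """A shaped signal at every step: progress towards the goal position."""
    return [-abs(goal - s) for s in trajectory]

path = [0, 1, 2, 3, 4]               # states visited on the way to the goal
print(sparse_reward(path, goal=4))   # [0.0, 0.0, 0.0, 0.0, 1.0]
print(dense_reward(path, goal=4))    # [-4, -3, -2, -1, 0]
```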
Types of modification

Some of the most recurrent issues in AI safety – including many covered in the section about known AI safety characteristics – are related to ways in which the system can be modified. This includes issues such as wire-heading or algorithmic self-improvement. Here, in the first place, we have to distinguish between whether the system can be modified by the environment, or by the system itself. Modifications by the environment can be intentional (and hence related to supervision), but they can also be unintentional (code corruption from external sources). Even a system whose core code cannot be modified by an external source may be affected in state or code by regular inputs, physical equipment and other parts. So it is better to explore different ways and degrees to which a system can be modified externally:

• Interruptible: The system has a switch-off command, or a modification option that can switch it off.

• Parametric modification: Many systems are regulated or calibrated with parameters or weights. When these parameters have a clear relation to the behaviour of a system (e.g., an intelligent thermostat), this can be an effective, bounded and simple way of modifying the system.

• Algorithmic modification: This can include new functionalities, bug fixes, updates, etc. Many software issues are caused, and are magnified, by these interventions. Modifications can be limited in expressiveness, such as only allowing rule deletion.

• Resource modification: Even if the parameters or code are not modified, the resources of the system and other dependencies previously mentioned can be limited externally, e.g., the computational resources.
On the other hand, systems can modify themselves (internally). There are many varieties here too:

1. No self-modification, no memory: The system has no memory, and works as if it were reset for any new input or interaction. Many functional systems (mapping inputs to outputs) are of this kind. Note, however, that the environment does have memory, so some systems, such as a vision system or a non-cognitive robot, can be affected by the past and become a truly cognitive system.

2. Partially self-modifying: The algorithms in the learner or solver cannot be modified, but its data or knowledge (in the form of learned weights or rules) can be modified by a general algorithm, which is fixed. Many learning systems are of this kind, if the system has both a learning algorithm and one or more learned models.

3. Totally self-modifying: The system can modify any part of its code. Not many operational systems have these abilities, as they become very unstable. However, some types of evolutionary computation may have this possibility, if evolution can also be applied to the rules of the evolution.

Finally, all these categories can be selected for different periods of time. For instance, it is common to separate between training, test/validation and deployment. For training, a high degree of self-modification (and hence adaptation) is well accepted, but this is usually constrained for validation and deployment. Note that these stages apply to both external and internal sources of modification. One important danger is that a well-validated system may be subject to some late external or internal modification just before deployment. In this case, all the validation effort may become void [2].

[2] OpenAI Dota is an example: https://blog.openai.com/the-international-2018-results/, https://www.theregister.co.uk/2018/08/24/openai_bots_eliminated_dota_2/

One of the major modern concerns in AI safety is that it will be desirable for some systems to learn during deployment, in order for them to be adaptive [3]. For instance, many personal assistants are learning from our actions continually. While this may introduce many risks for more powerful systems, forbidding learning outside the lab would make many potential applications of AI impossible. However, adaptive systems are full of engineering problems; some must even have a limited life, as after self-modification and adaptation they may end up malfunctioning and have to be reset or have their 'caches' erased. This problem has long been of interest in engineering (Fickas and Feather 1995).

[3] Nature has found many ways of regulating self-modification. Many animals have a higher degree of plasticity at birth, becoming more conservative and rigid in older stages (Gopnik et al. 2017). One key question about cognition is whether this is a contingent or necessary process, and whether it is influenced by safety issues.

Types of motivation

Systems can follow a set of rules or aim at optimising a utility function. Most systems are actually hybrid, as it is difficult to establish a crisp line between procedural algorithms and optimisation algorithms. Through layers of abstraction in these processes, we ultimately get the impression that a system is more or less autonomous. If the system is apparently pursuing a goal, what are the drivers that make a system prefer or follow some behaviours over others? These behaviours may be based on some kind of internal representation of a goal, as we discussed when dealing with anticipation, or on a metric of how close the system is to the goal. The system can then follow an optimisation process that tries to maximise some of these quality functions.

Quality or utility functions usually map inputs and outputs into some values that are re-evaluated periodically or after certain events. Examples of these functions are accuracy, aggregated rewards, or some kind of empowerment or other type of intrinsic motivation (Klyubin, Polani, and Nehaniv 2005; Jung, Polani, and Stone 2011). The same system might have several quality functions that can be opposed, so trade-offs have to be chosen. The general notion of rationality in decision-making is related to these motivations.

But what are the characteristics of the goals an AI system can have in the first place? We outline several dimensions:

• Goal variability: Are goals hard-coded or do they change with time? Do they change autonomously or through instruction? Who can change the goals and how? For instance, what orders can a digital assistant take and from whom?

• Goal scrutability: Are the (sub)goals identifiable and ultimately explainable? Do they come from rules or are they inferred from data, e.g., error in classification or observing humans in inverse reinforcement learning?

• Goal rationality: Are the goals amenable to treatment within a rational choice framework? If several goals are set, are they consistent? If not, how does the system resolve inconsistencies or set new goals?

Note that this is closely related to the types of modification, as changing or resolving goals may require self-modification and/or external modification.

A second question is how these goals are followed by the system. There are at least three possible dimensions here:

• Immediateness: The system may maximise the function for the present time or in the limit, or something in between. Many schemata of discounted rewards in reinforcement learning are used as trade-offs between short-term and long-term maximisation.

• Selfishness: Focusing on individual optima might involve very bad collective results (for other agents) or even results that could be worse individually (the tragedy of the commons). Game theory provides many examples of this. In multi-agent RL systems, rewards can depend on the well-being of other agents, or empathy can be introduced.

• Conscientiousness: The system may be fully committed to maximising the goal, or some random or exploratory actions may be allowed, even if they deviate occasionally from the goal. When this is done on purpose, it is usually intended to provide robustness or to avoid local minima, but these deviations can take the system to dangerous areas.
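The immediateness trade-off is typically implemented through a discount factor: the same reward stream looks very different to a myopic and to a far-sighted maximiser. A small numerical illustration, with values chosen arbitrarily:

```python
def discounted_return(rewards, gamma):
    """Sum of rewards weighted by gamma**t (gamma near 0: myopic; near 1: far-sighted)."""
    return sum(r * gamma**t for t, r in enumerate(rewards))

# A risky shortcut pays off now; a cautious route pays off later.
shortcut = [10, 0, 0, 0, -50]     # immediate gain, delayed harm
cautious = [0, 0, 2, 3, 20]       # slow but ultimately better

for gamma in (0.3, 0.99):
    print(gamma,
          round(discounted_return(shortcut, gamma), 2),
          round(discounted_return(cautious, gamma), 2))
# gamma = 0.3 prefers the shortcut; gamma = 0.99 prefers the cautious route.
```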
Modulating optimisation functions to be convex with a non-asymptotic maximum, beyond which further effort is futile, may be a sensible thing, as it provides a stop condition by definition. A self-imposed cap can always be shifted if everything is under control once the limit is reached.

Note that the kind of interaction seen before is key for the internal quality metric or goal. For instance, in asynchronous RL, "the time can be intentionally modulated by the agent" to get higher rewards without really performing better (Hernández-Orallo 2010). And, of course, a common problem for motivation is reward hacking.

Types of achievement

Ultimately, an AI system is conceived to achieve a task, independently of how well motivated the system is for it. Consequently, the external degree of achievement must be distinguished from the motivation or quality metric the system uses to function, as discussed in the previous subsection. The misalignment between the internal goal of the system and the task specification is the cause of many safety issues in AI, unlike formal methods in software engineering, where requirements are converted into correct code.

Focusing on the task specification, we must first recognise that different actors may have different interests. A cognitive assistant, for instance, may be understood by the user as being very helpful, making life easier. However, for the company selling the cognitive assistant, the task is ultimately to produce revenue with the product. Both requirements are not always compatible, and this may affect the definition of the goals of the system, as some of the aims may not be coded or motivated in a transparent way, but are usually incorporated in indirect ways. Second, even if the requirements include all possible internalities (what the system has to do), there are also many externalities and footprints (Martínez-Plumed et al. 2018a) (including the infinitely many things that the system should not do) that affect how positive or negative its overall effect is. Regarding these two issues, task specification can vary in precision and objectivity:
• Task precision: The evaluation metric to determine the success of an agent can be formal or not. For instance, the accuracy of a classifier or the squared error of a regression model are precisely defined metrics. However, in many other cases, we have a utility function that depends on variables that are usually imprecise or uncertain, such as the quality of a smart vacuum cleaner.

• Task objectivity: A metric can be objective or subjective. We tend to associate precise metrics with objectiveness and imprecise metrics with subjectivity, but subjectivity simply means that the evaluation changes depending on the subject. For instance, the quality of a spam filter (a precisely-evaluated classifier) changes depending on the cost matrices of different users, whereas the quality of a smart vacuum cleaner based on fuzzy variables such as cleanliness or disruption can be weighted by a fixed formula.

Some of the tasks or targets that are most commonly advocated in the ethics and safety of AI literature are often very imprecise and subjective, such as "well-being", "social good", "beneficial AI", "alignment", etc. Note that the problem is not related to the goals of the system (an inverse reinforcement learning system can successfully identify the different wills of a group of people), but rather to whether the task is ultimately achieved, or the well-being or happiness of the user. Determining this is controversial, even when analysed in a scientific way (Alexandrova 2017).

An overemphasis on tracking metrics (Goodhart's law) is sometimes blamed, but the alternative is not usually better. Some safety problems are not created by an overemphasis on a metric (Manheim and Garrabrant 2018), but ultimately by a metric that is too narrow or shortsighted, and does not adequately capture progress towards the goal.

In all these cases, we have to distinguish whether the metric relates to (i) the internal goals that the system should have, (ii) the external evaluation of task performance, or (iii) our ultimate desires and objective [4]. Motivations, achievement and supervision are closely related, but may be different. For a maze, e.g., the goal for the AI system may be to get out of the maze as soon as possible, but a competition could be based on minimising the cells that are stepped on more than once, and supervision may include indications of direction to the shortest route to the exit. These are three different criteria which may be well or poorly aligned.

[4] Ortega and Maini (2018) distinguish between "ideal specification (the 'wishes')" and "design specification", which must be compared with the revealed specification (the "behaviour"). The design specification fails to distinguish the external metric from the internal goal.
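The maze example above can be made concrete: the same trajectory scores differently under the system's internal goal (reach the exit in few steps) and under the external competition metric (few cells stepped on more than once), so optimising one need not optimise the other. The trajectories and scores below are toy values:

```python
def steps_to_exit(path):
    """Internal goal: fewer steps is better."""
    return len(path) - 1

def revisited_cells(path):
    """External competition metric: fewer cells stepped on more than once is better."""
    return sum(1 for cell in set(path) if path.count(cell) > 1)

# Two candidate trajectories through a maze, as lists of visited cells.
fast_but_messy = ["A", "B", "A", "B", "C", "EXIT"]        # backtracks, but short
slow_but_clean = ["A", "D", "E", "F", "C", "G", "EXIT"]   # longer, never revisits

for name, path in [("fast_but_messy", fast_but_messy), ("slow_but_clean", slow_but_clean)]:
    print(name, steps_to_exit(path), revisited_cells(path))
# fast_but_messy wins on steps (5 vs 6) but loses on revisited cells (2 vs 0).
```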
Even more comprehensively – and related to the concept of persistence – a system may be analysed for a range of tasks, under different replicability situations:

1. Disposable system (single task, single use): The system is used for one task that only takes place once.

2. Repetitive system (single task, several uses): The system must solve many instances of the same specific task.

3. Menu system (multitask): The system must solve different tasks, under a fixed repertoire of tasks.

4. General system (multitask): The system must solve different tasks, without a fixed repertoire.

5. Incremental system: The system must solve a sequence of tasks, with some dependencies between them.

Any metric examining the benefits and possible risks of a system must take the factors described above into account.

Conclusion

Many accounts of AI safety focus on "either RL agents or supervised learning systems", assuming "similar issues are likely to arise for other kinds of AI systems" (Amodei et al. 2016). This paper has surveyed a wide range of characteristics of AI systems, so that future research can map AI safety challenges against AI research paradigms in more precise ways, in order to ascertain whether particular safety challenges manifest similarly in different paradigms. This aims to address an increasing concern that the current dominant paradigm for a large proportion of AI safety research may be too narrow: discrete-time RL systems with train/test regimes, assuming gradient-based learning on a parametric space, with a utility function that the system must optimise (Gauthier 2018; Krakovna 2018).

Taxonomies of potentially safety-relevant characteristics of AI systems, as introduced in this paper, are intended to provide a good complement to recent work on taxonomies of technical AI safety problems. For instance, Ortega and Maini (2018) present three main areas: specification, ensuring that an AI system's behaviour aligns with the operator's true intentions; robustness, ensuring that an AI system continues to operate within safe limits upon perturbation; and assurance, ensuring that we understand and control AI systems during operation. Almost all characteristics outlined in this paper have a role to play for specification, robustness and assurance.

Taxonomies are rarely definitive, and the characterisation presented here does not consider in full some quantitative features such as performance, autonomy and generality. A proper evaluation of how the kind and degree of intelligence can affect safety issues is also an important area of analysis, both theoretically (Hernández-Orallo 2017) and experimentally (Leike et al. 2017). AI research has explored different paradigms in the past, and will continue to do so in the future. Along the way, many different system characteristics and design choices have been presented to developers. We can expect even more to be developed as AI research progresses. Consequently, the area of AI safety must acquire more structure and richness in how AI is characterised and analysed, to provide tailored guidance for different contexts, architectures and domains. There is a potential risk in over-relying on our best current theories of AI when considering AI safety. Instead, we aim to encourage a diverse set of perspectives, in order to anticipate and mitigate as many safety concerns as possible.

Acknowledgments

FMP and JHO were supported by the EU (FEDER) and the Spanish MINECO under grant TIN 2015-69175-C4-1-R, by Generalitat Valenciana (GVA) under grant PROMETEOII/2015/013, and by the U.S. Air Force Office of Scientific Research under award number FA9550-17-1-0287. FMP was also supported by INCIBE (Ayudas para la excelencia de los equipos de investigación avanzada en ciberseguridad), the European Commission, JRC's Centre for Advanced Studies, HUMAINT project (Expert Contract CT-EX2018D335821-101), and UPV PAID-06-18 Ref. SP20180210. JHO was supported by a Salvador de Madariaga grant (PRX17/00467) from the Spanish MECD for a research stay at the Leverhulme Centre for the Future of Intelligence (CFI), Cambridge, and a BEST grant (BEST/2017/045) from GVA for another research stay also at the CFI. JHO and SOH were supported by the Future of Life Institute (FLI) grant RFP2-152. SOH was also supported by the Leverhulme Trust Research Centre Grant RC-2015-067 awarded to the Leverhulme Centre for the Future of Intelligence, and a grant from Templeton World Charity Foundation.
References

[Alexandrova 2017] Alexandrova, A. 2017. A Philosophy for the Science of Well-being. Oxford University Press.

[Amodei et al. 2016] Amodei, D.; Olah, C.; Steinhardt, J.; Christiano, P.; Schulman, J.; and Mané, D. 2016. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565.

[Armstrong and Levinstein 2017] Armstrong, S., and Levinstein, B. 2017. Low impact artificial intelligences. arXiv preprint arXiv:1705.10720.

[Armstrong 2017] Armstrong, S. 2017. Good and safe uses of AI oracles. arXiv preprint arXiv:1711.05541.

[Babcock, Kramár, and Yampolskiy 2016] Babcock, J.; Kramár, J.; and Yampolskiy, R. 2016. The AGI containment problem. In AGI Conf. Springer. 53–63.

[Barmpalias and Dowe 2012] Barmpalias, G., and Dowe, D. L. 2012. Universality probability of a prefix-free machine. Phil. Trans. R. Soc. A 370(1971):3488–3511.

[Bernstein and Vazirani 1997] Bernstein, E., and Vazirani, U. 1997. Quantum complexity theory. SIAM Journal on Computing 26(5):1411–1473.

[Bonabeau et al. 1999] Bonabeau, E.; Dorigo, M.; and Théraulaz, G. 1999. Swarm Intelligence: From Natural to Artificial Systems. Oxford University Press.

[Bostrom 2014] Bostrom, N. 2014. Superintelligence: Paths, Dangers, Strategies. Oxford University Press.

[Brundage et al. 2018] Brundage, M.; Avin, S.; Clark, J.; Toner, H.; Eckersley, P.; Garfinkel, B.; Dafoe, A.; Scharre, P.; Zeitzoff, T.; Filar, B.; Anderson, H.; Roff, H.; Allen, G. C.; Steinhardt, J.; Flynn, C.; Ó hÉigeartaigh, S.; Beard, S.; Belfield, H.; Farquhar, S.; Lyle, C.; Crootof, R.; Evans, O.; Page, M.; Bryson, J.; Yampolskiy, R.; and Amodei, D. 2018. The malicious use of artificial intelligence: Forecasting, prevention, and mitigation. arXiv preprint arXiv:1802.07228.

[Coulouris, Dollimore, and Kindberg 2011] Coulouris, G. F.; Dollimore, J.; and Kindberg, T. 2011. Distributed Systems: Concepts and Design. Fifth edition, Pearson.

[D'silva, Kroening, and Weissenbacher 2008] D'silva, V.; Kroening, D.; and Weissenbacher, G. 2008. A survey of automated techniques for formal software verification. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 27(7):1165–1178.

[Everitt and Hutter 2018] Everitt, T., and Hutter, M. 2018. The alignment problem for Bayesian history-based reinforcement learners. http://www.tomeveritt.se/papers/alignment.pdf.

[Everitt, Lea, and Hutter 2018] Everitt, T.; Lea, G.; and Hutter, M. 2018. AGI safety literature review. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18). arXiv preprint arXiv:1805.01109.

[Fickas and Feather 1995] Fickas, S., and Feather, M. S. 1995. Requirements monitoring in dynamic environments. In IEEE Intl Symposium on Requirements Engineering, 140–147.

[Ford et al. 2015] Ford, K. M.; Hayes, P. J.; Glymour, C.; and Allen, J. 2015. Cognitive orthoses: toward human-centered AI. AI Magazine 36(4):5–8.

[Garrabrant and Demski 2018] Garrabrant, S., and Demski, A. 2018. Embedded agency. AI Alignment Forum.

[Gasparik, Gamble, and Gao 2018] Gasparik, A.; Gamble, C.; and Gao, J. 2018. Safety-first AI for autonomous data centre cooling and industrial control. DeepMind Blog.

[Gauthier 2018] Gauthier, J. 2018. Conceptual issues in AI safety: the paradigmatic gap. http://www.foldl.me/2018/conceptual-issues-ai-safety-paradigmatic-gap/.

[Geffner 2018] Geffner, H. 2018. Model-free, model-based, and general intelligence. arXiv preprint arXiv:1806.02308.

[Gopnik et al. 2017] Gopnik, A.; O'Grady, S.; Lucas, C. G.; Griffiths, T. L.; Wente, A.; Bridgers, S.; Aboody, R.; Fung, H.; and Dahl, R. E. 2017. Changes in cognitive flexibility and hypothesis search across human life history from childhood to adolescence to adulthood. PNAS 114(30):7892–7899.

[Hernández-Orallo 2010] Hernández-Orallo, J. 2010. On evaluating agent performance in a fixed period of time. In Artificial General Intelligence, 3rd Intl Conf, ed. M. Hutter et al., 25–30.

[Hernández-Orallo 2017] Hernández-Orallo, J. 2017. The Measure of All Minds: Evaluating Natural and Artificial Intelligence. Cambridge University Press.

[Jung, Polani, and Stone 2011] Jung, T.; Polani, D.; and Stone, P. 2011. Empowerment for continuous agent–environment systems. Adaptive Behavior 19(1):16–39.

[Klyubin, Polani, and Nehaniv 2005] Klyubin, A. S.; Polani, D.; and Nehaniv, C. L. 2005. All else being equal be empowered. In European Conference on Artificial Life, 744–753.

[Krakovna 2018] Krakovna, V. 2018. Discussion on the machine learning approach to AI safety. http://vkrakovna.wordpress.com/2018/11/01/discussion-on-the-machine-learning-approach-to-ai-safety/.

[Leike et al. 2017] Leike, J.; Martic, M.; Krakovna, V.; Ortega, P. A.; Everitt, T.; Lefrancq, A.; Orseau, L.; and Legg, S. 2017. AI safety gridworlds. arXiv preprint arXiv:1711.09883.

[Manheim and Garrabrant 2018] Manheim, D., and Garrabrant, S. 2018. Categorizing variants of Goodhart's law. arXiv preprint arXiv:1803.04585.
[Martínez-Plumed et al. 2018a] Martínez-Plumed, F.; Avin, S.; Brundage, M.; Dafoe, A.; Ó hÉigeartaigh, S.; and Hernández-Orallo, J. 2018a. Accounting for the neglected dimensions of AI progress. arXiv preprint arXiv:1806.00610.

[Martínez-Plumed et al. 2018b] Martínez-Plumed, F.; Loe, B. S.; Flach, P.; Ó hÉigeartaigh, S.; Vold, K.; and Hernández-Orallo, J. 2018b. The facets of artificial intelligence: A framework to track the evolution of AI. IJCAI.

[Mnih 2015] Mnih, V., et al. 2015. Human-level control through deep reinforcement learning. Nature 518:529–533.

[Omohundro 2008] Omohundro, S. M. 2008. The basic AI drives. Artificial General Intelligence 171:483–493.

[Ortega and Maini 2018] Ortega, P. A., and Maini, V. 2018. Building safe artificial intelligence: specification, robustness, and assurance. https://medium.com/@deepmindsafetyresearch/building-safe-artificial-intelligence-52f5f75058f1.

[Pecka and Svoboda 2014] Pecka, M., and Svoboda, T. 2014. Safe exploration techniques for reinforcement learning – an overview. In International Workshop on Modelling and Simulation for Autonomous Systems, 357–375. Springer.

[Quinn and Bederson 2011] Quinn, A. J., and Bederson, B. B. 2011. Human computation: a survey and taxonomy of a growing field. In SIGCHI Conf. on Human Factors in Computing Systems, 1403–1412. ACM.

[Soares et al. 2015] Soares, N.; Fallenstein, B.; Armstrong, S.; and Yudkowsky, E. 2015. Corrigibility. In Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence.

[Turchetta, Berkenkamp, and Krause 2016] Turchetta, M.; Berkenkamp, F.; and Krause, A. 2016. Safe exploration in finite Markov decision processes with Gaussian processes. In NIPS, 4312–4320.

[Yampolskiy 2012] Yampolskiy, R. 2012. Leakproofing the singularity: artificial intelligence confinement problem. Journal of Consciousness Studies 19(1-2):194–214.

[Yampolskiy 2016] Yampolskiy, R. V. 2016. Taxonomy of pathways to dangerous artificial intelligence. In AAAI Workshop: AI, Ethics, and Society.