<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards Robust End-to-End Alignment</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lê Nguyên Hoang</string-name>
          <aff>EPFL, Chemin Alan Turing, Lausanne, Switzerland</aff>
        </contrib>
      </contrib-group>
      <abstract>
        <p>Robust alignment is arguably both critical and extremely challenging. Loosely, it is the problem of designing algorithmic systems with strong guarantees of always being beneficial for mankind. In this paper, we propose a preliminary research program to address it in a reinforcement learning framework. This roadmap aims at decomposing the end-to-end alignment problem into numerous more tractable subproblems. We hope that each subproblem is sufficiently orthogonal to others to be tackled independently, and that combining the solutions to all such subproblems may yield a solution to alignment.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        As they are becoming more and more capable and
ubiquitous, AIs are raising numerous concerns, including
fairness, privacy, filter bubbles, addiction, job displacement or
even existential risks
        <xref ref-type="bibr" rid="ref53 ref62">(Russell, Dewey, and Tegmark 2015;
Tegmark 2017)</xref>
        . It has been argued that aligning the goals
of AI systems with humans’ preferences would be an
efficient way to make them reliably beneficial and to avoid
potential catastrophic risks
        <xref ref-type="bibr" rid="ref14 ref30 ref31 ref8">(Bostrom 2014; Hoang 2018a)</xref>
        .
In fact, given the global influence of today’s large-scale
recommender systems
        <xref ref-type="bibr" rid="ref37">(Kramer, Guillory, and Hancock 2014)</xref>
        ,
it already seems urgent to propose even partial solutions to
alignment.
      </p>
      <p>
        Unfortunately, it has also been argued that alignment is
an extremely difficult problem. In fact,
        <xref ref-type="bibr" rid="ref14">(Bostrom 2014)</xref>
        argues that it “is a research challenge worthy of some of the
next generation’s best mathematical talent”. To address it,
the Future of Life Institute proposed a landscape of AI safety
research<sup>1</sup>. Meanwhile,
        <xref ref-type="bibr" rid="ref57 ref59">(Soares 2015; Soares and Fallenstein
2017)</xref>
        listed important ideas in this line of work. We hope
that this paper will help outline the main challenges
posed by alignment.
      </p>
      <p>In particular, we shall introduce a complete research
program to robustly align AIs. Robustness here refers to
numerous possible failure modes, including overfitting, hazardous
exploration, evasion attacks, poisoning attacks, crash
tolerance, Byzantine resilience, reward hacking and wireheading.
To guarantee such robustness, we argue that it is desirable
to structure (at least conceptually) our AI systems in their
entirety. This motivated us to propose a roadmap for robust
end-to-end alignment.</p>
      <p><sup>1</sup>https://futureoflife.org/landscape/</p>
      <p>While much of our proposal is speculative, we believe that
several of the ideas presented here will be critical for AI
safety and alignment. More importantly, we hope that this
will be a useful roadmap for both AI experts and non-experts
to better estimate how they can best contribute to the effort.</p>
      <p>Given the complexity of the problem, our roadmap here
will likely be full of gaps and of ideas that seem good but are not. It is important
to note that our purpose is not to propose a definitive, perfect
solution. Rather, we aim at presenting a sufficiently good
starting point for others to build upon.</p>
    </sec>
    <sec id="sec-2">
      <title>The Roadmap</title>
      <p>Our roadmap consists of identifying key steps to alignment.
For the sake of exposition, these steps will be personified
by 5 characters, called Alice, Bob, Charlie, Dave and Erin.
Roughly speaking, Erin will be collecting data from the
world, Dave will use these data to infer the likely states of
the world, Charlie will compute the desirability of the likely
states of the world, Bob will derive incentive-compatible
rewards to motivate Alice to take the right decision, and
Alice will optimize decision-making. This decomposition is
graphically represented in Figure 1.</p>
      <p>Evidently, Alice, Bob, Charlie, Dave and Erin need not be
5 different AIs. Typically, it may be much more
computationally efficient to merge Charlie and Dave. Nevertheless,
at least for pedagogical reasons, it seems useful to first
dissociate the different roles that these AIs have.</p>
      <p>In the sequel, we shall further detail the challenges posed
by each of the 5 AIs. We shall also argue that, for robustness
and scalability reasons, these AIs will need to be further
divided into many more AIs. We will see that this raises
additional challenges. We shall also make a few non-technical
remarks, before concluding.</p>
    </sec>
    <sec id="sec-3">
      <title>Alice’s reinforcement learning</title>
      <p>It seems that today’s most promising framework for
large-scale AIs is that of reinforcement learning. In reinforcement
learning, an AI can be regarded as a decision-making
process. At time t, the AI observes some state of the world s<sub>t</sub>.
Depending on its inner parameters θ<sub>t</sub>, it then takes (possibly
randomly) some action a<sub>t</sub>.</p>
      <p>The action a<sub>t</sub> then influences the next state and turns it
into s<sub>t+1</sub>. The transition from s<sub>t</sub> to s<sub>t+1</sub> given action a<sub>t</sub> is
usually assumed to be nondeterministic. In any case, the AI
then receives a reward R<sub>t+1</sub>. The internal parameters θ<sub>t</sub> of
the AI may then be updated into θ<sub>t+1</sub>, depending on
previous parameters θ<sub>t</sub>, action a<sub>t</sub>, state s<sub>t+1</sub> and reward R<sub>t+1</sub>.</p>
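      <p>As an illustration only, the interaction loop just described can be sketched in Python. The toy environment, the two-action policy and the update rule below are hypothetical placeholders that do not come from this roadmap; they merely show how s<sub>t</sub>, θ<sub>t</sub>, a<sub>t</sub> and R<sub>t+1</sub> chain together from one time step to the next.</p>
      <preformat><![CDATA[
```python
import random

def environment_step(state, action, rng):
    """Hypothetical nondeterministic transition: returns (s_{t+1}, R_{t+1})."""
    noise = rng.choice([-1, 0, 1])          # the transition is not deterministic
    next_state = state + action + noise
    reward = -abs(next_state)               # toy reward: stay close to 0
    return next_state, reward

def choose_action(theta, state, rng):
    """Possibly random action a_t, depending on theta_t and s_t."""
    if rng.random() < 0.1:                  # occasional random exploration
        return rng.choice([-1, 1])
    return -1 if theta * state > 0 else 1

def update(theta, action, next_state, reward):
    """theta_{t+1} as a function of (theta_t, a_t, s_{t+1}, R_{t+1}); arbitrary toy rule."""
    return theta + 0.01 * reward * action * next_state

def run_episode(steps=50, seed=0):
    rng = random.Random(seed)
    state, theta = 5, 1.0
    history = []
    for _ in range(steps):
        action = choose_action(theta, state, rng)                    # a_t
        next_state, reward = environment_step(state, action, rng)    # s_{t+1}, R_{t+1}
        theta = update(theta, action, next_state, reward)            # theta_{t+1}
        history.append((state, action, next_state, reward))
        state = next_state
    return history, theta

history, final_theta = run_episode()
```
]]></preformat>
      <p>Restricting the learning capabilities of such an agent would amount to constraining which functions may play the role of update.</p>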
      <p>Note that this is a very general framework. In fact, we
humans are arguably (at least partially) subject to this
framework. At any point in time, we observe new data s<sub>t</sub> that
informs us about the world. Using an inner model of the world
θ<sub>t</sub>, we then infer what the world probably is like, which
motivates us to take some action a<sub>t</sub>. This may affect what
next data s<sub>t+1</sub> will likely be observed, and may be accompanied
by a rewarding (or painful) feeling R<sub>t+1</sub>, which will
motivate us to update our inner model of the world θ<sub>t</sub> into θ<sub>t+1</sub>.</p>
      <p>Let us call Alice the AI in charge of performing this
reinforcement learning reasoning. Alice can thus be viewed as
an algorithm, which takes observed states s<sub>t</sub> and rewards
R<sub>t</sub> as inputs, and undertakes actions a<sub>t</sub> so as to typically maximize
some discounted sum of expected future rewards.</p>
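      <p>One common way to write this objective, under the assumption (not stated in the text) of a fixed discount factor γ in [0, 1), is the expected discounted return. The symbols G<sub>t</sub> and γ are notation introduced for this illustration only:</p>
      <p><disp-formula><tex-math>G_t = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}\right]</tex-math></disp-formula></p>
      <p>A value of γ close to 1 makes Alice weigh distant rewards almost as much as immediate ones, whereas a small γ makes her short-sighted.</p>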
      <p>Such actions will probably mostly take the form of
messages sent through the Internet. This may sound benign. But
it is not. The YouTube recommender system might suggest
billions of antivax videos, causing a major decrease in
vaccination and a resurgence of deadly diseases. Worse, if an AI is
in control of 3D printers, then a message that tells them to
construct killer drones to cause a genocide would be
catastrophic. On a brighter note, if an AI now promotes
convincing eco-friendly messages every day to billions of people,
public opinion on climate change may shift greatly.</p>
      <p>Note that, as opposed to all other components, in some
sense, Alice is the real danger. Indeed, in our framework,
she is the only one that really undertakes actions. More
precisely, only her actions will be unconstrained (although
others highly influence her decision-making and are thus
critical as well).</p>
      <p>
        As a result, it is of the utmost importance that Alice be
well-designed. Some past work
        <xref ref-type="bibr" rid="ref12 ref18 ref33 ref51">(Orseau and
Armstrong 2016; El Mhamdi et al. 2017)</xref>
        has proposed to
restrict the learning capabilities of Alice to provide provable
desirable properties. Typically, these works proposed to allow only
a subclass of learning algorithms, i.e. of update rules for θ<sub>t+1</sub>
as a function of (θ<sub>t</sub>, a<sub>t</sub>, s<sub>t+1</sub>, R<sub>t+1</sub>). However, such
restrictions might be too costly, which may be a big problem.
      </p>
      <p>Indeed, there is already a race between competing
companies in competing countries to construct powerful AIs.
While it might be possible for some countries to impose
some restrictions on some AIs of some companies, it is
unlikely that all companies of all countries will agree to be
restricted, especially if the restrictions are too constraining.
In fact, AI safety will be useful only if the most powerful
AIs are all subject to safety measures. As a result, the safety
measures that are proposed should not be too
constraining. In other words, there are constraints on the safety
constraints that can be imposed. This is what makes AI safety
so challenging.</p>
      <p>
        As a result, what is perhaps more interesting are the ideas
proposed by
        <xref ref-type="bibr" rid="ref1">(Amodei et al. 2016)</xref>
        to make reinforcement
learning safer, especially using model lookahead. This
essentially corresponds to Alice simulating many likely
scenarios before undertaking any action. More generally, Alice
faces a safe exploration problem.
      </p>
      <p>But this is not all. Given that AIs will likely be based on
machine learning, and given the lack of verification
methods for AIs obtained by machine learning, we should not
expect AIs to be correct all the time. Just like humans, AIs
will likely be sometimes wrong. But this is extremely
worrisome. Indeed, even if an AI is right 99.9999% of the time,
it will still be wrong one time out of a million. Yet, AIs like
recommender systems or autonomous cars make billions of
decisions every day. In such cases, thousands of AI
decisions may be unboundedly wrong every day!</p>
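      <p>The order of magnitude can be checked with a short computation. The error rate is the one quoted in the text; the daily volume of 2 billion decisions is a hypothetical figure chosen only to stand in for “billions of decisions every day”.</p>
      <preformat><![CDATA[
```python
def expected_wrong_decisions(decisions_per_day, error_rate):
    """Expected number of wrong decisions per day (by linearity of expectation)."""
    return decisions_per_day * error_rate

# From the text: right 99.9999% of the time, i.e. wrong once per million decisions.
error_rate = 1 / 1_000_000
# Hypothetical volume standing in for "billions of decisions every day".
decisions_per_day = 2_000_000_000

wrong_per_day = expected_wrong_decisions(decisions_per_day, error_rate)
print(f"Expected wrong decisions per day: {wrong_per_day:,.0f}")
```
]]></preformat>
      <p>Under these assumptions, the expected number of wrong decisions is on the order of two thousand per day.</p>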
      <p>
        This problem can become even more worrisome if we
take into account the fact that hackers may attempt to take
advantage of the AIs’ deficiencies. Such hackers may
typically submit only data that corresponds to cases where the
AIs are wrong. This is known as evasion attacks
        <xref ref-type="bibr" rid="ref22 ref41 ref57 ref61">(Lowd and
Meek 2005; Su, Vargas, and Kouichi 2017; Gilmer et al.
2018)</xref>
        . To avoid evasion attacks, it is crucial for an AI to
never be unboundedly wrong, e.g. by reliably measuring its
own confidence in its decisions and asking for help in cases
of great uncertainty.
      </p>
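      <p>A minimal sketch of this “ask for help when uncertain” behavior is given below. The function name, the dictionary of confidences and the threshold value are all hypothetical, and the sketch assumes that the reported confidences are well calibrated, which is itself a hard problem.</p>
      <preformat><![CDATA[
```python
def decide_or_defer(action_confidences, threshold=0.99):
    """Return the most confident action, or None to signal 'ask for help'.

    action_confidences maps each candidate action to the system's own estimated
    probability that the action is correct (assumed to be well calibrated).
    """
    best_action = max(action_confidences, key=action_confidences.get)
    if action_confidences[best_action] < threshold:
        return None          # great uncertainty: defer to a human or another system
    return best_action

confident = decide_or_defer({"recommend": 0.999, "skip": 0.001})
uncertain = decide_or_defer({"recommend": 0.6, "skip": 0.4})
```
]]></preformat>
      <p>Here the first call returns an action, while the second returns None and would trigger a request for help.</p>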
      <p>
        Now, even if Alice is well-designed, she will only be an
effective optimization algorithm. Unfortunately, this is no
guarantee of safety or alignment. Typically, because of
humans’ well-known addiction to echo chambers
        <xref ref-type="bibr" rid="ref27">(Haidt 2012)</xref>
        ,
a watch-time maximization YouTube recommender AI may
amplify filter bubbles, which may lead to worldwide
geopolitical tensions. Both misaligned and unaligned AIs will
likely lead to very undesirable consequences.
      </p>
      <p>
        In fact,
        <xref ref-type="bibr" rid="ref14">(Bostrom 2014)</xref>
        even argues that, to best reach its
goals, any sufficiently strategic AI will likely first aim at
so-called instrumental goals, e.g. gaining vastly more resources
and guaranteeing self-preservation. But this is very unlikely
to be in humans’ best interests. In particular, it will likely
motivate the AI to undertake actions that we would not
regard as desirable.
      </p>
      <p>To make sure that Alice will want to behave as we want
her to, it seems critical to at least partially control the
observed state s<sub>t+1</sub> or the reward R<sub>t+1</sub>. Note that this is
similar to the way children are taught to behave. We do so by
exposing them to specific observed states, by punishing them
when the sequence (s<sub>t</sub>, a<sub>t</sub>, s<sub>t+1</sub>) is undesirable, and by
rewarding them when the sequence (s<sub>t</sub>, a<sub>t</sub>, s<sub>t+1</sub>) is desirable.</p>
      <p>Whether or not Alice’s observed state s<sub>t</sub> is constrained,
her rewards R<sub>t</sub> are clearly critical. They are her incentives,
and will thus determine her decision-making. Unfortunately,
determining the adequate rewards R<sub>t</sub> to be given to Alice is
an extremely difficult problem. It is, in fact, the key to
alignment. Our roadmap to solve it identifies 4 key steps
embodied by Erin, Dave, Charlie and Bob.</p>
    </sec>
    <sec id="sec-4">
      <title>Erin’s data collection problem</title>
      <p>In order to do good, it is evidently crucial to be given a lot of
reliable data. Indeed, even the most brilliant mind will be
unable to know anything about the world if it does not have any
data from that world. This is particularly true when the goal
is to undertake desirable actions, or to make sure that one’s
action will not have potentially catastrophic consequences.</p>
      <p>Evidently, much data is already available on the Internet.
It is likely that any large-scale AI will have access to the
Internet, as is already the case of the Facebook recommender
system. However, it is important to take into account the
fact that the data on the Internet is not always fully reliable.
It may be full of fake news, fraudulent entries, misleading
videos, hacked posts and corrupted files.</p>
      <p>
        It may then be relevant to invest in more reliable and
relevant data collection. This would be Erin’s job. Typically,
Erin may want to collect economic metrics to better assess
needs. Recently, it has been shown that satellite images
combined with deep learning make it possible to compute all sorts of
useful economic indicators
        <xref ref-type="bibr" rid="ref35">(Jean et al. 2016)</xref>
        , including poverty
risks and agricultural productivity. It is possible that the use
of still more sensors can further increase our capability to
improve living standards, especially in developing countries.
      </p>
      <p>
        To guarantee the reliability of such data, cryptographic
and distributed computing solutions are likely to be
useful as well, as they already are on the web. In
particular, distributed computing, combined with recent
Byzantine-resilient consensus algorithms like Blockchain
        <xref ref-type="bibr" rid="ref47">(Nakamoto
2008)</xref>
        or Hashgraph
        <xref ref-type="bibr" rid="ref10">(Baird 2016)</xref>
        , could guarantee the
reliable storage and traceability of critical information.
      </p>
      <p>
        Note though that such data collection mechanisms could
pose major privacy issues. It is a major current challenge
to balance the usefulness of collected data and the privacy
violation they inevitably cause. Some possible solutions
include differential privacy
        <xref ref-type="bibr" rid="ref17">(Dwork, Roth, and others 2014)</xref>
        , or
weaker versions like generative-adversarial privacy
        <xref ref-type="bibr" rid="ref32">(Huang
et al. 2017)</xref>
        . It could also be possible to combine these with
more cryptographic solutions, like homomorphic encryption
or multi-party computation. It is interesting that such
cryptographic solutions may be (essentially) provably robust to
any attacker, including a superintelligence<sup>2</sup>.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Dave’s world model problem</title>
      <p>Unfortunately, raw data are usually extremely messy,
redundant, incomplete, unreliable, poisoned and even hacked. To
tackle these issues, it is necessary to infer the likely actual
states of the world, given Erin’s collected data. This will be
Dave’s job.</p>
      <p>
        The overarching principle of Dave’s job is probably going
to be some deep representation learning. This corresponds
to determining low-dimensional representations of
high-dimensional data. This basic idea has given rise to today’s
most promising unsupervised machine learning algorithms,
e.g. word vectors
        <xref ref-type="bibr" rid="ref45">(Mikolov et al. 2013)</xref>
        , autoencoders
        <xref ref-type="bibr" rid="ref39">(Liou,
Huang, and Yang 2008)</xref>
        and generative adversarial
networks (GANs) (Goodfellow et al. 2014).
      </p>
      <p><sup>2</sup>The possible use of quantum computers may require
post-quantum cryptography.</p>
      <p>
        Given how crucial it is for Dave to have an unbiased
representation of the world, much care will be needed to make
sure that Dave’s inference will account for selection biases. For
instance, when asked to provide images of CEOs, Google
Image may return a greater ratio of male CEOs than the
actual ratio. More generally, such biases can be regarded as
instances of Simpson’s paradox
        <xref ref-type="bibr" rid="ref55">(Simpson 1951)</xref>
        , and boil
down to the saying “correlation is not causation”. It seems
crucial that Dave does not fall into this trap.
      </p>
      <p>
        In fact, data can be worse than unintentionally misleading.
Given how influential Alice may be, there will likely be great
incentives for many actors to bias Erin’s data gathering, and
to thus fool Dave. This is known as poisoning attacks
        <xref ref-type="bibr" rid="ref12 ref16 ref16 ref43">(Blanchard et al. 2017; Mhamdi, Guerraoui, and Rouault 2018;
Damaskinos et al. 2018)</xref>
        . It seems extremely important that
Dave anticipate the fact that the data he was given may be
purposely biased, if not hacked. Like any good journalist,
Dave will likely need to cross-check information from different
sources to infer the most likely states of the world.
      </p>
      <p>
        This inference approach is well captured by the Bayesian
paradigm
        <xref ref-type="bibr" rid="ref15 ref30 ref31">(Hoang 2018b)</xref>
        . In particular, Bayes rule is
designed to infer the likely causes of the observed data D.
These causes can also be regarded as theories T (and such
theories may assume that some of the data were hacked).
Bayes rule tells us that the reliability of theory T given data
D can be derived formally by the following computation:
      </p>
      <p><disp-formula><tex-math>\mathbb{P}[T \mid D] = \frac{\mathbb{P}[D \mid T]\,\mathbb{P}[T]}{\mathbb{P}[D]}.</tex-math></disp-formula></p>
      <p>One typical instance of Dave’s job is the problem of
inferring global health from a wide variety of collected data. This
is what has been done by (Institute for Health Metrics and
Evaluation (IHME), University of Washington 2016), using
a sophisticated Bayesian model that reconstructed the likely
causes of deaths in countries where data were lacking.</p>
      <p>Importantly, Bayes rule also tells us that we should not
fully believe any single theory. This simply corresponds to
saying that data can often be interpreted in many different
mutually incompatible manners. It seems important to
reason with all possible interpretations rather than isolating a
single interpretation that may be flawed.</p>
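      <p>As a worked illustration of Bayes rule over several competing theories, the sketch below computes P[T | D] for a finite set of theories, one of which assumes that the data were hacked. The theory names and all numbers are hypothetical and chosen only to make the arithmetic easy to follow.</p>
      <preformat><![CDATA[
```python
def posterior(priors, likelihoods):
    """Bayes rule over a finite set of theories.

    priors[T]      = P[T]
    likelihoods[T] = P[D | T] for the observed data D
    Returns a dict mapping each theory T to P[T | D].
    """
    evidence = sum(priors[t] * likelihoods[t] for t in priors)   # P[D]
    if evidence == 0:
        raise ValueError("the data has zero probability under every theory")
    return {t: priors[t] * likelihoods[t] / evidence for t in priors}

# Hypothetical numbers, for illustration only.
priors = {"report_is_accurate": 0.7, "sensor_is_faulty": 0.2, "data_was_hacked": 0.1}
likelihoods = {"report_is_accurate": 0.1, "sensor_is_faulty": 0.3, "data_was_hacked": 0.9}

beliefs = posterior(priors, likelihoods)
for theory, probability in beliefs.items():
    print(f"{theory}: {probability:.3f}")
```
]]></preformat>
      <p>With these numbers, the credence is spread over all three theories and none of them is fully believed, even though the least likely theory a priori becomes the most likely one after seeing the data.</p>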
      <p>When the space of possible states of the world is large,
which will surely be the case for Dave, it is often
computationally intractable to reason with the full posterior
distribution <inline-formula><tex-math>\mathbb{P}[T \mid D]</tex-math></inline-formula>. Instead, Bayesian methods often
sample from the posterior distribution to identify a reasonable
number of good interpretations of the data. These sampling
methods include Monte Carlo methods, as well as
Markov chain Monte Carlo (MCMC) ones.</p>
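      <p>To give a flavor of such sampling, here is a minimal random-walk Metropolis sampler, one of the simplest MCMC methods. It is a sketch under strong simplifying assumptions: a theory is reduced to a single real number, and the toy target (an unnormalized standard normal) is hypothetical. Dave’s actual posterior would be vastly more complex.</p>
      <preformat><![CDATA[
```python
import math
import random

def metropolis(log_unnormalized_posterior, start, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis sampler for a one-dimensional theory parameter.

    Only an unnormalized log-posterior is needed, so P[D] never has to be computed.
    """
    rng = random.Random(seed)
    current = start
    current_logp = log_unnormalized_posterior(current)
    samples = []
    for _ in range(n_samples):
        proposal = current + rng.gauss(0.0, step)            # symmetric proposal
        proposal_logp = log_unnormalized_posterior(proposal)
        # Accept with probability min(1, posterior(proposal) / posterior(current)).
        log_ratio = proposal_logp - current_logp
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            current, current_logp = proposal, proposal_logp
        samples.append(current)
    return samples

def toy_log_posterior(theory):
    """Hypothetical unnormalized log-posterior: that of a standard normal."""
    return -0.5 * theory * theory

samples = metropolis(toy_log_posterior, start=0.0, n_samples=20000)
posterior_mean = sum(samples) / len(samples)
```
]]></preformat>
      <p>The list of samples plays the role of a reasonable number of good interpretations of the data: expectations under the posterior are approximated by averages over the samples.</p>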
      <p>In some sense, Dave’s job can be regarded as writing a
compact report of all likely states of the world, given Erin’s
collected data. It is an open question as to what language
Dave’s report will be in. It might be useful to make it
understandable by humans. But it might be too costly as well.
Indeed, Dave’s report might be billions of pages long. It could
be unreasonable or undesirable to make it humanly readable.</p>
      <p>Note also that Erin and Dave are likely to gain
cognitive capabilities over time. It is surely worthwhile to
anticipate the complexification of Erin’s data and of Dave’s
world models. It seems unclear so far how to do so. Some
high-level (purely descriptive) language to describe world
models is probably needed. In addition, this high-level
language may need to be flexible enough to be reshaped and
redesigned over time. This may be dubbed the world
description problem. It is arguably still a very open and uncharted
area of research.</p>
    </sec>
    <sec id="sec-6">
      <title>Charlie’s desirability learning problem</title>
      <p>
        Given any of Dave’s world models, Charlie’s job will then
be to compute how desirable this world model is. This is
the desirability learning problem
        <xref ref-type="bibr" rid="ref60">(Soares 2016)</xref>
        , also known
as value learning<sup>3</sup>. This is the problem of assigning
desirability scores to different world models. These desirability
scores can then serve as the basis for any agent to determine
beneficial actions.
      </p>
      <p>Unfortunately, determining what, say, the median human
considers desirable is an extremely difficult problem. But
again, it should be stressed that we should not aim at
deriving an ideal inference of what people desire. This is likely
to be a hopeless endeavor. Rather, we should try our best
to make sure that Charlie’s desirability scores will be good
enough to avoid catastrophic outcomes, e.g. world
destruction, global suffering or major discrimination.</p>
      <p>
        One proposed solution to infer human preferences is
so-called inverse reinforcement learning
        <xref ref-type="bibr" rid="ref19 ref33 ref49 ref51">(Ng, Russell, and
others 2000; Evans, Stuhlmüller, and Goodman 2016)</xref>
        .
Assuming that humans perform reinforcement learning to choose
their actions, and given examples of actions taken by
humans in different contexts, inverse reinforcement learning
infers what were the humans’ likely implicit rewards that
motivated their decision-making. Assuming we can
somehow separate humans’ selfish rewards from altruistic ones,
inverse reinforcement learning seems to be a promising first
step towards inferring humans’ preferences from data. There
are, however, many important considerations to be taken into
account, which we discuss below.
      </p>
      <p>First, it is important to keep in mind that, despite Dave’s
effort and because of Erin’s limited and possibly biased data
collection, Dave’s world model is fundamentally uncertain.
In fact, as discussed previously, Dave would probably rather
present a distribution of likely world models. Charlie’s job
should be regarded as a scoring of all such likely world
models. In particular, she should not assign a single number to
the current state of the world, but, rather, a distribution of
likely scores of the current state of the world. This
distribution should convey the uncertainty about the actual state of
the world. Besides, as we shall see, this uncertainty is likely
to be crucial for Bob to choose incentive-compatible rewards
for Alice adequately.</p>
      <p>Another challenging aspect of Charlie’s job will be to
provide a useful representation of potential human
disagreements about the desirability of different states of the world.
Humans’ preferences are diverse and may never converge.
This should not be swept under the rug. Instead, we need to
agree on some way to mitigate disagreement.</p>
      <p><sup>3</sup>To avoid raising eyebrows, we shall try to steer away from
polarizing terms like values, morals or ethics.</p>
      <p>
        This is known as a social choice problem. In its general
form, it is the problem of aggregating the preferences of a
group of disagreeing people into a single preference for the
whole group that, in some sense, fairly well represents the
individuals’ preferences. Unfortunately, social choice theory
is plagued with impossibility results, e.g. Arrow’s theorem
        <xref ref-type="bibr" rid="ref6">(Arrow 1950)</xref>
        or the Gibbard-Satterthwaite theorem
        <xref ref-type="bibr" rid="ref21 ref54">(Gibbard 1973; Satterthwaite 1975)</xref>
        . Again, we should not be too
demanding regarding the properties of our preference
aggregation. This is, in fact, the path taken by social choice theory,
e.g. by proposing randomized solutions to preserve some
desirable properties
        <xref ref-type="bibr" rid="ref29">(Hoang 2017)</xref>
        .
      </p>
      <p>
        One particular proposal, known as majority judgment
        <xref ref-type="bibr" rid="ref11 ref36">(Balinski and Laraki 2011)</xref>
        , may be of particular interest to
us here. Its basic idea is to choose some deciding quantile
q ∈ [0, 1] (often taken to be q = 1/2). Then, for any
possible state of the world, consider all individuals’ desirability
scores for that state. This yields a distribution of humans’
preferences for the state of the world. Majority judgment
then concludes that the group’s score is the quantile q of
this distribution. If q = 1/2, this corresponds to the score
chosen by the median individual of the group.
      </p>
      <p>Now, to avoid the oppression of some minority by a
majority, it might be relevant to choose a small value of q, say
q = 0.1. This would mean that Charlie’s scoring of a state
of the world will be less than a number <italic>score</italic>, if more than
10% of the people believe that this state should be given a
score less than <italic>score</italic>. But evidently, this point is very much
debatable. It seems unclear so far how to best choose q.</p>
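      <p>The quantile rule can be sketched in a few lines of Python. This is only an illustration: the function name and the example scores are hypothetical, and the sketch assumes one particular convention for quantiles of a finite sample (the lower quantile), which the text does not specify.</p>
      <preformat><![CDATA[
```python
import math

def majority_judgment_score(individual_scores, q=0.5):
    """Group score taken as the quantile q of the individuals' scores.

    Convention assumed here: the lower quantile, i.e. the smallest score s such
    that at least a fraction q of individuals gave a score of at most s.
    """
    if not individual_scores:
        raise ValueError("at least one individual score is required")
    if not 0 <= q <= 1:
        raise ValueError("q must lie in [0, 1]")
    ordered = sorted(individual_scores)
    # round() guards against floating-point noise in q * n before taking the ceiling.
    index = max(math.ceil(round(q * len(ordered), 9)) - 1, 0)
    return ordered[index]

# Hypothetical desirability scores given by five individuals to one state of the world.
scores = [2, 9, 7, 8, 6]
median_score = majority_judgment_score(scores, q=0.5)     # score of the median individual
cautious_score = majority_judgment_score(scores, q=0.1)   # dominated by the least satisfied
```
]]></preformat>
      <p>With q = 1/2 the group score is that of the median individual, whereas with q = 0.1 and only five individuals it is the lowest score given, which shows how a small q gives more weight to the least satisfied.</p>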
      <p>
        While majority judgment seems to be a promising
approach, it does raise the question of how to compare two
different individuals’ scores. It is not clear that score = 5 given
by John has a meaning comparable to Jane’s score = 5. In
fact, according to a theorem by von Neumann and
Morgenstern
        <xref ref-type="bibr" rid="ref48">(Neumann and Morgenstern 1944)</xref>
        , within their
framework, utility functions are only defined up to a positive affine
transformation. More work is probably needed to determine
how to scale different individuals’ utility functions
appropriately, despite previous attempts in special cases
        <xref ref-type="bibr" rid="ref28 ref33 ref51">(Hoang,
Soumis, and Zaccour 2016)</xref>
        . Again, it should be stressed that
we should not aim at an ideal solution; a workable
reasonable solution is much better than no solution at all.
      </p>
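      <p>The affine invariance can be made explicit. In the notation below, which is introduced for this illustration only, u is one individual’s utility function, L<sub>1</sub> and L<sub>2</sub> are lotteries over states of the world, and a and b are arbitrary constants with a positive. Since the expectation of a u + b equals a times the expectation of u, plus b, we get:</p>
      <p><disp-formula><tex-math>u' = a\,u + b,\ a > 0 \quad\Longrightarrow\quad \Big( \mathbb{E}_{L_1}[u'] \ge \mathbb{E}_{L_2}[u'] \iff \mathbb{E}_{L_1}[u] \ge \mathbb{E}_{L_2}[u] \Big)</tex-math></disp-formula></p>
      <p>Both u and u' thus represent exactly the same preferences, so a raw score of 5 carries no interpersonal meaning until some convention fixes a and b for each individual.</p>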
      <p>
        Now, arguably, humans’ current preferences are almost
surely undesirable. Indeed, over the last decades,
psychology has been showing again and again that human
thinking is full of inconsistencies, fallacies and cognitive biases
        <xref ref-type="bibr" rid="ref36">(Kahneman 2011)</xref>
        . We tend to first have instinctive reactions
to stories or facts
        <xref ref-type="bibr" rid="ref13">(Bloom 2016)</xref>
        , which quickly become the
positions we will want to defend at all costs
        <xref ref-type="bibr" rid="ref27">(Haidt 2012)</xref>
        .
Worse, we are unfortunately largely unaware of why we
believe or want what we believe or want. This means that our
current preferences are unlikely to be what we would prefer,
if we were more informed, thought more deeply, and tried to
make sure our preferences were as well-founded as possible.
      </p>
      <p>And arguably, we should prefer what we would prefer to
prefer, rather than what we instinctively prefer. Typically,
one might prefer to watch a cat video, even though one might
prefer to prefer mathematics videos over cat videos.
Desirability scores should arguably encode what we would prefer
to prefer, rather than what we instinctively prefer.</p>
      <p>To understand this, a thought experiment may be useful. Let
us imagine better versions of us. Each current me is thereby
associated with a me++. A me++ is what current me would
desire, if current me were smarter, thought much longer
about what he finds desirable, and analyzed all imaginable
data of the world. Arguably, me++’s desirability score is
“more right” than current me’s.</p>
      <p>This can be illustrated by the fact that past standards are
often no longer regarded as desirable. Our intuitions about
the desirability of slavery, homosexuality and gender
discrimination have been completely upended over the last
century, if not over the last few decades. It seems unlikely that
all of our other intuitions will never change. In particular,
it seems unlikely that me++ will fully agree with current
me. And it seems reasonable to argue that me++ would be
“more right” than current me.</p>
      <p>
        These remarks are the basis of coherent extrapolated
volition
        <xref ref-type="bibr" rid="ref65">(Yudkowsky 2004)</xref>
        . The basic idea is that we should aim
at the preferences that future versions of ourselves would
eventually adopt, if they were vastly more informed, had
much more time to ponder what they regard as desirable,
and tried their best to be better versions of themselves. In
some sense, instead of having current me’s debate
what is desirable (which often turns into a pointless debacle),
we should let me++’s debate. In fact, since me++’s
supposedly already know everything about other me++’s, there is
actually no point in getting them to debate. It suffices to
aggregate their preferences through some social choice
mechanism. This is the preference aggregation problem.
      </p>
      <p>It is noteworthy that we clearly have epistemic uncertainty
about me++’s. Determining me++’s desirability scores
may be called the coherent extrapolated individual volition
problem. Interestingly, this is (mostly) a prediction problem.
But it is definitely too ambitious to predict them with
absolute certainty. Bayes rule tells us that we should rather
describe these desirability scores by a probability
distribution of likely desirability scores.</p>
      <p>
        Such scores could also be approximated using a large
number of proxies, as is done by boosting methods
        <xref ref-type="bibr" rid="ref4">(Arora,
Hazan, and Kale 2012)</xref>
        . The use of several proxies could
avoid the overfitting of any proxy. Typically, rather than
relying solely on DALYs
        <xref ref-type="bibr" rid="ref50 ref63">(Organization and others 2009)</xref>
        , we
probably should invoke machine learning methods to
combine a large number of similar metrics, especially those that
aim at describing other desirable economic metrics, like the
human development index (HDI) or gross national happiness
(GNH). Still another approach may consist of analyzing
“typical” human preferences, e.g. by using collaborative
filtering techniques
        <xref ref-type="bibr" rid="ref52">(Ricci, Rokach, and Shapira 2015)</xref>
        .
Evidently, much more research is needed along these lines.
      </p>
      <p>Computing the desirability of a given world state is
Charlie’s job. In some sense, Charlie’s job would thus be to
remove cognitive biases from our intuitive preferences, so that
they still basically reflect what we really regard as
preferable, but in a more coherent and informed manner. This is an
incredibly difficult problem, which will likely take decades
to sort out reasonably well. This is why it is of the utmost
importance that it be started as soon as possible. Let us try
our best to describe, informally and formally, what better
versions of ourselves would likely regard as desirable. Let
us try to predict the volition of me++’s.</p>
      <p>This attempt is likely going to be shocking to us all.
Indeed, we should expect that better versions of ourselves will
find desirable things that the current versions of ourselves
find repellent. Unfortunately though, we humans tend to
react poorly to disagreeing judgments. And this is likely to hold
even when the opposition comes from our better selves. This poses
a great scientific and engineering challenge. How can one be
best convinced of the judgments that he or she will
eventually embrace but does not yet? In other words, how can we
quickly agree with better versions of ourselves? What could
someone else say to get me closer to my me++? This may
be dubbed the individual improvement problem.</p>
      <p>
        To address this issue,
        <xref ref-type="bibr" rid="ref15 ref34">(Irving, Christiano, and Amodei
2018)</xref>
        have discussed the possibility of setting up a debate
between opposing AIs. In particular, they asked whether a
human judge would be able to lean towards the better AI
for the right reasons. Interestingly, such a debate might
allow for significantly more powerful “proofs of superiority”
than monologues, at least if the analogy with the so-called
polynomial hierarchy of complexity theory holds.
      </p>
      <p>This question is critical for alignment as it will likely be
a key challenge to build trust in the systems we design. But
evidently, this is a more general question that should be of
interest to anyone who desires to do good.</p>
    </sec>
    <sec id="sec-7">
      <title>Bob’s incentive design</title>
      <p>The last piece of the jigsaw is Bob’s job. Bob is in charge of
computing the rewards that Alice will receive, based on the
work of Erin, Dave and Charlie. Evidently he could simply
compute the expectation of Charlie’s scores for the likely
states of the world. But this is probably a bad idea, as it
opens the door to reward hacking.</p>
      <p>Recall that Alice’s goal is to maximize her discounted
expected future rewards. But given that Alice knows (or is
likely to eventually guess) how her rewards are computed,
instead of undertaking the actions that we would want her
to, Alice could hack Erin, Dave or Charlie’s computations,
so that such hacked computations yield large rewards. This
is sometimes called the wireheading problem.</p>
      <p>Since all this computation starts with Erin’s data
collection, one way for Alice to increase her rewards would be to
feed Erin with fake data that will make Dave infer a deeply
flawed state of the world, which Charlie may regard as ideal.
Worse, Alice may then find out that the best way to do so
would be to invest all of Earth’s resources into
misleading Erin, Dave and Charlie. This could potentially be
extremely bad for mankind. Indeed, especially if Alice cares
about discounted future rewards, she might eventually
regard mankind as a possible threat to her objective.</p>
      <p>This is why it is of the utmost importance that Alice’s
incentives be (partially) aligned with Erin, Dave and Charlie
performing well and being accurate. This will be Bob’s job.
Bob will need to make sure that, while Alice’s rewards do
correlate with Charlie’s scores, they also give Alice the
incentives to guarantee that Erin, Dave and Charlie perform as
reliably as possible the job they were given.</p>
      <p>
        In fact, it even seems desirable that Alice be incentivized
to constantly upgrade Erin, Dave and Charlie for the
better. Ideally, she would even want them to be
computationally more powerful than herself, especially in the long run.
This approach would bear resemblance to the idea of
self-nudge
        <xref ref-type="bibr" rid="ref63">(Thaler and Sunstein 2009)</xref>
        . This corresponds to
strategies that we humans sometimes use to nudge ourselves
(or others) into doing what we want to want to do, rather
than what our latest emotion or laziness invites us to do.
      </p>
      <p>Unfortunately, it seems unclear how Bob can best make
sure that Alice has such incentives. Perhaps a good idea is to
penalize Dave’s reported uncertainty about the likely states
of the world. Typically, Bob should make sure Alice’s
rewards are affected by the reliability of Erin’s data. The more
reliable Erin’s data, the larger Alice’s rewards. Similarly,
when Dave or Charlie feel that their computations are
unreliable, Bob should take note of this and adjust Alice’s rewards
accordingly to motivate Alice to provide larger resources for
Charlie’s computations.</p>
      <p>Now, Bob should also balance the desire to retrieve more
reliable data and perform more trustworthy computations
against the fact that such efforts will necessarily require the
exploitation of more resources, probably at the expense of
Charlie’s scores. It is this non-trivial trade-off that Bob will
need to take care of.</p>
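      <p>As a purely illustrative sketch of this trade-off, and not a proposed design, Bob’s reward could be written as Charlie’s expected score, minus a penalty on Dave’s reported uncertainty, plus a bonus for the reliability of Erin’s data. The names <monospace>Report</monospace> and <monospace>bob_reward</monospace>, the linear form and the two weights are assumptions made for this example only; the cost of the resources spent on better data is assumed to be already reflected in Charlie’s score.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    """Hypothetical summary of what Erin, Dave and Charlie send to Bob."""
    expected_score: float     # Charlie's expected desirability score
    state_uncertainty: float  # Dave's reported uncertainty, assumed >= 0
    data_reliability: float   # Erin's data reliability, assumed in [0, 1]


def bob_reward(report: Report,
               uncertainty_weight: float = 1.0,
               reliability_weight: float = 1.0) -> float:
    """Illustrative linear reward; the form and weights are assumptions.

    The reward grows with Charlie's score and Erin's reliability, and
    shrinks with Dave's uncertainty, so that misleading or degrading
    Erin, Dave or Charlie is not free for Alice.
    """
    return (report.expected_score
            - uncertainty_weight * report.state_uncertainty
            + reliability_weight * report.data_reliability)
```
]]></preformat>
      <p>Under these assumptions, a larger uncertainty weight pushes Alice towards investing in Erin, Dave and Charlie, while a smaller one lets Charlie’s scores dominate. How to choose such weights is precisely the open trade-off described above.</p>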
      <p>
        Bob’s work might be simplified by some (partial) control
of Alice’s action or world model. Although it seems unclear
so far how, techniques like interactive proofs (IP)
        <xref ref-type="bibr" rid="ref23 ref9">(Babai
1985; Goldwasser, Micali, and Rackoff 1989)</xref>
        or
probabilistically checkable proofs (PCP)
        <xref ref-type="bibr" rid="ref3">(Arora et al. 1998)</xref>
        might be
useful to force Alice to prove her correct behavior. By
making large rewards conditional on such proofs, Bob might be
able to incentivize Alice’s transparency. All such
considerations make up Bob’s incentive problem.
      </p>
      <p>
        It may or may not be useful to enable Bob to switch off
Alice. It should be stressed though that (safe) interruptibility
is nontrivial, as discussed by
        <xref ref-type="bibr" rid="ref12 ref18 ref19 ref25 ref26 ref33 ref33 ref42 ref51 ref51">(Orseau and Armstrong 2016;
El Mhamdi et al. 2017; Martin, Everitt, and Hutter 2016;
Hadfield-Menell et al. 2016a; 2016b; Wängberg et al. 2017)</xref>
        among others. In fact, safe interruptibility seems to require
very specific circumstances, e.g. Alice being indifferent to
interruption, Alice being programmed to be suicidal in case
of potential harm or Alice having more uncertainty about
her rewards than Bob being able to take over Alice’s job. It
seems unclear so far how relevant such circumstances will
be to Bob’s control problem over Alice<sup>4</sup>. Besides, instead of
interrupting Alice, Bob might prefer to guide Alice towards
preferable actions by acting on Alice’s rewards.
      </p>
      <p>On another note, it may be computationally more efficient
for all if, instead of merely transmitting a reward, Bob also
feeds Alice with “backpropagating signals”, that is,
information not about the reward itself, but about its gradient with
respect to key variables, e.g. Charlie’s score or Erin’s
reliability. Having said this, we leave open the technical question
of how to best design this.</p>
      <p><sup>4</sup>Note though that this may be very relevant assuming that there
are several Alices, as will be proposed later on.</p>
    </sec>
    <sec id="sec-8">
      <title>Decentralization</title>
      <p>We have decomposed alignment into 5 components for the
sake of exposition. However, any component will likely have
to be decentralized to gain reliability and scalability. In other
words, instead of having a single Alice, a single Bob, a
single Charlie, a single Dave and a single Erin, it seems
crucial to construct multiple Alices, Bobs, Charlies, Daves and
Erins.</p>
      <p>
        This is key to crash-tolerance. Indeed, a single
computer doing Bob’s job could crash and leave Alice
without reward or penalty. But if Alice’s rewards are an
aggregate of rewards given by a large number of Bobs, then
even if some of the Bobs crash, Alice’s rewards will remain
mostly the same. But crash-tolerance is likely to be
insufficient. Instead, we should design Byzantine-resilient
mechanisms, that is, mechanisms that still perform correctly
despite the presence of hacked or malicious Bobs. Estimators
with high statistical breakdown points
        <xref ref-type="bibr" rid="ref40">(Lopuhaa, Rousseeuw, and
others 1991)</xref>
        , e.g. (geometric) medians and variants
        <xref ref-type="bibr" rid="ref12">(Blanchard et al. 2017)</xref>
        , may be useful for this purpose.
      </p>
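      <p>To illustrate the difference between averaging and a high-breakdown estimator, here is a minimal sketch for scalar rewards, assuming that strictly fewer than half of the Bobs are faulty. The function name <monospace>aggregate_rewards</monospace> and the numerical values are hypothetical; for vector-valued signals, a geometric median or one of its variants would replace the scalar median used here.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from statistics import mean, median


def aggregate_rewards(rewards):
    """Aggregate the rewards reported by several Bobs with a median.

    The median tolerates arbitrary values from a minority of Bobs,
    whereas a single malicious Bob can move the mean arbitrarily far.
    """
    if not rewards:
        raise ValueError("at least one reward is required")
    return median(rewards)


# Hypothetical reports: five honest Bobs and two hacked ones.
honest = [0.9, 1.0, 1.1, 1.0, 0.95]
byzantine = [1e9, 1e9]

robust = aggregate_rewards(honest + byzantine)  # stays among honest values
naive = mean(honest + byzantine)                # dominated by the attackers

# Crash-tolerance: if two honest Bobs crash, the aggregate barely moves.
after_crash = aggregate_rewards(honest[:3])
```
]]></preformat>
      <p>In this toy example the median remains within the range of the honest reports, while the mean is driven almost entirely by the two hacked Bobs.</p>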
      <p>Evidently, in this Byzantine environment, cryptography,
especially (post-quantum?) cryptographic signatures and
hashes, is likely to play a critical role. Typically, Bobs’
rewards will likely need to be signed. More generally, the
careful design of secure communication channels between
the components of the AIs seems key. This may be called
the secure messaging problem.</p>
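      <p>As a rough sketch of what an authenticated reward message could look like, the example below tags each reward with an HMAC from the Python standard library. This is an assumption made for brevity: an HMAC is a symmetric message authentication code, not a public-key (let alone post-quantum) signature, and a real system would need a proper signature scheme. The message fields and the function names <monospace>sign_reward</monospace> and <monospace>verify_reward</monospace> are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import hashlib
import hmac
import json


def sign_reward(key: bytes, bob_id: str, step: int, reward: float) -> dict:
    """Serialize a reward and attach an HMAC-SHA256 tag (stand-in for a signature)."""
    payload = json.dumps(
        {"bob": bob_id, "step": step, "reward": reward},
        sort_keys=True,  # deterministic serialization before tagging
    ).encode("utf-8")
    tag = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return {"payload": payload, "tag": tag}


def verify_reward(key: bytes, message: dict) -> bool:
    """Return True only if the tag matches the payload under the shared key."""
    expected = hmac.new(key, message["payload"], hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking information through timing.
    return hmac.compare_digest(expected, message["tag"])
```
]]></preformat>
      <p>Alice, or an aggregator placed in front of her, would then discard any reward whose tag fails verification before the rewards are combined.</p>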
      <p>
        Another difficulty is the addition of more powerful and
precise Bobs, Charlies, Daves and Erins to the pipeline. It
is not yet clear how to best integrate reliable newcomers,
especially given that such newcomers may be malicious. In
fact, they may want to first act benevolently to gain
admission. But once they are numerous enough, they could take
over the pipeline and, say, feed Alice with infinite rewards.
This is the upgrade problem, which was recently discussed
by
        <xref ref-type="bibr" rid="ref15 ref34">(Christiano, Shlegeris, and Amodei 2018)</xref>
        who proposed
using numerous weaker AIs to supervise stronger AIs. More
research in this direction is probably needed.
      </p>
      <p>Now, in addition to reliability, decentralization may also
enable different Alices, Bobs, Charlies, Daves and Erins to
focus on specific tasks. This would allow different problems
to be separated, which could lead to more optimized solutions
at lower costs. To this end, it may be relevant to adapt
different Alices’ rewards to their specific tasks. Note though that
this could also be a problem, as Alices may enter into
competition with one another, as in the prisoner’s dilemma. We
may call it the specialization problem. Again, there seems to
be a lot of new research needed to address this problem.</p>
      <p>Another open question is the extent to which AIs should
be exposed to Bobs’ rewards. Typically, if a small company
creates its own AI, to what extent should this AI be aligned?
It should be noted that this may be computationally very
costly, as it may be hard to separate the signal of interest
to the AI from the noise of Bobs’ rewards. Intuitively, the
more influential an AI is, the more it should be influenced
by Bobs’ rewards. But even if this AI is small, it may be
important to demand that it be influenced by Bobs to avoid any
diffusion of responsibility, i.e. many small AIs that disregard
safety concerns on the ground that they each hardly have any
global impact on the world.</p>
      <p>What makes this nontrivial is that any AI may gain
capability and influence over time. An unaligned weak AI
could eventually become an unaligned human-level AI. To
avoid this, even basic, but potentially unboundedly
self-improving<sup>5</sup> AIs should be given at least a seed of alignment,
which may grow as AIs become more powerful. More
generally, AIs should strike a balance between some original
(possibly unaligned) objective and the importance they give
to alignment. This may be called the alignment burden
assignment problem.</p>
      <p>Figure 2 recapitulates our complete roadmap.</p>
    </sec>
    <sec id="sec-9">
      <title>Non-technical challenges</title>
      <p>Given the difficulty of alignment, its resolution will surely
require solving a large number of non-technical challenges
as well. We briefly mention some of them here.</p>
      <p>Perhaps most important is the lack of respectability that
is sometimes associated with this line of research. For
alignment to be solved, it needs to gain respectability from the
scientific community, and perhaps beyond this community
as well. This is why it seems to be of the utmost importance
that discussions around alignment be carried out carefully to
avoid confusion.</p>
      <p>Evidently, alignment needs much more
manpower, which will require funding and recruiting. It seems
particularly important to attract mathematical talents
towards this line of work. This evidently also raises the
challenge of training as many brilliant minds as possible.</p>
      <p>Finally, questions around AI, AI safety and moral
philosophy are sadly often poorly debated. There often is a lot of
overconfidence, and a lack of well-founded reasoning. For
alignment research to gain momentum, it seems crucial to
make debating more informative, respectful and stimulating.</p>
    </sec>
    <sec id="sec-10">
      <title>Conclusion</title>
      <p>This paper discussed the alignment problem, that is, the
problem of aligning the goals of AIs with human
preferences. It presented a general roadmap to tackle this issue.
Interestingly, this roadmap identifies 5 critical steps, as well
as many relevant aspects of these 5 steps. In other words, we
have presented a large number of hopefully more tractable
subproblems that readers are highly encouraged to tackle.
We hope that combining the solutions to these subproblems
could help to partially address alignment. And we hope that
any reader will be able to better determine how he or she
may best contribute to the global effort<sup>6</sup>.</p>
      <p>Acknowledgment. The author would like to thank El
Mahdi El Mhamdi, Henrik Aslund, Sébastien Rouault and
Alexandre Maurer for fruitful discussions.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Amodei</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Olah</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Steinhardt</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; Christiano,
          <string-name>
            <given-names>P.</given-names>
            ;
            <surname>Schulman</surname>
          </string-name>
          , J.; and Mané,
          <string-name>
            <surname>D.</surname>
          </string-name>
          <year>2016</year>
          .
          <article-title>Concrete problems in AI safety</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <source>arXiv preprint arXiv:1606</source>
          .
          <fpage>06565</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Arora</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lund</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Motwani</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ; Sudan,
          <string-name>
            <surname>M.</surname>
          </string-name>
          ; and Szegedy,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <year>1998</year>
          .
          <article-title>Proof verification and the hardness of approximation problems</article-title>
          .
          <source>Journal of the ACM (JACM) 45</source>
          (
          <issue>3</issue>
          ):
          <fpage>501</fpage>
          -
          <lpage>555</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Arora</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Hazan</surname>
          </string-name>
          , E.; and
          <string-name>
            <surname>Kale</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2012</year>
          .
          <article-title>The multiplicative weights update method: a meta-algorithm and applications</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <source>Theory of Computing</source>
          <volume>8</volume>
          (
          <issue>1</issue>
          ):
          <fpage>121</fpage>
          -
          <lpage>164</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Arrow</surname>
            ,
            <given-names>K. J.</given-names>
          </string-name>
          <year>1950</year>
          .
          <article-title>A difficulty in the concept of social welfare</article-title>
          .
          <source>Journal of political economy 58</source>
          <volume>(4)</volume>
          :
          <fpage>328</fpage>
          -
          <lpage>346</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <article-title><sup>5</sup>In particular, nonparametric AIs should perhaps be treated differently from parametric ones</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <article-title><sup>6</sup>Please note that a more complete version of this paper is also available (Hoang 2018b)</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Babai</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <year>1985</year>
          .
          <article-title>Trading group theory for randomness</article-title>
          .
          <source>In Proceedings of the seventeenth annual ACM symposium on Theory of computing</source>
          ,
          <fpage>421</fpage>
          -
          <lpage>429</lpage>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Baird</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Hashgraph consensus: fair, fast, byzantine fault tolerance</article-title>
          .
          <source>Technical report, Swirlds Tech Report.</source>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Balinski</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Laraki</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <year>2011</year>
          .
          <article-title>Majority judgment: measuring, ranking, and electing</article-title>
          . MIT press.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Blanchard</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>El Mhamdi</surname>
            ,
            <given-names>E. M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Guerraoui</surname>
          </string-name>
          , R.; and
          <string-name>
            <surname>Stainer</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Machine learning with adversaries: Byzantine tolerant gradient descent</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          ,
          <fpage>119</fpage>
          -
          <lpage>129</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Bloom</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Against Empathy: The Case for Rational Compassion</article-title>
          . Ecco.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>Bostrom</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <year>2014</year>
          . Superintelligence: Paths, Dangers, Strategies. OUP Oxford.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>Christiano</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Shlegeris</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Amodei</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Supervising strong learners by amplifying weak experts</article-title>
          .
          <source>In review.</source>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Damaskinos</surname>
            ,
            <given-names>G.; El</given-names>
          </string-name>
          <string-name>
            <surname>Mhamdi</surname>
            ,
            <given-names>E. M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Guerraoui</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ; Patra,
          <string-name>
            <surname>R.</surname>
          </string-name>
          ; Taziki,
          <string-name>
            <surname>M.</surname>
          </string-name>
          ; et al.
          <year>2018</year>
          .
          <article-title>Asynchronous byzantine machine learning (the case of sgd)</article-title>
          .
          <source>In International Conference on Machine Learning</source>
          ,
          <fpage>1153</fpage>
          -
          <lpage>1162</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <surname>Dwork</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Roth</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ; et al.
          <year>2014</year>
          .
          <article-title>The algorithmic foundations of differential privacy</article-title>
          .
          <source>Foundations and Trends® in Theoretical Computer Science</source>
          <volume>9</volume>
          (
          <issue>3</issue>
          -4):
          <fpage>211</fpage>
          -
          <lpage>407</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <given-names>El</given-names>
            <surname>Mhamdi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. M.</given-names>
            ;
            <surname>Guerraoui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            ; Hendrikx, H.; and
            <surname>Maurer</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.</surname>
          </string-name>
          <year>2017</year>
          .
          <article-title>Dynamic safe interruptibility for decentralized multi-agent reinforcement learning</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          ,
          <fpage>130</fpage>
          -
          <lpage>140</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <surname>Evans</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ;
          <article-title>Stuhlmüller, A.; and Goodman</article-title>
          ,
          <string-name>
            <surname>N. D.</surname>
          </string-name>
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <article-title>Learning the preferences of ignorant, inconsistent agents</article-title>
          .
          <source>In AAAI</source>
          ,
          <fpage>323</fpage>
          -
          <lpage>329</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <surname>Gibbard</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>1973</year>
          .
          <article-title>Manipulation of voting schemes: a general result</article-title>
          .
          <source>Econometrica: journal of the Econometric Society</source>
          <fpage>587</fpage>
          -
          <lpage>601</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <surname>Gilmer</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Metz</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Faghri</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Schoenholz</surname>
            ,
            <given-names>S. S.</given-names>
          </string-name>
          ; Raghu,
          <string-name>
            <given-names>M.</given-names>
            ;
            <surname>Wattenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ; and
            <surname>Goodfellow</surname>
          </string-name>
          ,
          <string-name>
            <surname>I.</surname>
          </string-name>
          <year>2018</year>
          .
          <article-title>Adversarial spheres</article-title>
          .
          <source>arXiv preprint arXiv:1801.02774</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <surname>Goldwasser</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Micali</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Rackoff</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <year>1989</year>
          .
          <article-title>The knowledge complexity of interactive proof systems</article-title>
          .
          <source>SIAM Journal on computing 18</source>
          <volume>(1)</volume>
          :
          <fpage>186</fpage>
          -
          <lpage>208</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          2014.
          <article-title>Generative adversarial nets</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          ,
          <fpage>2672</fpage>
          -
          <lpage>2680</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <surname>Hadfield-Menell</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Dragan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Abbeel</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Russell</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2016a</year>
          .
          <article-title>The off-switch game</article-title>
          .
          <source>arXiv preprint arXiv:1611</source>
          .
          <fpage>08219</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name>
            <surname>Hadfield-Menell</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Russell</surname>
            ,
            <given-names>S. J.</given-names>
          </string-name>
          ; Abbeel,
          <string-name>
            <given-names>P.</given-names>
            ; and
            <surname>Dragan</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.</surname>
          </string-name>
          <year>2016b</year>
          .
          <article-title>Cooperative inverse reinforcement learning</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          ,
          <fpage>3909</fpage>
          -
          <lpage>3917</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <surname>Haidt</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2012</year>
          .
          <article-title>The righteous mind: Why good people are divided by politics and religion</article-title>
          . Vintage.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <string-name>
            <surname>Hoang</surname>
            ,
            <given-names>L. N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Soumis</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ; and Zaccour,
          <string-name>
            <surname>G.</surname>
          </string-name>
          <year>2016</year>
          .
          <article-title>Measuring unfairness feeling in allocation problems</article-title>
          .
          <source>Omega</source>
          <volume>65</volume>
          :
          <fpage>138</fpage>
          -
          <lpage>147</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <string-name>
            <surname>Hoang</surname>
            ,
            <given-names>L. N.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Strategy-proofness of the randomized condorcet voting system</article-title>
          .
          <source>Social Choice and Welfare</source>
          <volume>48</volume>
          :
          <fpage>679</fpage>
          -
          <lpage>701</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <string-name>
            <surname>Hoang</surname>
            ,
            <given-names>L. N.</given-names>
          </string-name>
          <year>2018a</year>
          .
          <article-title>A roadmap for the value-loading problem</article-title>
          . arXiv preprint arXiv:1809.01036.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <string-name>
            <surname>Hoang</surname>
            ,
            <given-names>L. N.</given-names>
          </string-name>
          <year>2018b</year>
          .
          <article-title>La formule du savoir : une philosophie unifiée du savoir fondée sur le théorème de Bayes</article-title>
          .
          <source>EDP Sciences. English translation forthcoming.</source>
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kairouz</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Sankar</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Rajagopal</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Context-aware generative adversarial privacy</article-title>
          .
          <source>Entropy</source>
          <volume>19</volume>
          (
          <issue>12</issue>
          ):
          <fpage>656</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <collab>Institute for Health Metrics and Evaluation (IHME), University of Washington</collab>
          .
          <year>2016</year>
          .
          <article-title>GBD Compare data visualization</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          <string-name>
            <surname>Irving</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Christiano</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Amodei</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>AI safety via debate</article-title>
          . arXiv preprint arXiv:1805.00899.
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          <string-name>
            <surname>Jean</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Burke</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Xie</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Davis</surname>
            ,
            <given-names>W. M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lobell</surname>
            ,
            <given-names>D. B.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Ermon</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Combining satellite imagery and machine learning to predict poverty</article-title>
          .
          <source>Science</source>
          <volume>353</volume>
          (
          <issue>6301</issue>
          ):
          <fpage>790</fpage>
          -
          <lpage>794</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          <string-name>
            <surname>Kahneman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2011</year>
          .
          <article-title>Thinking, fast and slow</article-title>
          . New York: Farrar, Straus and Giroux.
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          <string-name>
            <surname>Kramer</surname>
            ,
            <given-names>A. D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Guillory</surname>
            ,
            <given-names>J. E.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Hancock</surname>
            ,
            <given-names>J. T.</given-names>
          </string-name>
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          <article-title>Experimental evidence of massive-scale emotional contagion through social networks</article-title>
          .
          <source>Proceedings of the National Academy of Sciences</source>
          201320040.
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          <string-name>
            <surname>Liou</surname>
            ,
            <given-names>C.-Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>J.-C.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>W.-C.</given-names>
          </string-name>
          <year>2008</year>
          .
          <article-title>Modeling word perception using the Elman network</article-title>
          .
          <source>Neurocomputing</source>
          <volume>71</volume>
          (
          <issue>16-18</issue>
          ):
          <fpage>3150</fpage>
          -
          <lpage>3157</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          <string-name>
            <surname>Lopuhaa</surname>
            ,
            <given-names>H. P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Rousseeuw</surname>
            ,
            <given-names>P. J.</given-names>
          </string-name>
          ; et al.
          <year>1991</year>
          .
          <article-title>Breakdown points of affine equivariant estimators of multivariate location and covariance matrices</article-title>
          .
          <source>The Annals of Statistics</source>
          <volume>19</volume>
          (
          <issue>1</issue>
          ):
          <fpage>229</fpage>
          -
          <lpage>248</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          <string-name>
            <surname>Lowd</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Meek</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <year>2005</year>
          .
          <article-title>Adversarial learning</article-title>
          .
          In
          <source>International Conference on Machine Learning</source>
          ,
          <fpage>641</fpage>
          -
          <lpage>647</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          <string-name>
            <surname>Martin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Everitt</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Hutter</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Death and suicide in universal artificial intelligence</article-title>
          .
          In
          <source>Artificial General Intelligence</source>
          . Springer.
          <fpage>23</fpage>
          -
          <lpage>32</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          <string-name>
            <surname>Mhamdi</surname>
            ,
            <given-names>E. M. E.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Guerraoui</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Rouault</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          <article-title>The hidden vulnerability of distributed learning in Byzantium</article-title>
          .
          In
          <source>International Conference on Machine Learning</source>
          ,
          <fpage>3518</fpage>
          -
          <lpage>3527</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2013</year>
          .
          <article-title>Efficient estimation of word representations in vector space</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          arXiv preprint arXiv:1301.3781.
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          <string-name>
            <surname>Nakamoto</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2008</year>
          .
          <article-title>Bitcoin: A peer-to-peer electronic cash system</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          <string-name>
            <surname>Neumann</surname>
            ,
            <given-names>J. v.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Morgenstern</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          <year>1944</year>
          .
          <article-title>Theory of games and economic behavior</article-title>
          . Princeton: Princeton University Press.
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>A. Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Russell</surname>
            ,
            <given-names>S. J.</given-names>
          </string-name>
          ; et al.
          <year>2000</year>
          .
          <article-title>Algorithms for inverse reinforcement learning</article-title>
          .
          In
          <source>ICML</source>
          ,
          <fpage>663</fpage>
          -
          <lpage>670</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>
          <collab>World Health Organization</collab>
          , et al.
          <year>2009</year>
          .
          <article-title>Death and DALY estimates for 2004 by cause for WHO member states</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref51">
        <mixed-citation>
          <string-name>
            <surname>Orseau</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Armstrong</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Safely interruptible agents</article-title>
          .
          In
          <source>Uncertainty in Artificial Intelligence: 32nd Conference (UAI 2016)</source>
          , edited by Alexander Ihler and Dominik Janzing
          ,
          <fpage>557</fpage>
          -
          <lpage>566</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref52">
        <mixed-citation>
          <string-name>
            <surname>Ricci</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Rokach</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Shapira</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>Recommender systems: introduction and challenges</article-title>
          .
          In
          <source>Recommender systems handbook</source>
          . Springer.
          <fpage>1</fpage>
          -
          <lpage>34</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref53">
        <mixed-citation>
          <string-name>
            <surname>Russell</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Dewey</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Tegmark</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>Research priorities for robust and beneficial artificial intelligence</article-title>
          .
          <source>AI Magazine</source>
          <volume>36</volume>
          (
          <issue>4</issue>
          ):
          <fpage>105</fpage>
          -
          <lpage>114</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref54">
        <mixed-citation>
          <string-name>
            <surname>Satterthwaite</surname>
            ,
            <given-names>M. A.</given-names>
          </string-name>
          <year>1975</year>
          .
          <article-title>Strategy-proofness and Arrow's conditions: Existence and correspondence theorems for voting procedures and social welfare functions</article-title>
          .
          <source>Journal of Economic Theory</source>
          <volume>10</volume>
          (
          <issue>2</issue>
          ):
          <fpage>187</fpage>
          -
          <lpage>217</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref55">
        <mixed-citation>
          <string-name>
            <surname>Simpson</surname>
            ,
            <given-names>E. H.</given-names>
          </string-name>
          <year>1951</year>
          .
          <article-title>The interpretation of interaction in contingency tables</article-title>
          .
          <source>Journal of the Royal Statistical Society.</source>
        </mixed-citation>
      </ref>
      <ref id="ref56">
        <mixed-citation>
          <source>Series B (Methodological)</source>
          <fpage>238</fpage>
          -
          <lpage>241</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref57">
        <mixed-citation>
          <string-name>
            <surname>Soares</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Fallenstein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Agent foundations for aligning machine intelligence with human interests: a technical research agenda</article-title>
          .
          In
          <source>The Technological Singularity</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref58">
        <mixed-citation>
          Springer.
          <fpage>103</fpage>
          -
          <lpage>125</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref59">
        <mixed-citation>
          <string-name>
            <surname>Soares</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>Aligning superintelligence with human interests: An annotated bibliography</article-title>
          .
          <source>Intelligence</source>
          <volume>17</volume>
          (
          <issue>4</issue>
          ):
          <fpage>391</fpage>
          -
          <lpage>444</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref60">
        <mixed-citation>
          <string-name>
            <surname>Soares</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>The value learning problem</article-title>
          .
          In
          <source>Ethics for Artificial Intelligence Workshop at 25th International Joint Conference on Artificial Intelligence</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref61">
        <mixed-citation>
          <string-name>
            <surname>Su</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Vargas</surname>
            ,
            <given-names>D. V.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Kouichi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>One pixel attack for fooling deep neural networks</article-title>
          .
          arXiv preprint arXiv:1710.08864.
        </mixed-citation>
      </ref>
      <ref id="ref62">
        <mixed-citation>
          <string-name>
            <surname>Tegmark</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Life 3.0. Being Human in the Age of Artificial Intelligence</article-title>
          . NY: Allen Lane.
        </mixed-citation>
      </ref>
      <ref id="ref63">
        <mixed-citation>
          <string-name>
            <surname>Thaler</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Sunstein</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <year>2009</year>
          .
          <article-title>Nudge: Improving Decisions About Health, Wealth, and Happiness</article-title>
          . Penguin Books.
        </mixed-citation>
      </ref>
      <ref id="ref64">
        <mixed-citation>
          <year>2017</year>
          .
          <article-title>A game-theoretic analysis of the off-switch game</article-title>
          .
          In
          <source>International Conference on Artificial General Intelligence</source>
          ,
          <fpage>167</fpage>
          -
          <lpage>177</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref65">
        <mixed-citation>
          <string-name>
            <surname>Yudkowsky</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <year>2004</year>
          .
          <article-title>Coherent extrapolated volition</article-title>
          .
          <source>Singularity Institute for Artificial Intelligence.</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>