Towards Robust End-to-End Alignment

Lê Nguyên Hoang
EPFL, Chemin Alan Turing, Lausanne 1015, Switzerland

Abstract

Robust alignment is arguably both critical and extremely challenging. Loosely, it is the problem of designing algorithmic systems with strong guarantees of always being beneficial for mankind. In this paper, we propose a preliminary research program to address it in a reinforcement learning framework. This roadmap aims at decomposing the end-to-end alignment problem into numerous more tractable subproblems. We hope that each subproblem is sufficiently orthogonal to others to be tackled independently, and that combining the solutions to all such subproblems may yield a solution to alignment.

Introduction

As they are becoming more and more capable and ubiquitous, AIs are raising numerous concerns, including fairness, privacy, filter bubbles, addiction, job displacement or even existential risks (Russell, Dewey, and Tegmark 2015; Tegmark 2017). It has been argued that aligning the goals of AI systems with humans' preferences would be an efficient way to make them reliably beneficial and to avoid potential catastrophic risks (Bostrom 2014; Hoang 2018a). In fact, given the global influence of today's large-scale recommender systems (Kramer, Guillory, and Hancock 2014), it already seems urgent to propose even partial solutions to alignment.

Unfortunately, it has also been argued that alignment is an extremely difficult problem. In fact, (Bostrom 2014) argues that it "is a research challenge worthy of some of the next generation's best mathematical talent". To address it, the Future of Life Institute proposed a landscape of AI safety research (see https://futureoflife.org/landscape/). Meanwhile, (Soares 2015; Soares and Fallenstein 2017) listed important ideas in this line of work. We hope that this paper will contribute to outlining the main challenges posed by alignment.

In particular, we shall introduce a complete research program to robustly align AIs. Robustness here refers to numerous possible failure modes, including overfitting, hazardous exploration, evasion attacks, poisoning attacks, crash tolerance, Byzantine resilience, reward hacking and wireheading. To guarantee such robustness, we argue that it is desirable to structure (at least conceptually) our AI systems in their entirety. This motivated us to propose a roadmap for robust end-to-end alignment.

While much of our proposal is speculative, we believe that several of the ideas presented here will be critical for AI safety and alignment. More importantly, we hope that this will be a useful roadmap for both AI experts and non-experts to better estimate how they can best contribute to the effort.

Given the complexity of the problem, our roadmap here will likely be full of gaps and false good ideas. It is important to note that our purpose is not to propose a definite perfect solution. Rather, we aim at presenting a sufficiently good starting point for others to build upon.

The Roadmap

Our roadmap consists of identifying key steps to alignment. For the sake of exposition, these steps will be personified by 5 characters, called Alice, Bob, Charlie, Dave and Erin. Roughly speaking, Erin will be collecting data from the world, Dave will use these data to infer the likely states of the world, Charlie will compute the desirability of the likely states of the world, Bob will derive incentive-compatible rewards to motivate Alice to take the right decision, and Alice will optimize decision-making. This decomposition is graphically represented in Figure 1.

Figure 1: We decompose the alignment problem into 5 key steps: data collection, world model inference, desirability learning, incentive design and reinforcement learning.

Evidently, Alice, Bob, Charlie, Dave and Erin need not be 5 different AIs. Typically, it may be much more computationally efficient to merge Charlie and Dave. Nevertheless, at least for pedagogical reasons, it seems useful to first dissociate the different roles that these AIs have.

In the sequel, we shall further detail the challenges posed by each of the 5 AIs. We shall also argue that, for robustness and scalability reasons, these AIs will need to be further divided into many more AIs. We will see that this raises additional challenges. We shall also make a few non-technical remarks, before concluding.
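Purely to make this decomposition concrete, the data flow between the five roles can be pictured as a pipeline of interfaces. The following sketch is only illustrative: every type, name and signature in it is invented here and is not part of the roadmap itself.

```python
from dataclasses import dataclass
from typing import Callable, List

# Illustrative-only interfaces mirroring the informal data flow
# Erin -> Dave -> Charlie -> Bob -> Alice. All names are hypothetical.

@dataclass
class WorldModel:
    description: str    # one candidate state of the world
    probability: float  # Dave's credence in this candidate

ErinCollect = Callable[[], List[str]]                  # Erin: raw observations
DaveInfer = Callable[[List[str]], List[WorldModel]]    # Dave: likely world states
CharlieScore = Callable[[WorldModel], float]           # Charlie: desirability score
BobReward = Callable[[List[WorldModel], CharlieScore], float]  # Bob: Alice's reward

def one_step(erin: ErinCollect, dave: DaveInfer,
             charlie: CharlieScore, bob: BobReward) -> float:
    """One conceptual pass through the pipeline; Alice would then optimize
    her decisions against the reward returned by Bob."""
    data = erin()
    world_models = dave(data)
    return bob(world_models, charlie)
```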
Alice's reinforcement learning

It seems that today's most promising framework for large-scale AIs is that of reinforcement learning. In reinforcement learning, an AI can be regarded as a decision-making process. At time t, the AI observes some state of the world s_t. Depending on its inner parameters θ_t, it then takes (possibly randomly) some action a_t.

The decision a_t then influences the next state and turns it into s_{t+1}. The transition from s_t to s_{t+1} given action a_t is usually assumed to be nondeterministic. In any case, the AI then receives a reward R_{t+1}. The internal parameters θ_t of the AI may then be updated into θ_{t+1}, depending on the previous parameters θ_t, action a_t, state s_{t+1} and reward R_{t+1}.

Note that this is a very general framework. In fact, we humans are arguably (at least partially) subject to this framework. At any point in time, we observe new data s_t that informs us about the world. Using an inner model of the world θ_t, we then infer what the world probably is like, which motivates us to take some action a_t. This may affect what likely next data s_{t+1} will be observed, and may be accompanied by a rewarding (or painful) feeling R_{t+1}, which will motivate us to update our inner model of the world θ_t into θ_{t+1}.

Let us call Alice the AI in charge of performing this reinforcement learning reasoning. Alice can thus be viewed as an algorithm, which inputs observed states s_t and rewards R_t, and undertakes actions a_t so as to typically maximize some discounted sum of expected future rewards.
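To make the notation above concrete, here is a minimal sketch of this interaction loop on an invented two-state environment, with a tabular update rule standing in for Alice's parameters θ_t; the transition probabilities, learning rate and discount factor are all placeholders.

```python
import random

def environment_step(state, action):
    """Toy nondeterministic environment: returns (s_{t+1}, R_{t+1})."""
    if action == 1 and random.random() < 0.8:
        return 1, 1.0   # action 1 usually leads to the rewarding state
    return 0, 0.0

gamma, alpha, epsilon = 0.9, 0.1, 0.1          # discount, learning rate, exploration rate
theta = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}   # Alice's inner parameters

state = 0
for t in range(10_000):
    # Alice picks a_t from s_t using theta_t (here, epsilon-greedy).
    if random.random() < epsilon:
        action = random.choice((0, 1))
    else:
        action = max((0, 1), key=lambda a: theta[(state, a)])

    next_state, reward = environment_step(state, action)

    # Update rule: theta_{t+1} as a function of (theta_t, a_t, s_{t+1}, R_{t+1}).
    best_next = max(theta[(next_state, a)] for a in (0, 1))
    theta[(state, action)] += alpha * (reward + gamma * best_next - theta[(state, action)])
    state = next_state

# theta now estimates the discounted sum of expected future rewards of each (state, action).
```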
Such actions will probably be mostly of the form of messages sent through the Internet. This may sound benign. But it is not. The YouTube recommender system might suggest billions of antivax videos, causing a major decrease of vaccination and a rise of deadly diseases. Worse, if an AI is in control of 3D-printers, then a message that tells them to construct killer drones to cause a genocide would be catastrophic. On a brighter note, if an AI now promotes convincing eco-friendly messages every day to billions of people, the public opinion on climate change may greatly change.

Note that, as opposed to all other components, in some sense, Alice is the real danger. Indeed, in our framework, she is the only one that really undertakes actions. More precisely, only her actions will be unconstrained (although others highly influence her decision-making and are thus critical as well).

As a result, it is of the utmost importance that Alice be well-designed. Some past works (Orseau and Armstrong 2016; El Mhamdi et al. 2017) have proposed to restrict the learning capabilities of Alice to provide provably desirable properties. Typically, they proposed to allow only a subclass of learning algorithms, i.e. of update rules of θ_{t+1} as a function of (θ_t, a_t, s_{t+1}, R_{t+1}). However, such restrictions might be too costly. And this may be a big problem.

Indeed, there is already a race between competing companies in competing countries to construct powerful AIs. While it might be possible for some countries to impose some restrictions on some AIs of some companies, it is unlikely that all companies of all countries will accept to be restricted, especially if the restrictions are too constraining. In fact, AI safety will be useful only if the most powerful AIs are all subject to safety measures. As a result, the safety measures that are proposed should not be too constraining. In other words, there are constraints on the safety constraints that can be imposed. This is what makes AI safety so challenging.

As a result, what is perhaps more interesting are the ideas proposed by (Amodei et al. 2016) to make reinforcement learning safer, especially using model lookahead. This essentially corresponds to Alice simulating many likely scenarios before undertaking any action. More generally, Alice faces a safe exploration problem.
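As a minimal sketch of what model lookahead can mean in this setting, an agent can screen each candidate action by simulating rollouts in its own world model and discarding actions whose simulated outcomes ever look catastrophic. The world model, actions and threshold below are all invented, and this is only one possible reading of the idea.

```python
import random

def rollout_worst(world_model, state, action, horizon=5):
    """Simulate one imagined trajectory under `action` and return its worst value."""
    worst = float("inf")
    for _ in range(horizon):
        state = world_model(state, action)
        worst = min(worst, state["value"])
    return worst

def lookahead_choice(world_model, state, candidate_actions, n_rollouts=200, threshold=0.0):
    """Keep only actions whose simulated worst case stays above `threshold`,
    then pick the one with the best worst case; otherwise, defer to humans."""
    acceptable = []
    for action in candidate_actions:
        worst = min(rollout_worst(world_model, state, action) for _ in range(n_rollouts))
        if worst >= threshold:
            acceptable.append((worst, action))
    return max(acceptable)[1] if acceptable else "ask for help"

# Invented world model: "risky" usually pays off but occasionally collapses.
def toy_world_model(state, action):
    step = random.gauss(0.5, 0.2) if action == "risky" else random.gauss(0.1, 0.05)
    if action == "risky" and random.random() < 0.02:
        step = -100.0
    return {"value": state["value"] + step}

print(lookahead_choice(toy_world_model, {"value": 1.0}, ["risky", "cautious"]))
```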
But this is not all. Given that AIs will likely be based on machine learning, and given the lack of verification methods for AIs obtained by machine learning, we should not expect AIs to be correct all the time. Just like humans, AIs will likely sometimes be wrong. But this is extremely worrisome. Indeed, even if an AI is right 99.9999% of the time, it will still be wrong one time out of a million. Yet, AIs like recommender systems or autonomous cars take billions of decisions every day. In such cases, thousands of AI decisions may be unboundedly wrong every day!

This problem can become even more worrisome if we take into account the fact that hackers may attempt to take advantage of the AIs' deficiencies. Such hackers may typically submit only data that correspond to cases where the AIs are wrong. This is known as evasion attacks (Lowd and Meek 2005; Su, Vargas, and Kouichi 2017; Gilmer et al. 2018). To avoid evasion attacks, it is crucial for an AI to never be unboundedly wrong, e.g. by reliably measuring its own confidence in its decisions and asking for help in cases of great uncertainty.

Now, even if Alice is well-designed, she will only be an effective optimization algorithm. Unfortunately, this is no guarantee of safety or alignment. Typically, because of humans' well-known addiction to echo chambers (Haidt 2012), a watch-time-maximizing YouTube recommender AI may amplify filter bubbles, which may lead to worldwide geopolitical tensions. Both misaligned and unaligned AIs will likely lead to very undesirable consequences.

In fact, (Bostrom 2014) even argues that, to best reach its goals, any sufficiently strategic AI will likely first aim at so-called instrumental goals, e.g. gaining vastly more resources and guaranteeing self-preservation. But this is very unlikely to be in humans' best interests. In particular, it will likely motivate the AI to undertake actions that we would not regard as desirable.

To make sure that Alice will want to behave as we want her to, it seems critical to at least partially control the observed state s_{t+1} or the reward R_{t+1}. Note that this is similar to the way children are taught to behave. We do so by exposing them to specific observed states, by punishing them when the sequence (s_t, a_t, s_{t+1}) is undesirable, and by rewarding them when the sequence (s_t, a_t, s_{t+1}) is desirable.

Whether or not Alice's observed state s_t is constrained, her rewards R_t are clearly critical. They are her incentives, and will thus determine her decision-making. Unfortunately, determining the adequate rewards R_t to be given to Alice is an extremely difficult problem. It is, in fact, the key to alignment. Our roadmap to solve it identifies 4 key steps, incarnated by Erin, Dave, Charlie and Bob.

Erin's data collection problem

In order to do good, it is evidently crucial to be given a lot of reliable data. Indeed, even the most brilliant mind will be unable to know anything about the world if it does not have any data from that world. This is particularly true when the goal is to undertake desirable actions, or to make sure that one's action will not have potentially catastrophic consequences.

Evidently, much data is already available on the Internet. It is likely that any large-scale AI will have access to the Internet, as is already the case of the Facebook recommender system. However, it is important to take into account the fact that the data on the Internet is not always fully reliable. It may be full of fake news, fraudulent entries, misleading videos, hacked posts and corrupted files.

It may then be relevant to invest in more reliable and relevant data collection. This would be Erin's job. Typically, Erin may want to collect economic metrics to better assess needs. Recently, it has been shown that satellite images combined with deep learning make it possible to compute all sorts of useful economic indicators (Jean et al. 2016), including poverty risks and agricultural productivity. It is possible that the use of still more sensors can further increase our capability to improve life standards, especially in developing countries.

To guarantee the reliability of such data, cryptographic and distributed computing solutions are likely to be useful as well, as they already are on the web. In particular, distributed computing, combined with recent Byzantine-resilient consensus algorithms like Blockchain (Nakamoto 2008) or Hashgraph (Baird 2016), could guarantee the reliable storage and traceability of critical information.

Note though that such data collection mechanisms could pose major privacy issues. It is a major current challenge to balance the usefulness of collected data and the privacy violations they inevitably cause. Some possible solutions include differential privacy (Dwork, Roth, and others 2014), or weaker versions like generative-adversarial privacy (Huang et al. 2017). It could also be possible to combine these with more cryptographic solutions, like homomorphic encryption or multi-party computation. It is interesting that such cryptographic solutions may be (essentially) provably robust to any attacker, including a superintelligence (though the possible use of quantum computers may require postquantum cryptography).
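As a standard illustration of the differential privacy idea mentioned above, Erin could add calibrated Laplace noise to any aggregate statistic she publishes. The data, threshold and privacy budget below are invented for the example.

```python
import random

def laplace_noise(scale: float) -> float:
    """Sample Laplace(0, scale) noise as a difference of two exponentials."""
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon: float = 0.1) -> float:
    """Release a count with epsilon-differential privacy.

    A counting query has sensitivity 1 (one person changes the count by at
    most 1), so Laplace(1/epsilon) noise is the standard calibration."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

# Hypothetical use: Erin reports how many surveyed households fall below a
# poverty threshold without exposing any individual household.
households = [{"income": random.uniform(0.0, 100.0)} for _ in range(1000)]
print(private_count(households, lambda h: h["income"] < 20.0, epsilon=0.1))
```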
Dave's world model problem

Unfortunately, raw data are usually extremely messy, redundant, incomplete, unreliable, poisoned and even hacked. To tackle these issues, it is necessary to infer the likely actual states of the world, given Erin's collected data. This will be Dave's job.

The overarching principle of Dave's job is probably going to be some deep representation learning. This corresponds to determining low-dimensional representations of high-dimensional data. This basic idea has given rise to today's most promising unsupervised machine learning algorithms, e.g. word vectors (Mikolov et al. 2013), autoencoders (Liou, Huang, and Yang 2008) and generative adversarial networks (GANs) (Goodfellow et al. 2014).

Given how crucial it is for Dave to have an unbiased representation of the world, much care will be needed to make sure that Dave's inference will foresee selection biases. For instance, when asked to provide images of CEOs, Google Images may return a greater ratio of male CEOs than the actual ratio. More generally, such biases can be regarded as instances of Simpson's paradox (Simpson 1951), and boil down to the saying "correlation is not causation". It seems crucial that Dave does not fall into this trap.

In fact, data can be worse than unintentionally misleading. Given how influential Alice may be, there will likely be great incentives for many actors to bias Erin's data gathering, and to thus fool Dave. This is known as poisoning attacks (Blanchard et al. 2017; Mhamdi, Guerraoui, and Rouault 2018; Damaskinos et al. 2018). It seems extremely important that Dave anticipates the fact that the data he was given may be purposely biased, if not hacked. Like any good journalist, Dave will likely need to cross-check information from different sources to infer the most likely states of the world.

This inference approach is well captured by the Bayesian paradigm (Hoang 2018b). In particular, Bayes rule is designed to infer the likely causes of the observed data D. These causes can also be regarded as theories T (and such theories may assume that some of the data were hacked). Bayes rule tells us that the reliability of theory T given data D can be derived formally by the following computation:

    P[T | D] = P[D | T] P[T] / P[D].

One typical instance of Dave's job is the problem of inferring global health from a wide variety of collected data. This is what has been done by (Institute for Health Metrics and Evaluation (IHME), University of Washington 2016), using a sophisticated Bayesian model that reconstructed the likely causes of deaths in countries where data were lacking.

Importantly, Bayes rule also tells us that we should not fully believe any single theory. This simply corresponds to saying that data can often be interpreted in many different mutually incompatible manners. It seems important to reason with all possible interpretations rather than isolating a single interpretation that may be flawed.

When the space of possible states of the world is large, which will surely be the case for Dave, it is often computationally intractable to reason with the full posterior distribution P[T | D]. Bayesian methods often rather propose to sample from the posterior distribution to identify a reasonable number of good interpretations of the data. These sampling methods include Monte-Carlo methods, as well as Markov-Chain Monte-Carlo (MCMC) ones.
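As a minimal illustration of this computation, consider an invented example with three candidate theories for a stream of binary sensor readings, one of which asserts that the data was partly injected by an attacker. With so few theories the posterior can be computed exactly, whereas Dave would have to sample (e.g. by MCMC) from a far larger space.

```python
from math import prod

# Observed data D: binary sensor readings (invented).
data = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]

# Candidate theories T, each with a prior and a likelihood model for a "1" reading.
theories = {
    "reliable sensor":       {"prior": 0.6, "p_one": 0.9},
    "faulty sensor":         {"prior": 0.3, "p_one": 0.5},
    "hacked, ones injected": {"prior": 0.1, "p_one": 0.99},
}

def likelihood(p_one, observations):
    """P[D | T] for independent binary readings."""
    return prod(p_one if x == 1 else 1.0 - p_one for x in observations)

# Bayes rule: P[T | D] = P[D | T] P[T] / P[D], with P[D] obtained by normalization.
joint = {name: t["prior"] * likelihood(t["p_one"], data) for name, t in theories.items()}
evidence = sum(joint.values())
for name, j in joint.items():
    print(f"P[{name} | D] = {j / evidence:.3f}")
```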
In some sense, Dave's job can be regarded as writing a compact report of all likely states of the world, given Erin's collected data. It is an open question what language Dave's report should be written in. It might be useful to make it understandable by humans. But it might be too costly as well. Indeed, Dave's report might be billions of pages long. It could be unreasonable or undesirable to make it humanly readable.

Note also that Erin and Dave are likely to gain cognitive capabilities over time. It is surely worthwhile to anticipate the complexification of Erin's data and of Dave's world models. It seems unclear so far how to do so. Some high-level (purely descriptive) language to describe world models is probably needed. In addition, this high-level language may need to be flexible enough to be reshaped and redesigned over time. This may be dubbed the world description problem. It is arguably still a very open and uncharted area of research.

Charlie's desirability learning problem

Given any of Dave's world models, Charlie's job will then be to compute how desirable this world model is. This is the desirability learning problem (Soares 2016), also known as value learning (to avoid raising eyebrows, we shall try to steer away from polarizing terminologies like values, morals or ethics). This is the problem of assigning desirability scores to different world models. These desirability scores can then serve as the basis for any agent to determine beneficial actions.

Unfortunately, determining what, say, the median human considers desirable is an extremely difficult problem. But again, it should be stressed that we should not aim at deriving an ideal inference of what people desire. This is likely to be a hopeless endeavor. Rather, we should try our best to make sure that Charlie's desirability scores will be good enough to avoid catastrophic outcomes, e.g. world destruction, global suffering or major discrimination.

One proposed solution to infer human preferences is so-called inverse reinforcement learning (Ng, Russell, and others 2000; Evans, Stuhlmüller, and Goodman 2016). Assuming that humans perform reinforcement learning to choose their actions, and given examples of actions taken by humans in different contexts, inverse reinforcement learning infers what were the humans' likely implicit rewards that motivated their decision-making. Assuming we can somehow separate humans' selfish rewards from altruistic ones, inverse reinforcement learning seems to be a promising first step towards inferring humans' preferences from data. There are, however, many important considerations to be taken into account, which we discuss below.
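The following toy sketch conveys the flavor of this inference under a common simplifying assumption, namely that the observed human chooses actions with probability proportional to the exponential of their implicit reward (a Boltzmann-rational model). The contexts, candidate reward functions and rationality parameter are all invented for illustration, and real inverse reinforcement learning must additionally deal with sequential decisions rather than one-shot choices.

```python
from math import exp, prod

# Observed human choices: (context, chosen action). Invented data.
observations = [("evening", "read"), ("evening", "read"), ("evening", "tv")]
actions = ("read", "tv")

# Candidate implicit reward functions that might explain the behavior.
candidate_rewards = {
    "values learning": {("evening", "read"): 1.0, ("evening", "tv"): 0.0},
    "values comfort":  {("evening", "read"): 0.0, ("evening", "tv"): 1.0},
    "indifferent":     {("evening", "read"): 0.5, ("evening", "tv"): 0.5},
}
beta = 3.0  # assumed degree of rationality of the observed human

def choice_probability(reward, context, action):
    """Boltzmann-rational choice model: P[action | context, reward]."""
    weights = {a: exp(beta * reward[(context, a)]) for a in actions}
    return weights[action] / sum(weights.values())

# Posterior over the candidate rewards given the observations (uniform prior).
likelihoods = {
    name: prod(choice_probability(r, c, a) for c, a in observations)
    for name, r in candidate_rewards.items()
}
total = sum(likelihoods.values())
for name, lik in likelihoods.items():
    print(f"P[{name} | observations] = {lik / total:.3f}")
```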
mans in different contexts, inverse reinforcement learning While majority judgment seems to be a promising ap- infers what were the humans’ likely implicit rewards that proach, it does raise the question of how to compare two dif- motivated their decision-making. Assuming we can some- ferent individuals’ scores. It is not clear that score = 5 given how separate humans’ selfish rewards from altruistic ones, by John has a meaning comparable to Jane’s score = 5. In inverse reinforcement learning seems to be a promising first fact, according to a theorem by von Neumann and Morgen- step towards inferring humans’ preferences from data. There stern (Neumann and Morgenstern 1944), within their frame- are, however, many important considerations to be taken into work, utility functions are only defined up to a positive affine account, which we discuss below. transformation. More work is probably needed to determine First, it is important to keep in mind that, despite Dave’s how to scale different individuals’ utility functions appro- effort and because of Erin’s limited and possibly biased data priately, despite previous attempts in special cases (Hoang, collection, Dave’s world model is fundamentally uncertain. Soumis, and Zaccour 2016). Again, it should be stressed that In fact, as discussed previously, Dave would probably rather we should not aim at an ideal solution; a workable reason- present a distribution of likely world models. Charlie’s job able solution is much better than no solution at all. should be regarded as a scoring of all such likely world mod- Now, arguably, humans’ current preferences are almost els. In particular, she should not assign a single number to surely undesirable. Indeed, over the last decades, psychol- the current state of the world, but, rather, a distribution of ogy has been showing again and again that human think- likely scores of the current state of the world. This distribu- ing is full of inconsistencies, fallacies and cognitive biases tion should convey the uncertainty about the actual state of (Kahneman 2011). We tend to first have instinctive reactions the world. Besides, as we shall see, this uncertainty is likely to stories or facts (Bloom 2016), which quickly becomes the to be crucial for Bob to choose incentive-compatible rewards position we will want to defend at all costs (Haidt 2012). for Alice adequately. Worse, we are unfortunately largely unaware of why we be- Another challenging aspect of Charlie’s job will be to pro- lieve or want what we believe or want. This means that our vide a useful representation of potential human disagree- current preferences are unlikely to be what we would prefer, ments about the desirability of different states of the world. if we were more informed, thought more deeply, and tried to Humans’ preferences are diverse and may never converge. make sure our preferences were as well-founded as possible. This should not be swept under the rug. Instead, we need to And arguably, we should prefer what we would prefer to agree on some way to mitigate disagreement. prefer, rather than what we instinctively prefer. Typically, one might prefer to watch a cat video, even though one might 3 prefer to prefer mathematics videos over cat videos. Desir- To avoid raising eyebrows, we shall try to steer away from polarizing terminologies like values, moral or ethics. ablity scores should arguably encode what we would prefer to prefer, rather than what we instinctively prefer. 
While majority judgment seems to be a promising approach, it does raise the question of how to compare two different individuals' scores. It is not clear that a score of 5 given by John has a meaning comparable to a score of 5 given by Jane. In fact, according to a theorem by von Neumann and Morgenstern (Neumann and Morgenstern 1944), within their framework, utility functions are only defined up to a positive affine transformation. More work is probably needed to determine how to scale different individuals' utility functions appropriately, despite previous attempts in special cases (Hoang, Soumis, and Zaccour 2016). Again, it should be stressed that we should not aim at an ideal solution; a workable reasonable solution is much better than no solution at all.

Now, arguably, humans' current preferences are almost surely undesirable. Indeed, over the last decades, psychology has been showing again and again that human thinking is full of inconsistencies, fallacies and cognitive biases (Kahneman 2011). We tend to first have instinctive reactions to stories or facts (Bloom 2016), which quickly become the positions we will want to defend at all costs (Haidt 2012). Worse, we are unfortunately largely unaware of why we believe or want what we believe or want. This means that our current preferences are unlikely to be what we would prefer, if we were more informed, thought more deeply, and tried to make sure our preferences were as well-founded as possible.

And arguably, we should prefer what we would prefer to prefer, rather than what we instinctively prefer. Typically, one might prefer to watch a cat video, even though one might prefer to prefer mathematics videos over cat videos. Desirability scores should arguably encode what we would prefer to prefer, rather than what we instinctively prefer.

To understand this, a thought experiment may be useful. Let us imagine better versions of us. Each current me is thereby associated with a me++. A me++ is what current me would desire, if current me were smarter, thought much longer about what he finds desirable, and analyzed all imaginable data of the world. Arguably, me++'s desirability score is "more right" than current me's.

This can be illustrated by the fact that past standards are often no longer regarded as desirable. Our intuitions about the desirability of slavery, homosexuality and gender discrimination have been completely upset over the last century, if not over the last few decades. It seems unlikely that all of our other intuitions will never change. In particular, it seems unlikely that me++ will fully agree with current me. And it seems reasonable to argue that me++ would be "more right" than current me.

These remarks are the basis of coherent extrapolated volition (Yudkowsky 2004). The basic idea is that we should aim at the preferences that future versions of ourselves would eventually adopt, if they were vastly more informed, had much more time to ponder what they regard as desirable, and tried their best to be better versions of themselves. In some sense, instead of making current me's debate about what's desirable (which often turns into a pointless debacle), we should let me++'s debate. In fact, since me++'s supposedly already know everything about other me++'s, there is actually no point in getting them to debate. It suffices to aggregate their preferences through some social choice mechanism. This is the preference aggregation problem.
It is noteworthy that we clearly have epistemic uncertainty about me++'s. Determining me++'s desirability scores may be called the coherent extrapolated individual volition problem. Interestingly, this is (mostly) a prediction problem. But it is definitely too ambitious to predict them with absolute certainty. Bayes rule tells us that we should rather describe these desirability scores by a probability distribution of likely desirability scores.

Such scores could also be approximated using a large number of proxies, as is done by boosting methods (Arora, Hazan, and Kale 2012). The use of several proxies could avoid the overfitting of any single proxy. Typically, rather than relying solely on DALYs (Organization and others 2009), we probably should invoke machine learning methods to combine a large number of similar metrics, especially those that aim at describing other desirable economic metrics, like the human development index (HDI) or gross national happiness (GNH). Still another approach may consist of analyzing "typical" human preferences, e.g. by using collaborative filtering techniques (Ricci, Rokach, and Shapira 2015). Evidently, much more research is needed along these lines.

Computing the desirability of a given world state is Charlie's job. In some sense, Charlie's job would thus be to remove cognitive biases from our intuitive preferences, so that they still basically reflect what we really regard as preferable, but in a more coherent and informed manner. This is an incredibly difficult problem, which will likely take decades to sort out reasonably well. This is why it is of the utmost importance that it be started as soon as possible. Let us try our best to describe, informally and formally, what better versions of ourselves would likely regard as desirable. Let us try to predict the volition of me++'s.

This attempt is likely going to be shocking to us all. Indeed, we should expect that better versions of ourselves will find desirable things that the current versions of ourselves find repelling. Unfortunately though, we humans tend to react poorly to disagreeing judgments. And this is likely to hold even when the opposition comes from our better selves. This poses a great scientific and engineering challenge. How can one be best convinced of the judgments that he or she will eventually embrace but does not yet? In other words, how can we quickly agree with better versions of ourselves? What could someone else say to get me closer to my me++? This may be dubbed the individual improvement problem.

To address this issue, (Irving, Christiano, and Amodei 2018) have discussed the possibility of setting up a debate between opposing AIs. In particular, they asked whether a human judge would be able to lean towards the better AI for the right reasons. Interestingly, such a debate might allow for significantly more powerful "proofs of superiority" than monologues, at least if the analogy with the so-called polynomial hierarchy of complexity theory holds.

This question is critical for alignment, as it will likely be a key challenge to build trust in the systems we design. But evidently, this is a more general question that should be of interest to anyone who desires to do good.

Bob's incentive design

The last piece of the jigsaw is Bob's job. Bob is in charge of computing the rewards that Alice will receive, based on the work of Erin, Dave and Charlie. Evidently, he could simply compute the expectation of Charlie's scores for the likely states of the world. But this is probably a bad idea, as it opens the door to reward hacking.

Recall that Alice's goal is to maximize her discounted expected future rewards. But given that Alice knows (or is likely to eventually guess) how her rewards are computed, instead of undertaking the actions that we would want her to, Alice could hack Erin, Dave or Charlie's computations, so that such hacked computations yield large rewards. This is sometimes called the wireheading problem.

Since all this computation starts with Erin's data collection, one way for Alice to increase her rewards would be to feed Erin with fake data that will make Dave infer a deeply flawed state of the world, which Charlie may regard as ideal. Worse, Alice may then find out that the best way to do so would be to invest all of Earth's resources into misleading Erin, Dave and Charlie. This could potentially be extremely bad for mankind. Indeed, especially if Alice cares about discounted future rewards, she might eventually regard mankind as a possible threat to her objective.

This is why it is of the utmost importance that Alice's incentives be (partially) aligned with Erin, Dave and Charlie performing well and being accurate. This will be Bob's job. Bob will need to make sure that, while Alice's rewards do correlate with Charlie's scores, they also give Alice the incentives to guarantee that Erin, Dave and Charlie perform as reliably as possible the job they were given.
In fact, it even seems desirable that Alice be incentivized to constantly upgrade Erin, Dave and Charlie for the better. Ideally, she would even want them to be computationally more powerful than herself, especially in the long run. This approach would bear resemblance to the idea of self-nudge (Thaler and Sunstein 2009). This corresponds to strategies that we humans sometimes use to nudge ourselves (or others) into doing what we want to want to do, rather than what our latest emotion or laziness invites us to do.

Unfortunately, it seems unclear how Bob can best make sure that Alice has such incentives. Perhaps a good idea is to penalize Dave's reported uncertainty about the likely states of the world. Typically, Bob should make sure Alice's rewards are affected by the reliability of Erin's data. The more reliable Erin's data, the larger Alice's rewards. Similarly, when Dave or Charlie feel that their computations are unreliable, Bob should take note of this and adjust Alice's rewards accordingly, to motivate Alice to provide larger resources for Charlie's computations.

Now, Bob should also balance the desire to retrieve more reliable data and perform more trustworthy computations against the fact that such efforts will necessarily require the exploitation of more resources, probably at the expense of Charlie's scores. It is this non-trivial trade-off that Bob will need to take care of.
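The paper leaves open how Bob should implement such incentives. Purely as a hypothetical illustration of the two ideas above (penalizing Dave's reported uncertainty and rewarding the reliability of Erin's data), one could imagine a reward of the following shape, where every quantity and coefficient is invented and would in practice have to be made robust against manipulation.

```python
def bobs_reward(charlie_scores, dave_uncertainty, erin_reliability,
                lam_uncertainty=1.0, lam_reliability=1.0):
    """Hypothetical reward for Alice combining three signals.

    charlie_scores:   Charlie's scores for Dave's likely world models
    dave_uncertainty: how unsure Dave reports being about the state of the world (0..1)
    erin_reliability: how reliable Erin reports her data to be (0..1)

    The reward correlates with Charlie's scores, but shrinks when Dave is
    uncertain and grows when Erin's data is reliable, so that Alice gains by
    keeping the rest of the pipeline accurate."""
    expected_score = sum(charlie_scores) / len(charlie_scores)
    return (expected_score
            - lam_uncertainty * dave_uncertainty
            + lam_reliability * erin_reliability)

# Invented numbers: a fairly desirable world, but inferred from shaky data.
print(bobs_reward([6.0, 7.5, 5.5], dave_uncertainty=0.8, erin_reliability=0.3))
```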
Bob's work might be simplified by some (partial) control of Alice's actions or world model. Although it seems unclear so far how, techniques like interactive proofs (IP) (Babai 1985; Goldwasser, Micali, and Rackoff 1989) or probabilistically checkable proofs (PCP) (Arora et al. 1998) might be useful to force Alice to prove her correct behavior. By requesting such proofs to yield large rewards, Bob might be able to incentivize Alice's transparency. All such considerations make up Bob's incentive problem.

It may or may not be useful to enable Bob to switch off Alice. It should be stressed though that (safe) interruptibility is nontrivial, as discussed by (Orseau and Armstrong 2016; El Mhamdi et al. 2017; Martin, Everitt, and Hutter 2016; Hadfield-Menell et al. 2016a; 2016b; Wängberg et al. 2017), among others. In fact, safe interruptibility seems to require very specific circumstances, e.g. Alice being indifferent to interruption, Alice being programmed to be suicidal in case of potential harm, or Alice having more uncertainty about her rewards than Bob being able to take over Alice's job. It seems unclear so far how relevant such circumstances will be to Bob's control problem over Alice (note though that this may be very relevant assuming that there are several Alices, as will be proposed later on). Besides, instead of interrupting Alice, Bob might prefer to guide Alice towards preferable actions by acting on Alice's rewards.

On another note, it may be computationally more efficient for all if, instead of merely transmitting a reward, Bob also feeds Alice with "backpropagating signals", that is, information not about the reward itself, but about its gradient with respect to key variables, e.g. Charlie's score or Erin's reliability. Having said this, we leave open the technical question of how to best design this.

Decentralization

We have decomposed alignment into 5 components for the sake of exposition. However, any component will likely have to be decentralized to gain reliability and scalability. In other words, instead of having a single Alice, a single Bob, a single Charlie, a single Dave and a single Erin, it seems crucial to construct multiple Alices, Bobs, Charlies, Daves and Erins.

This is key to crash tolerance. Indeed, a single computer doing Bob's job could crash and leave Alice without reward nor penalty. But if Alice's rewards are an aggregate of rewards given by a large number of Bobs, then even if some of the Bobs crash, Alice's rewards will remain mostly the same. But crash tolerance is likely to be insufficient. Instead, we should design Byzantine-resilient mechanisms, that is, mechanisms that still perform correctly despite the presence of hacked or malicious Bobs. Estimators with large statistical breakdown points (Lopuhaa, Rousseeuw, and others 1991), e.g. (geometric) medians and variants (Blanchard et al. 2017), may be useful for this purpose.
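As an illustration of the Byzantine-resilient aggregation idea just mentioned, here is a minimal Weiszfeld-style iteration computing an approximate geometric median of the reward vectors proposed by several Bobs. The vectors are invented, and a real deployment would need convergence checks, tie handling and authenticated channels.

```python
def geometric_median(points, iterations=100, eps=1e-9):
    """Approximate geometric median of equal-length vectors (Weiszfeld iteration)."""
    dim = len(points[0])
    estimate = [sum(p[i] for p in points) / len(points) for i in range(dim)]  # start at mean
    for _ in range(iterations):
        weights = []
        for p in points:
            dist = sum((p[i] - estimate[i]) ** 2 for i in range(dim)) ** 0.5
            weights.append(1.0 / max(dist, eps))      # distant (outlier) points get low weight
        total = sum(weights)
        estimate = [sum(w * p[i] for w, p in zip(weights, points)) / total
                    for i in range(dim)]
    return estimate

# Reward vectors proposed by five Bobs for the same action; one Bob is malicious.
proposed = [[1.0, 0.9], [1.1, 1.0], [0.9, 1.0], [1.0, 1.1], [100.0, -50.0]]
print(geometric_median(proposed))   # stays close to the honest majority, unlike the mean
```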
Evidently, in this Byzantine environment, cryptography, especially (postquantum?) cryptographic signatures and hashes, is likely to play a critical role. Typically, Bobs' rewards will likely need to be signed. More generally, the careful design of secure communication channels between the components of the AIs seems key. This may be called the secure messaging problem.

Another difficulty is the addition of more powerful and precise Bobs, Charlies, Daves and Erins to the pipeline. It is not yet clear how to best integrate reliable newcomers, especially given that such newcomers may be malicious. In fact, they may want to first act benevolently to gain admission. But once they are numerous enough, they could take over the pipeline and, say, feed Alice with infinite rewards. This is the upgrade problem, which was recently discussed by (Christiano, Shlegeris, and Amodei 2018), who proposed using numerous weaker AIs to supervise stronger AIs. More research in this direction is probably needed.

Now, in addition to reliability, decentralization may also enable different Alices, Bobs, Charlies, Daves and Erins to focus on specific tasks. This would make it possible to separate different problems, which could lead to more optimized solutions at lower costs. To this end, it may be relevant to adapt different Alices' rewards to their specific tasks. Note though that this could also be a problem, as Alices may enter into competition with one another, as in the prisoner's dilemma. We may call this the specialization problem. Again, there seems to be a lot of new research needed to address this problem.

Another open question is the extent to which AIs should be exposed to Bobs' rewards. Typically, if a small company creates its own AI, to what extent should this AI be aligned? It should be noted that this may be computationally very costly, as it may be hard to separate the signal of interest to the AI from the noise of Bobs' rewards. Intuitively, the more influential an AI is, the more it should be influenced by Bobs' rewards. But even if this AI is small, it may be important to demand that it be influenced by Bobs, to avoid any diffusion of responsibility, i.e. many small AIs that disregard safety concerns on the ground that they each hardly have any global impact on the world.

What makes this nontrivial is that any AI may gain capability and influence over time. An unaligned weak AI could eventually become an unaligned human-level AI. To avoid this, even basic, but potentially unboundedly self-improving AIs (nonparametric AIs should perhaps be treated differently from parametric ones here) should be given at least a seed of alignment, which may grow as AIs become more powerful. More generally, AIs should strike a balance between some original (possibly unaligned) objective and the importance they give to alignment. This may be called the alignment burden assignment problem.

Figure 2 recapitulates our complete roadmap.

Figure 2: We propose to decompose alignment into 5 steps. Each step is associated with further substeps or techniques. Also, there are critical subproblems that will likely be useful for several of the 5 steps.

Non-technical challenges

Given the difficulty of alignment, its resolution will surely require solving a large number of non-technical challenges as well. We briefly mention some of them here.

Perhaps most important is the lack of respectability that is sometimes associated with this line of research. For alignment to be solved, it needs to gain respectability from the scientific community, and perhaps beyond this community as well. This is why it seems to be of the utmost importance that discussions around alignment be carried out carefully to avoid confusions.

Evidently, alignment definitely needs much more manpower, which will require funding and recruiting. It seems particularly important to attract mathematical talents towards this line of work. This evidently also raises the challenge of training as many brilliant minds as possible.

Finally, questions around AI, AI safety and moral philosophy are sadly often poorly debated. There often is a lot of overconfidence, and a lack of well-founded reasoning. For alignment research to gain momentum, it seems crucial to make debating more informative, respectful and stimulating.

Conclusion

This paper discussed the alignment problem, that is, the problem of aligning the goals of AIs with human preferences. It presented a general roadmap to tackle this issue. Interestingly, this roadmap identifies 5 critical steps, as well as many relevant aspects of these 5 steps. In other words, we have presented a large number of hopefully more tractable subproblems that readers are highly encouraged to tackle. We hope that combining the solutions to these subproblems could help to partially address alignment. And we hope that any reader will be able to better determine how he or she may best contribute to the global effort; a more complete version of this paper is also available (Hoang 2018b).

Acknowledgment. The author would like to thank El Mahdi El Mhamdi, Henrik Aslund, Sébastien Rouault and Alexandre Maurer for fruitful discussions.

References

Amodei, D.; Olah, C.; Steinhardt, J.; Christiano, P.; Schulman, J.; and Mané, D. 2016. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565.

Arora, S.; Lund, C.; Motwani, R.; Sudan, M.; and Szegedy, M. 1998. Proof verification and the hardness of approximation problems. Journal of the ACM (JACM) 45(3):501–555.

Arora, S.; Hazan, E.; and Kale, S. 2012. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing 8(1):121–164.

Arrow, K. J. 1950. A difficulty in the concept of social welfare. Journal of Political Economy 58(4):328–346.

Babai, L. 1985. Trading group theory for randomness. In Proceedings of the Seventeenth Annual ACM Symposium on Theory of Computing, 421–429. ACM.

Baird, L. 2016. Hashgraph consensus: fair, fast, Byzantine fault tolerance. Technical report, Swirlds Tech Report.
Balinski, M., and Laraki, R. 2011. Majority Judgment: Measuring, Ranking, and Electing. MIT Press.

Blanchard, P.; El Mhamdi, E. M.; Guerraoui, R.; and Stainer, J. 2017. Machine learning with adversaries: Byzantine tolerant gradient descent. In Advances in Neural Information Processing Systems, 119–129.

Bloom, P. 2016. Against Empathy: The Case for Rational Compassion. Ecco.

Bostrom, N. 2014. Superintelligence: Paths, Dangers, Strategies. OUP Oxford.

Christiano, P.; Shlegeris, B.; and Amodei, D. 2018. Supervising strong learners by amplifying weak experts. In review.

Damaskinos, G.; El Mhamdi, E. M.; Guerraoui, R.; Patra, R.; Taziki, M.; et al. 2018. Asynchronous Byzantine machine learning (the case of SGD). In International Conference on Machine Learning, 1153–1162.

Dwork, C.; Roth, A.; et al. 2014. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science 9(3–4):211–407.

El Mhamdi, E. M.; Guerraoui, R.; Hendrikx, H.; and Maurer, A. 2017. Dynamic safe interruptibility for decentralized multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, 130–140.

Evans, O.; Stuhlmüller, A.; and Goodman, N. D. 2016. Learning the preferences of ignorant, inconsistent agents. In AAAI, 323–329.

Gibbard, A. 1973. Manipulation of voting schemes: a general result. Econometrica: Journal of the Econometric Society 587–601.

Gilmer, J.; Metz, L.; Faghri, F.; Schoenholz, S. S.; Raghu, M.; Wattenberg, M.; and Goodfellow, I. 2018. Adversarial spheres. arXiv preprint arXiv:1801.02774.

Goldwasser, S.; Micali, S.; and Rackoff, C. 1989. The knowledge complexity of interactive proof systems. SIAM Journal on Computing 18(1):186–208.

Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2672–2680.

Hadfield-Menell, D.; Dragan, A.; Abbeel, P.; and Russell, S. 2016a. The off-switch game. arXiv preprint arXiv:1611.08219.

Hadfield-Menell, D.; Russell, S. J.; Abbeel, P.; and Dragan, A. 2016b. Cooperative inverse reinforcement learning. In Advances in Neural Information Processing Systems, 3909–3917.

Haidt, J. 2012. The Righteous Mind: Why Good People Are Divided by Politics and Religion. Vintage.

Hoang, L. N.; Soumis, F.; and Zaccour, G. 2016. Measuring unfairness feeling in allocation problems. Omega 65:138–147.

Hoang, L. N. 2017. Strategy-proofness of the randomized Condorcet voting system. Social Choice and Welfare 48:679–701.

Hoang, L. N. 2018a. A roadmap for the value-loading problem. arXiv preprint arXiv:1809.01036.

Hoang, L. N. 2018b. La formule du savoir : une philosophie unifiée du savoir fondée sur le théorème de Bayes. EDP Sciences. English translation forthcoming.

Huang, C.; Kairouz, P.; Chen, X.; Sankar, L.; and Rajagopal, R. 2017. Context-aware generative adversarial privacy. Entropy 19(12):656.

Institute for Health Metrics and Evaluation (IHME), University of Washington. 2016. GBD Compare data visualization.

Irving, G.; Christiano, P.; and Amodei, D. 2018. AI safety via debate. arXiv preprint arXiv:1805.00899.

Jean, N.; Burke, M.; Xie, M.; Davis, W. M.; Lobell, D. B.; and Ermon, S. 2016. Combining satellite imagery and machine learning to predict poverty. Science 353(6301):790–794.

Kahneman, D. 2011. Thinking, Fast and Slow. Farrar, Straus and Giroux, New York.

Kramer, A. D.; Guillory, J. E.; and Hancock, J. T. 2014. Experimental evidence of massive-scale emotional contagion through social networks. Proceedings of the National Academy of Sciences 201320040.

Liou, C.-Y.; Huang, J.-C.; and Yang, W.-C. 2008. Modeling word perception using the Elman network. Neurocomputing 71(16-18):3150–3157.

Lopuhaa, H. P.; Rousseeuw, P. J.; et al. 1991. Breakdown points of affine equivariant estimators of multivariate location and covariance matrices. The Annals of Statistics 19(1):229–248.

Lowd, D., and Meek, C. 2005. Adversarial learning. In International Conference on Machine Learning, 641–647. ACM.
Martin, J.; Everitt, T.; and Hutter, M. 2016. Death and suicide in universal artificial intelligence. In Artificial General Intelligence. Springer. 23–32.

Mhamdi, E. M. E.; Guerraoui, R.; and Rouault, S. 2018. The hidden vulnerability of distributed learning in Byzantium. In International Conference on Machine Learning, 3518–3527.

Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Nakamoto, S. 2008. Bitcoin: A peer-to-peer electronic cash system.

Neumann, J. v., and Morgenstern, O. 1944. Theory of Games and Economic Behavior. Princeton: Princeton University Press.

Ng, A. Y.; Russell, S. J.; et al. 2000. Algorithms for inverse reinforcement learning. In ICML, 663–670.

Organization, W. H., et al. 2009. Death and DALY estimates for 2004 by cause for WHO member states.

Orseau, L., and Armstrong, M. 2016. Safely interruptible agents. In Uncertainty in Artificial Intelligence: 32nd Conference (UAI 2016), edited by Alexander Ihler and Dominik Janzing, 557–566.

Ricci, F.; Rokach, L.; and Shapira, B. 2015. Recommender systems: introduction and challenges. In Recommender Systems Handbook. Springer. 1–34.

Russell, S.; Dewey, D.; and Tegmark, M. 2015. Research priorities for robust and beneficial artificial intelligence. AI Magazine 36(4):105–114.

Satterthwaite, M. A. 1975. Strategy-proofness and Arrow's conditions: Existence and correspondence theorems for voting procedures and social welfare functions. Journal of Economic Theory 10(2):187–217.

Simpson, E. H. 1951. The interpretation of interaction in contingency tables. Journal of the Royal Statistical Society. Series B (Methodological) 238–241.

Soares, N., and Fallenstein, B. 2017. Agent foundations for aligning machine intelligence with human interests: a technical research agenda. In The Technological Singularity. Springer. 103–125.

Soares, N. 2015. Aligning superintelligence with human interests: An annotated bibliography. Intelligence 17(4):391–444.

Soares, N. 2016. The value learning problem. In Ethics for Artificial Intelligence Workshop at the 25th International Joint Conference on Artificial Intelligence.

Su, J.; Vargas, D. V.; and Kouichi, S. 2017. One pixel attack for fooling deep neural networks. arXiv preprint arXiv:1710.08864.

Tegmark, M. 2017. Life 3.0: Being Human in the Age of Artificial Intelligence. NY: Allen Lane.

Thaler, R., and Sunstein, C. 2009. Nudge: Improving Decisions About Health, Wealth, and Happiness. Penguin Books.

Wängberg, T.; Böörs, M.; Catt, E.; Everitt, T.; and Hutter, M. 2017. A game-theoretic analysis of the off-switch game. In International Conference on Artificial General Intelligence, 167–177. Springer.

Yudkowsky, E. 2004. Coherent extrapolated volition. Singularity Institute for Artificial Intelligence.