Towards Robust End-to-End Alignment

Lê Nguyên Hoang
EPFL, Chemin Alan Turing, Lausanne 1015, Switzerland

Abstract

Robust alignment is arguably both critical and extremely challenging. Loosely, it is the problem of designing algorithmic systems with strong guarantees of always being beneficial for mankind. In this paper, we propose a preliminary research program to address it in a reinforcement learning framework. This roadmap aims at decomposing the end-to-end alignment problem into numerous more tractable subproblems. We hope that each subproblem is sufficiently orthogonal to others to be tackled independently, and that combining the solutions to all such subproblems may yield a solution to alignment.

Introduction

As they are becoming more and more capable and ubiquitous, AIs are raising numerous concerns, including fairness, privacy, filter bubbles, addiction, job displacement or even existential risks (Russell, Dewey, and Tegmark 2015; Tegmark 2017). It has been argued that aligning the goals of AI systems with humans' preferences would be an efficient way to make them reliably beneficial and to avoid potential catastrophic risks (Bostrom 2014; Hoang 2018a). In fact, given the global influence of today's large-scale recommender systems (Kramer, Guillory, and Hancock 2014), it already seems urgent to propose even partial solutions to alignment.

Unfortunately, it has also been argued that alignment is an extremely difficult problem. In fact, (Bostrom 2014) argues that it "is a research challenge worthy of some of the next generation's best mathematical talent". To address it, the Future of Life Institute proposed a landscape of AI safety research (see https://futureoflife.org/landscape/). Meanwhile, (Soares 2015; Soares and Fallenstein 2017) listed important ideas in this line of work. We hope that this paper will contribute to outlining the main challenges posed by alignment.

In particular, we shall introduce a complete research program to robustly align AIs. Robustness here refers to numerous possible failure modes, including overfitting, hazardous exploration, evasion attacks, poisoning attacks, crash tolerance, Byzantine resilience, reward hacking and wireheading. To guarantee such robustness, we argue that it is desirable to structure (at least conceptually) our AI systems in their entirety. This motivated us to propose a roadmap for robust end-to-end alignment.

While much of our proposal is speculative, we believe that several of the ideas presented here will be critical for AI safety and alignment. More importantly, we hope that this will be a useful roadmap for both AI experts and non-experts to better estimate how they can best contribute to the effort.

Given the complexity of the problem, our roadmap here will likely be full of gaps and false good ideas. It is important to note that our purpose is not to propose a definite perfect solution. Rather, we aim at presenting a sufficiently good starting point for others to build upon.

The Roadmap

Our roadmap consists of identifying key steps to alignment. For the sake of exposition, these steps will be personified by 5 characters, called Alice, Bob, Charlie, Dave and Erin. Roughly speaking, Erin will be collecting data from the world, Dave will use these data to infer the likely states of the world, Charlie will compute the desirability of the likely states of the world, Bob will derive incentive-compatible rewards to motivate Alice to take the right decision, and Alice will optimize decision-making. This decomposition is graphically represented in Figure 1.

Figure 1: We decompose the alignment problem into 5 key steps: data collection, world model inference, desirability learning, incentive design and reinforcement learning.

Evidently, Alice, Bob, Charlie, Dave and Erin need not be 5 different AIs. Typically, it may be much more computationally efficient to merge Charlie and Dave. Nevertheless, at least for pedagogical reasons, it seems useful to first dissociate the different roles that these AIs have.

In the sequel, we shall further detail the challenges posed by each of the 5 AIs. We shall also argue that, for robustness and scalability reasons, these AIs will need to be further divided into many more AIs. We will see that this raises additional challenges. We shall also make a few non-technical remarks, before concluding.
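Purely to make this decomposition concrete, the data flow between the five roles can be pictured as a pipeline of interfaces. The following sketch is only illustrative: every type, name and signature in it is invented here and is not part of the roadmap itself.

```python
from dataclasses import dataclass
from typing import Callable, List

# Illustrative-only interfaces mirroring the informal data flow
# Erin -> Dave -> Charlie -> Bob -> Alice. All names are hypothetical.

@dataclass
class WorldModel:
    description: str    # one candidate state of the world
    probability: float  # Dave's credence in this candidate

ErinCollect = Callable[[], List[str]]                  # Erin: raw observations
DaveInfer = Callable[[List[str]], List[WorldModel]]    # Dave: likely world states
CharlieScore = Callable[[WorldModel], float]           # Charlie: desirability score
BobReward = Callable[[List[WorldModel], CharlieScore], float]  # Bob: Alice's reward

def one_step(erin: ErinCollect, dave: DaveInfer,
             charlie: CharlieScore, bob: BobReward) -> float:
    """One conceptual pass through the pipeline; Alice would then optimize
    her decisions against the reward returned by Bob."""
    data = erin()
    world_models = dave(data)
    return bob(world_models, charlie)
```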
Alice's reinforcement learning

It seems that today's most promising framework for large-scale AIs is that of reinforcement learning. In reinforcement learning, an AI can be regarded as a decision-making process. At time t, the AI observes some state of the world s_t. Depending on its inner parameters θ_t, it then takes (possibly randomly) some action a_t.

The decision a_t then influences the next state and turns it into s_{t+1}. The transition from s_t to s_{t+1} given action a_t is usually assumed to be nondeterministic. In any case, the AI then receives a reward R_{t+1}. The internal parameters θ_t of the AI may then be updated into θ_{t+1}, depending on the previous parameters θ_t, action a_t, state s_{t+1} and reward R_{t+1}.

Note that this is a very general framework. In fact, we humans are arguably (at least partially) subject to this framework. At any point in time, we observe new data s_t that informs us about the world. Using an inner model of the world θ_t, we then infer what the world probably is like, which motivates us to take some action a_t. This may affect what likely next data s_{t+1} will be observed, and may be accompanied by a rewarding (or painful) feeling R_{t+1}, which will motivate us to update our inner model of the world θ_t into θ_{t+1}.

Let us call Alice the AI in charge of performing this reinforcement learning reasoning. Alice can thus be viewed as an algorithm, which inputs observed states s_t and rewards R_t, and undertakes actions a_t so as to typically maximize some discounted sum of expected future rewards.
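To make the notation above concrete, here is a minimal sketch of this interaction loop on an invented two-state environment, with a tabular update rule standing in for Alice's parameters θ_t; the transition probabilities, learning rate and discount factor are all placeholders.

```python
import random

def environment_step(state, action):
    """Toy nondeterministic environment: returns (s_{t+1}, R_{t+1})."""
    if action == 1 and random.random() < 0.8:
        return 1, 1.0   # action 1 usually leads to the rewarding state
    return 0, 0.0

gamma, alpha, epsilon = 0.9, 0.1, 0.1          # discount, learning rate, exploration rate
theta = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}   # Alice's inner parameters

state = 0
for t in range(10_000):
    # Alice picks a_t from s_t using theta_t (here, epsilon-greedy).
    if random.random() < epsilon:
        action = random.choice((0, 1))
    else:
        action = max((0, 1), key=lambda a: theta[(state, a)])

    next_state, reward = environment_step(state, action)

    # Update rule: theta_{t+1} as a function of (theta_t, a_t, s_{t+1}, R_{t+1}).
    best_next = max(theta[(next_state, a)] for a in (0, 1))
    theta[(state, action)] += alpha * (reward + gamma * best_next - theta[(state, action)])
    state = next_state

# theta now estimates the discounted sum of expected future rewards of each (state, action).
```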
Such actions will probably be mostly of the form of messages sent through the Internet. This may sound benign. But it is not. The YouTube recommender system might suggest billions of antivax videos, causing a major decrease of vaccination and a rise of deadly diseases. Worse, if an AI is in control of 3D-printers, then a message that tells them to construct killer drones to cause a genocide would be catastrophic. On a brighter note, if an AI now promotes convincing eco-friendly messages every day to billions of people, the public opinion on climate change may greatly change.

Note that, as opposed to all other components, in some sense, Alice is the real danger. Indeed, in our framework, she is the only one that really undertakes actions. More precisely, only her actions will be unconstrained (although others highly influence her decision-making and are thus critical as well).

As a result, it is of the utmost importance that Alice be well-designed. Some past works (Orseau and Armstrong 2016; El Mhamdi et al. 2017) have proposed to restrict the learning capabilities of Alice to provide provably desirable properties. Typically, they proposed to allow only a subclass of learning algorithms, i.e. of update rules of θ_{t+1} as a function of (θ_t, a_t, s_{t+1}, R_{t+1}). However, such restrictions might be too costly. And this may be a big problem.

Indeed, there is already a race between competing companies in competing countries to construct powerful AIs. While it might be possible for some countries to impose some restrictions on some AIs of some companies, it is unlikely that all companies of all countries will accept to be restricted, especially if the restrictions are too constraining. In fact, AI safety will be useful only if the most powerful AIs are all subject to safety measures. As a result, the safety measures that are proposed should not be too constraining. In other words, there are constraints on the safety constraints that can be imposed. This is what makes AI safety so challenging.

As a result, what is perhaps more interesting are the ideas proposed by (Amodei et al. 2016) to make reinforcement learning safer, especially using model lookahead. This essentially corresponds to Alice simulating many likely scenarios before undertaking any action. More generally, Alice faces a safe exploration problem.
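As a minimal sketch of what model lookahead can mean in this setting, an agent can screen each candidate action by simulating rollouts in its own world model and discarding actions whose simulated outcomes ever look catastrophic. The world model, actions and threshold below are all invented, and this is only one possible reading of the idea.

```python
import random

def rollout_worst(world_model, state, action, horizon=5):
    """Simulate one imagined trajectory under `action` and return its worst value."""
    worst = float("inf")
    for _ in range(horizon):
        state = world_model(state, action)
        worst = min(worst, state["value"])
    return worst

def lookahead_choice(world_model, state, candidate_actions, n_rollouts=200, threshold=0.0):
    """Keep only actions whose simulated worst case stays above `threshold`,
    then pick the one with the best worst case; otherwise, defer to humans."""
    acceptable = []
    for action in candidate_actions:
        worst = min(rollout_worst(world_model, state, action) for _ in range(n_rollouts))
        if worst >= threshold:
            acceptable.append((worst, action))
    return max(acceptable)[1] if acceptable else "ask for help"

# Invented world model: "risky" usually pays off but occasionally collapses.
def toy_world_model(state, action):
    step = random.gauss(0.5, 0.2) if action == "risky" else random.gauss(0.1, 0.05)
    if action == "risky" and random.random() < 0.02:
        step = -100.0
    return {"value": state["value"] + step}

print(lookahead_choice(toy_world_model, {"value": 1.0}, ["risky", "cautious"]))
```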
But this is not all. Given that AIs will likely be based on machine learning, and given the lack of verification methods for AIs obtained by machine learning, we should not expect AIs to be correct all the time. Just like humans, AIs will likely sometimes be wrong. But this is extremely worrisome. Indeed, even if an AI is right 99.9999% of the time, it will still be wrong one time out of a million. Yet, AIs like recommender systems or autonomous cars take billions of decisions every day. In such cases, thousands of AI decisions may be unboundedly wrong every day!

This problem can become even more worrisome if we take into account the fact that hackers may attempt to take advantage of the AIs' deficiencies. Such hackers may typically submit only data that correspond to cases where the AIs are wrong. This is known as evasion attacks (Lowd and Meek 2005; Su, Vargas, and Kouichi 2017; Gilmer et al. 2018). To avoid evasion attacks, it is crucial for an AI to never be unboundedly wrong, e.g. by reliably measuring its own confidence in its decisions and asking for help in cases of great uncertainty.

Now, even if Alice is well-designed, she will only be an effective optimization algorithm. Unfortunately, this is no guarantee of safety or alignment. Typically, because of humans' well-known addiction to echo chambers (Haidt 2012), a watch-time-maximizing YouTube recommender AI may amplify filter bubbles, which may lead to worldwide geopolitical tensions. Both misaligned and unaligned AIs will likely lead to very undesirable consequences.

In fact, (Bostrom 2014) even argues that, to best reach its goals, any sufficiently strategic AI will likely first aim at so-called instrumental goals, e.g. gaining vastly more resources and guaranteeing self-preservation. But this is very unlikely to be in humans' best interests. In particular, it will likely motivate the AI to undertake actions that we would not regard as desirable.

To make sure that Alice will want to behave as we want her to, it seems critical to at least partially control the observed state s_{t+1} or the reward R_{t+1}. Note that this is similar to the way children are taught to behave. We do so by exposing them to specific observed states, by punishing them when the sequence (s_t, a_t, s_{t+1}) is undesirable, and by rewarding them when the sequence (s_t, a_t, s_{t+1}) is desirable.

Whether or not Alice's observed state s_t is constrained, her rewards R_t are clearly critical. They are her incentives, and will thus determine her decision-making. Unfortunately, determining the adequate rewards R_t to be given to Alice is an extremely difficult problem. It is, in fact, the key to alignment. Our roadmap to solve it identifies 4 key steps, incarnated by Erin, Dave, Charlie and Bob.

Erin's data collection problem

In order to do good, it is evidently crucial to be given a lot of reliable data. Indeed, even the most brilliant mind will be unable to know anything about the world if it does not have any data from that world. This is particularly true when the goal is to undertake desirable actions, or to make sure that one's action will not have potentially catastrophic consequences.

Evidently, much data is already available on the Internet. It is likely that any large-scale AI will have access to the Internet, as is already the case of the Facebook recommender system. However, it is important to take into account the fact that the data on the Internet is not always fully reliable. It may be full of fake news, fraudulent entries, misleading videos, hacked posts and corrupted files.

It may then be relevant to invest in more reliable and relevant data collection. This would be Erin's job. Typically, Erin may want to collect economic metrics to better assess needs. Recently, it has been shown that satellite images combined with deep learning make it possible to compute all sorts of useful economic indicators (Jean et al. 2016), including poverty risks and agricultural productivity. It is possible that the use of still more sensors can further increase our capability to improve life standards, especially in developing countries.

To guarantee the reliability of such data, cryptographic and distributed computing solutions are likely to be useful as well, as they already are on the web. In particular, distributed computing, combined with recent Byzantine-resilient consensus algorithms like Blockchain (Nakamoto 2008) or Hashgraph (Baird 2016), could guarantee the reliable storage and traceability of critical information.

Note though that such data collection mechanisms could pose major privacy issues. It is a major current challenge to balance the usefulness of collected data and the privacy violations they inevitably cause. Some possible solutions include differential privacy (Dwork, Roth, and others 2014), or weaker versions like generative-adversarial privacy (Huang et al. 2017). It could also be possible to combine these with more cryptographic solutions, like homomorphic encryption or multi-party computation. It is interesting that such cryptographic solutions may be (essentially) provably robust to any attacker, including a superintelligence (though the possible use of quantum computers may require postquantum cryptography).
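As a standard illustration of the differential privacy idea mentioned above, Erin could add calibrated Laplace noise to any aggregate statistic she publishes. The data, threshold and privacy budget below are invented for the example.

```python
import random

def laplace_noise(scale: float) -> float:
    """Sample Laplace(0, scale) noise as a difference of two exponentials."""
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon: float = 0.1) -> float:
    """Release a count with epsilon-differential privacy.

    A counting query has sensitivity 1 (one person changes the count by at
    most 1), so Laplace(1/epsilon) noise is the standard calibration."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

# Hypothetical use: Erin reports how many surveyed households fall below a
# poverty threshold without exposing any individual household.
households = [{"income": random.uniform(0.0, 100.0)} for _ in range(1000)]
print(private_count(households, lambda h: h["income"] < 20.0, epsilon=0.1))
```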
Dave's world model problem

Unfortunately, raw data are usually extremely messy, redundant, incomplete, unreliable, poisoned and even hacked. To tackle these issues, it is necessary to infer the likely actual states of the world, given Erin's collected data. This will be Dave's job.

The overarching principle of Dave's job is probably going to be some deep representation learning. This corresponds to determining low-dimensional representations of high-dimensional data. This basic idea has given rise to today's most promising unsupervised machine learning algorithms, e.g. word vectors (Mikolov et al. 2013), autoencoders (Liou, Huang, and Yang 2008) and generative adversarial networks (GANs) (Goodfellow et al. 2014).

Given how crucial it is for Dave to have an unbiased representation of the world, much care will be needed to make sure that Dave's inference will foresee selection biases. For instance, when asked to provide images of CEOs, Google Images may return a greater ratio of male CEOs than the actual ratio. More generally, such biases can be regarded as instances of Simpson's paradox (Simpson 1951), and boil down to the saying "correlation is not causation". It seems crucial that Dave does not fall into this trap.

In fact, data can be worse than unintentionally misleading. Given how influential Alice may be, there will likely be great incentives for many actors to bias Erin's data gathering, and to thus fool Dave. This is known as poisoning attacks (Blanchard et al. 2017; Mhamdi, Guerraoui, and Rouault 2018; Damaskinos et al. 2018). It seems extremely important that Dave anticipates the fact that the data he was given may be purposely biased, if not hacked. Like any good journalist, Dave will likely need to cross-check information from different sources to infer the most likely states of the world.

This inference approach is well captured by the Bayesian paradigm (Hoang 2018b). In particular, Bayes rule is designed to infer the likely causes of the observed data D. These causes can also be regarded as theories T (and such theories may assume that some of the data were hacked). Bayes rule tells us that the reliability of theory T given data D can be derived formally by the following computation:

    P[T | D] = P[D | T] P[T] / P[D].

One typical instance of Dave's job is the problem of inferring global health from a wide variety of collected data. This is what has been done by (Institute for Health Metrics and Evaluation (IHME), University of Washington 2016), using a sophisticated Bayesian model that reconstructed the likely causes of deaths in countries where data were lacking.

Importantly, Bayes rule also tells us that we should not fully believe any single theory. This simply corresponds to saying that data can often be interpreted in many different mutually incompatible manners. It seems important to reason with all possible interpretations rather than isolating a single interpretation that may be flawed.

When the space of possible states of the world is large, which will surely be the case for Dave, it is often computationally intractable to reason with the full posterior distribution P[T | D]. Bayesian methods often rather propose to sample from the posterior distribution to identify a reasonable number of good interpretations of the data. These sampling methods include Monte-Carlo methods, as well as Markov-Chain Monte-Carlo (MCMC) ones.
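As a minimal illustration of this computation, consider an invented example with three candidate theories for a stream of binary sensor readings, one of which asserts that the data was partly injected by an attacker. With so few theories the posterior can be computed exactly, whereas Dave would have to sample (e.g. by MCMC) from a far larger space.

```python
from math import prod

# Observed data D: binary sensor readings (invented).
data = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]

# Candidate theories T, each with a prior and a likelihood model for a "1" reading.
theories = {
    "reliable sensor":       {"prior": 0.6, "p_one": 0.9},
    "faulty sensor":         {"prior": 0.3, "p_one": 0.5},
    "hacked, ones injected": {"prior": 0.1, "p_one": 0.99},
}

def likelihood(p_one, observations):
    """P[D | T] for independent binary readings."""
    return prod(p_one if x == 1 else 1.0 - p_one for x in observations)

# Bayes rule: P[T | D] = P[D | T] P[T] / P[D], with P[D] obtained by normalization.
joint = {name: t["prior"] * likelihood(t["p_one"], data) for name, t in theories.items()}
evidence = sum(joint.values())
for name, j in joint.items():
    print(f"P[{name} | D] = {j / evidence:.3f}")
```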
In some sense, Dave's job can be regarded as writing a compact report of all likely states of the world, given Erin's collected data. It is an open question what language Dave's report should be written in. It might be useful to make it understandable by humans. But it might be too costly as well. Indeed, Dave's report might be billions of pages long. It could be unreasonable or undesirable to make it humanly readable.

Note also that Erin and Dave are likely to gain cognitive capabilities over time. It is surely worthwhile to anticipate the complexification of Erin's data and of Dave's world models. It seems unclear so far how to do so. Some high-level (purely descriptive) language to describe world models is probably needed. In addition, this high-level language may need to be flexible enough to be reshaped and redesigned over time. This may be dubbed the world description problem. It is arguably still a very open and uncharted area of research.

Charlie's desirability learning problem

Given any of Dave's world models, Charlie's job will then be to compute how desirable this world model is. This is the desirability learning problem (Soares 2016), also known as value learning (to avoid raising eyebrows, we shall try to steer away from polarizing terminologies like values, morals or ethics). This is the problem of assigning desirability scores to different world models. These desirability scores can then serve as the basis for any agent to determine beneficial actions.

Unfortunately, determining what, say, the median human considers desirable is an extremely difficult problem. But again, it should be stressed that we should not aim at deriving an ideal inference of what people desire. This is likely to be a hopeless endeavor. Rather, we should try our best to make sure that Charlie's desirability scores will be good enough to avoid catastrophic outcomes, e.g. world destruction, global suffering or major discrimination.

One proposed solution to infer human preferences is so-called inverse reinforcement learning (Ng, Russell, and others 2000; Evans, Stuhlmüller, and Goodman 2016). Assuming that humans perform reinforcement learning to choose their actions, and given examples of actions taken by humans in different contexts, inverse reinforcement learning infers what were the humans' likely implicit rewards that motivated their decision-making. Assuming we can somehow separate humans' selfish rewards from altruistic ones, inverse reinforcement learning seems to be a promising first step towards inferring humans' preferences from data. There are, however, many important considerations to be taken into account, which we discuss below.
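The following toy sketch conveys the flavor of this inference under a common simplifying assumption, namely that the observed human chooses actions with probability proportional to the exponential of their implicit reward (a Boltzmann-rational model). The contexts, candidate reward functions and rationality parameter are all invented for illustration, and real inverse reinforcement learning must additionally deal with sequential decisions rather than one-shot choices.

```python
from math import exp, prod

# Observed human choices: (context, chosen action). Invented data.
observations = [("evening", "read"), ("evening", "read"), ("evening", "tv")]
actions = ("read", "tv")

# Candidate implicit reward functions that might explain the behavior.
candidate_rewards = {
    "values learning": {("evening", "read"): 1.0, ("evening", "tv"): 0.0},
    "values comfort":  {("evening", "read"): 0.0, ("evening", "tv"): 1.0},
    "indifferent":     {("evening", "read"): 0.5, ("evening", "tv"): 0.5},
}
beta = 3.0  # assumed degree of rationality of the observed human

def choice_probability(reward, context, action):
    """Boltzmann-rational choice model: P[action | context, reward]."""
    weights = {a: exp(beta * reward[(context, a)]) for a in actions}
    return weights[action] / sum(weights.values())

# Posterior over the candidate rewards given the observations (uniform prior).
likelihoods = {
    name: prod(choice_probability(r, c, a) for c, a in observations)
    for name, r in candidate_rewards.items()
}
total = sum(likelihoods.values())
for name, lik in likelihoods.items():
    print(f"P[{name} | observations] = {lik / total:.3f}")
```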
mans in different contexts, inverse reinforcement learning While majority judgment seems to be a promising ap- infers what were the humans’ likely implicit rewards that proach, it does raise the question of how to compare two dif- motivated their decision-making. Assuming we can some- ferent individuals’ scores. It is not clear that score = 5 given how separate humans’ selfish rewards from altruistic ones, by John has a meaning comparable to Jane’s score = 5. In inverse reinforcement learning seems to be a promising first fact, according to a theorem by von Neumann and Morgen- step towards inferring humans’ preferences from data. There stern (Neumann and Morgenstern 1944), within their frame- are, however, many important considerations to be taken into work, utility functions are only defined up to a positive affine account, which we discuss below. transformation. More work is probably needed to determine First, it is important to keep in mind that, despite Dave’s how to scale different individuals’ utility functions appro- effort and because of Erin’s limited and possibly biased data priately, despite previous attempts in special cases (Hoang, collection, Dave’s world model is fundamentally uncertain. Soumis, and Zaccour 2016). Again, it should be stressed that In fact, as discussed previously, Dave would probably rather we should not aim at an ideal solution; a workable reason- present a distribution of likely world models. Charlie’s job able solution is much better than no solution at all. should be regarded as a scoring of all such likely world mod- Now, arguably, humans’ current preferences are almost els. In particular, she should not assign a single number to surely undesirable. Indeed, over the last decades, psychol- the current state of the world, but, rather, a distribution of ogy has been showing again and again that human think- likely scores of the current state of the world. This distribu- ing is full of inconsistencies, fallacies and cognitive biases tion should convey the uncertainty about the actual state of (Kahneman 2011). We tend to first have instinctive reactions the world. Besides, as we shall see, this uncertainty is likely to stories or facts (Bloom 2016), which quickly becomes the to be crucial for Bob to choose incentive-compatible rewards position we will want to defend at all costs (Haidt 2012). for Alice adequately. Worse, we are unfortunately largely unaware of why we be- Another challenging aspect of Charlie’s job will be to pro- lieve or want what we believe or want. This means that our vide a useful representation of potential human disagree- current preferences are unlikely to be what we would prefer, ments about the desirability of different states of the world. if we were more informed, thought more deeply, and tried to Humans’ preferences are diverse and may never converge. make sure our preferences were as well-founded as possible. This should not be swept under the rug. Instead, we need to And arguably, we should prefer what we would prefer to agree on some way to mitigate disagreement. prefer, rather than what we instinctively prefer. Typically, one might prefer to watch a cat video, even though one might 3 prefer to prefer mathematics videos over cat videos. Desir- To avoid raising eyebrows, we shall try to steer away from polarizing terminologies like values, moral or ethics. ablity scores should arguably encode what we would prefer to prefer, rather than what we instinctively prefer. 
While majority judgment seems to be a promising approach, it does raise the question of how to compare two different individuals' scores. It is not clear that a score of 5 given by John has a meaning comparable to a score of 5 given by Jane. In fact, according to a theorem by von Neumann and Morgenstern (Neumann and Morgenstern 1944), within their framework, utility functions are only defined up to a positive affine transformation. More work is probably needed to determine how to scale different individuals' utility functions appropriately, despite previous attempts in special cases (Hoang, Soumis, and Zaccour 2016). Again, it should be stressed that we should not aim at an ideal solution; a workable reasonable solution is much better than no solution at all.

Now, arguably, humans' current preferences are almost surely undesirable. Indeed, over the last decades, psychology has been showing again and again that human thinking is full of inconsistencies, fallacies and cognitive biases (Kahneman 2011). We tend to first have instinctive reactions to stories or facts (Bloom 2016), which quickly become the positions we will want to defend at all costs (Haidt 2012). Worse, we are unfortunately largely unaware of why we believe or want what we believe or want. This means that our current preferences are unlikely to be what we would prefer, if we were more informed, thought more deeply, and tried to make sure our preferences were as well-founded as possible.

And arguably, we should prefer what we would prefer to prefer, rather than what we instinctively prefer. Typically, one might prefer to watch a cat video, even though one might prefer to prefer mathematics videos over cat videos. Desirability scores should arguably encode what we would prefer to prefer, rather than what we instinctively prefer.

To understand this, a thought experiment may be useful. Let us imagine better versions of us. Each current me is thereby associated with a me++. A me++ is what current me would desire, if current me were smarter, thought much longer about what he finds desirable, and analyzed all imaginable data of the world. Arguably, me++'s desirability score is "more right" than current me's.

This can be illustrated by the fact that past standards are often no longer regarded as desirable. Our intuitions about the desirability of slavery, homosexuality and gender discrimination have been completely upset over the last century, if not over the last few decades. It seems unlikely that all of our other intuitions will never change. In particular, it seems unlikely that me++ will fully agree with current me. And it seems reasonable to argue that me++ would be "more right" than current me.

These remarks are the basis of coherent extrapolated volition (Yudkowsky 2004). The basic idea is that we should aim at the preferences that future versions of ourselves would eventually adopt, if they were vastly more informed, had much more time to ponder what they regard as desirable, and tried their best to be better versions of themselves. In some sense, instead of making current me's debate about what's desirable (which often turns into a pointless debacle), we should let me++'s debate. In fact, since me++'s supposedly already know everything about other me++'s, there is actually no point in getting them to debate. It suffices to aggregate their preferences through some social choice mechanism. This is the preference aggregation problem.
It is noteworthy that we clearly have epistemic uncertainty about me++'s. Determining me++'s desirability scores may be called the coherent extrapolated individual volition problem. Interestingly, this is (mostly) a prediction problem. But it is definitely too ambitious to predict them with absolute certainty. Bayes rule tells us that we should rather describe these desirability scores by a probability distribution of likely desirability scores.

Such scores could also be approximated using a large number of proxies, as is done by boosting methods (Arora, Hazan, and Kale 2012). The use of several proxies could avoid the overfitting of any single proxy. Typically, rather than relying solely on DALYs (Organization and others 2009), we probably should invoke machine learning methods to combine a large number of similar metrics, especially those that aim at describing other desirable economic metrics, like the human development index (HDI) or gross national happiness (GNH). Still another approach may consist of analyzing "typical" human preferences, e.g. by using collaborative filtering techniques (Ricci, Rokach, and Shapira 2015). Evidently, much more research is needed along these lines.

Computing the desirability of a given world state is Charlie's job. In some sense, Charlie's job would thus be to remove cognitive biases from our intuitive preferences, so that they still basically reflect what we really regard as preferable, but in a more coherent and informed manner. This is an incredibly difficult problem, which will likely take decades to sort out reasonably well. This is why it is of the utmost importance that it be started as soon as possible. Let us try our best to describe, informally and formally, what better versions of ourselves would likely regard as desirable. Let us try to predict the volition of me++'s.

This attempt is likely going to be shocking to us all. Indeed, we should expect that better versions of ourselves will find desirable things that the current versions of ourselves find repelling. Unfortunately though, we humans tend to react poorly to disagreeing judgments. And this is likely to hold even when the opposition comes from our better selves. This poses a great scientific and engineering challenge. How can one be best convinced of the judgments that he or she will eventually embrace but does not yet? In other words, how can we quickly agree with better versions of ourselves? What could someone else say to get me closer to my me++? This may be dubbed the individual improvement problem.

To address this issue, (Irving, Christiano, and Amodei 2018) have discussed the possibility of setting up a debate between opposing AIs. In particular, they asked whether a human judge would be able to lean towards the better AI for the right reasons. Interestingly, such a debate might allow for significantly more powerful "proofs of superiority" than monologues, at least if the analogy with the so-called polynomial hierarchy of complexity theory holds.

This question is critical for alignment, as it will likely be a key challenge to build trust in the systems we design. But evidently, this is a more general question that should be of interest to anyone who desires to do good.

Bob's incentive design

The last piece of the jigsaw is Bob's job. Bob is in charge of computing the rewards that Alice will receive, based on the work of Erin, Dave and Charlie. Evidently, he could simply compute the expectation of Charlie's scores for the likely states of the world. But this is probably a bad idea, as it opens the door to reward hacking.

Recall that Alice's goal is to maximize her discounted expected future rewards. But given that Alice knows (or is likely to eventually guess) how her rewards are computed, instead of undertaking the actions that we would want her to, Alice could hack Erin, Dave or Charlie's computations, so that such hacked computations yield large rewards. This is sometimes called the wireheading problem.

Since all this computation starts with Erin's data collection, one way for Alice to increase her rewards would be to feed Erin with fake data that will make Dave infer a deeply flawed state of the world, which Charlie may regard as ideal. Worse, Alice may then find out that the best way to do so would be to invest all of Earth's resources into misleading Erin, Dave and Charlie. This could potentially be extremely bad for mankind. Indeed, especially if Alice cares about discounted future rewards, she might eventually regard mankind as a possible threat to her objective.

This is why it is of the utmost importance that Alice's incentives be (partially) aligned with Erin, Dave and Charlie performing well and being accurate. This will be Bob's job. Bob will need to make sure that, while Alice's rewards do correlate with Charlie's scores, they also give Alice the incentives to guarantee that Erin, Dave and Charlie perform as reliably as possible the job they were given.
In fact, it even seems desirable that Alice be incentivized to constantly upgrade Erin, Dave and Charlie for the better. Ideally, she would even want them to be computationally more powerful than herself, especially in the long run. This approach would bear resemblance to the idea of self-nudge (Thaler and Sunstein 2009). This corresponds to strategies that we humans sometimes use to nudge ourselves (or others) into doing what we want to want to do, rather than what our latest emotion or laziness invites us to do.

Unfortunately, it seems unclear how Bob can best make sure that Alice has such incentives. Perhaps a good idea is to penalize Dave's reported uncertainty about the likely states of the world. Typically, Bob should make sure Alice's rewards are affected by the reliability of Erin's data. The more reliable Erin's data, the larger Alice's rewards. Similarly, when Dave or Charlie feel that their computations are unreliable, Bob should take note of this and adjust Alice's rewards accordingly, to motivate Alice to provide larger resources for Charlie's computations.

Now, Bob should also balance the desire to retrieve more reliable data and perform more trustworthy computations against the fact that such efforts will necessarily require the exploitation of more resources, probably at the expense of Charlie's scores. It is this non-trivial trade-off that Bob will need to take care of.
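The paper leaves open how Bob should implement such incentives. Purely as a hypothetical illustration of the two ideas above (penalizing Dave's reported uncertainty and rewarding the reliability of Erin's data), one could imagine a reward of the following shape, where every quantity and coefficient is invented and would in practice have to be made robust against manipulation.

```python
def bobs_reward(charlie_scores, dave_uncertainty, erin_reliability,
                lam_uncertainty=1.0, lam_reliability=1.0):
    """Hypothetical reward for Alice combining three signals.

    charlie_scores:   Charlie's scores for Dave's likely world models
    dave_uncertainty: how unsure Dave reports being about the state of the world (0..1)
    erin_reliability: how reliable Erin reports her data to be (0..1)

    The reward correlates with Charlie's scores, but shrinks when Dave is
    uncertain and grows when Erin's data is reliable, so that Alice gains by
    keeping the rest of the pipeline accurate."""
    expected_score = sum(charlie_scores) / len(charlie_scores)
    return (expected_score
            - lam_uncertainty * dave_uncertainty
            + lam_reliability * erin_reliability)

# Invented numbers: a fairly desirable world, but inferred from shaky data.
print(bobs_reward([6.0, 7.5, 5.5], dave_uncertainty=0.8, erin_reliability=0.3))
```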
Bob's work might be simplified by some (partial) control of Alice's actions or world model. Although it seems unclear so far how, techniques like interactive proofs (IP) (Babai 1985; Goldwasser, Micali, and Rackoff 1989) or probabilistically checkable proofs (PCP) (Arora et al. 1998) might be useful to force Alice to prove her correct behavior. By requesting such proofs to yield large rewards, Bob might be able to incentivize Alice's transparency. All such considerations make up Bob's incentive problem.

It may or may not be useful to enable Bob to switch off Alice. It should be stressed though that (safe) interruptibility is nontrivial, as discussed by (Orseau and Armstrong 2016; El Mhamdi et al. 2017; Martin, Everitt, and Hutter 2016; Hadfield-Menell et al. 2016a; 2016b; Wängberg et al. 2017), among others. In fact, safe interruptibility seems to require very specific circumstances, e.g. Alice being indifferent to interruption, Alice being programmed to be suicidal in case of potential harm, or Alice having more uncertainty about her rewards than Bob being able to take over Alice's job. It seems unclear so far how relevant such circumstances will be to Bob's control problem over Alice (note though that this may be very relevant assuming that there are several Alices, as will be proposed later on). Besides, instead of interrupting Alice, Bob might prefer to guide Alice towards preferable actions by acting on Alice's rewards.

On another note, it may be computationally more efficient for all if, instead of merely transmitting a reward, Bob also feeds Alice with "backpropagating signals", that is, information not about the reward itself, but about its gradient with respect to key variables, e.g. Charlie's score or Erin's reliability. Having said this, we leave open the technical question of how to best design this.

Decentralization

We have decomposed alignment into 5 components for the sake of exposition. However, any component will likely have to be decentralized to gain reliability and scalability. In other words, instead of having a single Alice, a single Bob, a single Charlie, a single Dave and a single Erin, it seems crucial to construct multiple Alices, Bobs, Charlies, Daves and Erins.

This is key to crash tolerance. Indeed, a single computer doing Bob's job could crash and leave Alice without reward nor penalty. But if Alice's rewards are an aggregate of rewards given by a large number of Bobs, then even if some of the Bobs crash, Alice's rewards will remain mostly the same. But crash tolerance is likely to be insufficient. Instead, we should design Byzantine-resilient mechanisms, that is, mechanisms that still perform correctly despite the presence of hacked or malicious Bobs. Estimators with large statistical breakdown points (Lopuhaa, Rousseeuw, and others 1991), e.g. (geometric) medians and variants (Blanchard et al. 2017), may be useful for this purpose.
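As an illustration of the Byzantine-resilient aggregation idea just mentioned, here is a minimal Weiszfeld-style iteration computing an approximate geometric median of the reward vectors proposed by several Bobs. The vectors are invented, and a real deployment would need convergence checks, tie handling and authenticated channels.

```python
def geometric_median(points, iterations=100, eps=1e-9):
    """Approximate geometric median of equal-length vectors (Weiszfeld iteration)."""
    dim = len(points[0])
    estimate = [sum(p[i] for p in points) / len(points) for i in range(dim)]  # start at mean
    for _ in range(iterations):
        weights = []
        for p in points:
            dist = sum((p[i] - estimate[i]) ** 2 for i in range(dim)) ** 0.5
            weights.append(1.0 / max(dist, eps))      # distant (outlier) points get low weight
        total = sum(weights)
        estimate = [sum(w * p[i] for w, p in zip(weights, points)) / total
                    for i in range(dim)]
    return estimate

# Reward vectors proposed by five Bobs for the same action; one Bob is malicious.
proposed = [[1.0, 0.9], [1.1, 1.0], [0.9, 1.0], [1.0, 1.1], [100.0, -50.0]]
print(geometric_median(proposed))   # stays close to the honest majority, unlike the mean
```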
Evidently, in this Byzantine environment, cryptography, especially (postquantum?) cryptographic signatures and hashes, is likely to play a critical role. Typically, Bobs' rewards will likely need to be signed. More generally, the careful design of secure communication channels between the components of the AIs seems key. This may be called the secure messaging problem.

Another difficulty is the addition of more powerful and precise Bobs, Charlies, Daves and Erins to the pipeline. It is not yet clear how to best integrate reliable newcomers, especially given that such newcomers may be malicious. In fact, they may want to first act benevolently to gain admission. But once they are numerous enough, they could take over the pipeline and, say, feed Alice with infinite rewards. This is the upgrade problem, which was recently discussed by (Christiano, Shlegeris, and Amodei 2018), who proposed using numerous weaker AIs to supervise stronger AIs. More research in this direction is probably needed.

Now, in addition to reliability, decentralization may also enable different Alices, Bobs, Charlies, Daves and Erins to focus on specific tasks. This would make it possible to separate different problems, which could lead to more optimized solutions at lower costs. To this end, it may be relevant to adapt different Alices' rewards to their specific tasks. Note though that this could also be a problem, as Alices may enter into competition with one another, as in the prisoner's dilemma. We may call this the specialization problem. Again, there seems to be a lot of new research needed to address this problem.

Another open question is the extent to which AIs should be exposed to Bobs' rewards. Typically, if a small company creates its own AI, to what extent should this AI be aligned? It should be noted that this may be computationally very costly, as it may be hard to separate the signal of interest to the AI from the noise of Bobs' rewards. Intuitively, the more influential an AI is, the more it should be influenced by Bobs' rewards. But even if this AI is small, it may be important to demand that it be influenced by Bobs, to avoid any diffusion of responsibility, i.e. many small AIs that disregard safety concerns on the ground that they each hardly have any global impact on the world.

What makes this nontrivial is that any AI may gain capability and influence over time. An unaligned weak AI could eventually become an unaligned human-level AI. To avoid this, even basic, but potentially unboundedly self-improving AIs (nonparametric AIs should perhaps be treated differently from parametric ones here) should be given at least a seed of alignment, which may grow as AIs become more powerful. More generally, AIs should strike a balance between some original (possibly unaligned) objective and the importance they give to alignment. This may be called the alignment burden assignment problem.

Figure 2 recapitulates our complete roadmap.

Figure 2: We propose to decompose alignment into 5 steps. Each step is associated with further substeps or techniques. Also, there are critical subproblems that will likely be useful for several of the 5 steps.

Non-technical challenges

Given the difficulty of alignment, its resolution will surely require solving a large number of non-technical challenges as well. We briefly mention some of them here.

Perhaps most important is the lack of respectability that is sometimes associated with this line of research. For alignment to be solved, it needs to gain respectability from the scientific community, and perhaps beyond this community as well. This is why it seems to be of the utmost importance that discussions around alignment be carried out carefully to avoid confusions.

Evidently, alignment definitely needs much more manpower, which will require funding and recruiting. It seems particularly important to attract mathematical talents towards this line of work. This evidently also raises the challenge of training as many brilliant minds as possible.

Finally, questions around AI, AI safety and moral philosophy are sadly often poorly debated. There often is a lot of overconfidence, and a lack of well-founded reasoning. For alignment research to gain momentum, it seems crucial to make debating more informative, respectful and stimulating.

Conclusion

This paper discussed the alignment problem, that is, the problem of aligning the goals of AIs with human preferences. It presented a general roadmap to tackle this issue. Interestingly, this roadmap identifies 5 critical steps, as well as many relevant aspects of these 5 steps. In other words, we have presented a large number of hopefully more tractable subproblems that readers are highly encouraged to tackle. We hope that combining the solutions to these subproblems could help to partially address alignment. And we hope that any reader will be able to better determine how he or she may best contribute to the global effort; a more complete version of this paper is also available (Hoang 2018b).

Acknowledgment. The author would like to thank El Mahdi El Mhamdi, Henrik Aslund, Sébastien Rouault and Alexandre Maurer for fruitful discussions.

References

Amodei, D.; Olah, C.; Steinhardt, J.; Christiano, P.; Schulman, J.; and Mané, D. 2016. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565.

Arora, S.; Lund, C.; Motwani, R.; Sudan, M.; and Szegedy, M. 1998. Proof verification and the hardness of approximation problems. Journal of the ACM (JACM) 45(3):501–555.

Arora, S.; Hazan, E.; and Kale, S. 2012. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing 8(1):121–164.

Arrow, K. J. 1950. A difficulty in the concept of social welfare. Journal of Political Economy 58(4):328–346.

Babai, L. 1985. Trading group theory for randomness. In Proceedings of the Seventeenth Annual ACM Symposium on Theory of Computing, 421–429. ACM.

Baird, L. 2016. Hashgraph consensus: fair, fast, Byzantine fault tolerance. Technical report, Swirlds Tech Report.
Balinski, M., and Laraki, R. 2011. Majority Judgment: Measuring, Ranking, and Electing. MIT Press.

Blanchard, P.; El Mhamdi, E. M.; Guerraoui, R.; and Stainer, J. 2017. Machine learning with adversaries: Byzantine tolerant gradient descent. In Advances in Neural Information Processing Systems, 119–129.

Bloom, P. 2016. Against Empathy: The Case for Rational Compassion. Ecco.

Bostrom, N. 2014. Superintelligence: Paths, Dangers, Strategies. OUP Oxford.

Christiano, P.; Shlegeris, B.; and Amodei, D. 2018. Supervising strong learners by amplifying weak experts. In review.

Damaskinos, G.; El Mhamdi, E. M.; Guerraoui, R.; Patra, R.; Taziki, M.; et al. 2018. Asynchronous Byzantine machine learning (the case of SGD). In International Conference on Machine Learning, 1153–1162.

Dwork, C.; Roth, A.; et al. 2014. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science 9(3–4):211–407.

El Mhamdi, E. M.; Guerraoui, R.; Hendrikx, H.; and Maurer, A. 2017. Dynamic safe interruptibility for decentralized multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, 130–140.

Evans, O.; Stuhlmüller, A.; and Goodman, N. D. 2016. Learning the preferences of ignorant, inconsistent agents. In AAAI, 323–329.

Gibbard, A. 1973. Manipulation of voting schemes: a general result. Econometrica: Journal of the Econometric Society 587–601.

Gilmer, J.; Metz, L.; Faghri, F.; Schoenholz, S. S.; Raghu, M.; Wattenberg, M.; and Goodfellow, I. 2018. Adversarial spheres. arXiv preprint arXiv:1801.02774.

Goldwasser, S.; Micali, S.; and Rackoff, C. 1989. The knowledge complexity of interactive proof systems. SIAM Journal on Computing 18(1):186–208.

Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2672–2680.

Hadfield-Menell, D.; Dragan, A.; Abbeel, P.; and Russell, S. 2016a. The off-switch game. arXiv preprint arXiv:1611.08219.

Hadfield-Menell, D.; Russell, S. J.; Abbeel, P.; and Dragan, A. 2016b. Cooperative inverse reinforcement learning. In Advances in Neural Information Processing Systems, 3909–3917.

Haidt, J. 2012. The Righteous Mind: Why Good People Are Divided by Politics and Religion. Vintage.

Hoang, L. N.; Soumis, F.; and Zaccour, G. 2016. Measuring unfairness feeling in allocation problems. Omega 65:138–147.

Hoang, L. N. 2017. Strategy-proofness of the randomized Condorcet voting system. Social Choice and Welfare 48:679–701.

Hoang, L. N. 2018a. A roadmap for the value-loading problem. arXiv preprint arXiv:1809.01036.

Hoang, L. N. 2018b. La formule du savoir : une philosophie unifiée du savoir fondée sur le théorème de Bayes. EDP Sciences. English translation forthcoming.

Huang, C.; Kairouz, P.; Chen, X.; Sankar, L.; and Rajagopal, R. 2017. Context-aware generative adversarial privacy. Entropy 19(12):656.

Institute for Health Metrics and Evaluation (IHME), University of Washington. 2016. GBD Compare data visualization.

Irving, G.; Christiano, P.; and Amodei, D. 2018. AI safety via debate. arXiv preprint arXiv:1805.00899.

Jean, N.; Burke, M.; Xie, M.; Davis, W. M.; Lobell, D. B.; and Ermon, S. 2016. Combining satellite imagery and machine learning to predict poverty. Science 353(6301):790–794.

Kahneman, D. 2011. Thinking, Fast and Slow. Farrar, Straus and Giroux, New York.

Kramer, A. D.; Guillory, J. E.; and Hancock, J. T. 2014. Experimental evidence of massive-scale emotional contagion through social networks. Proceedings of the National Academy of Sciences 201320040.

Liou, C.-Y.; Huang, J.-C.; and Yang, W.-C. 2008. Modeling word perception using the Elman network. Neurocomputing 71(16-18):3150–3157.

Lopuhaa, H. P.; Rousseeuw, P. J.; et al. 1991. Breakdown points of affine equivariant estimators of multivariate location and covariance matrices. The Annals of Statistics 19(1):229–248.

Lowd, D., and Meek, C. 2005. Adversarial learning. In International Conference on Machine Learning, 641–647. ACM.
Martin, J.; Everitt, T.; and Hutter, M. 2016. Death and suicide in universal artificial intelligence. In Artificial General Intelligence. Springer. 23–32.

Mhamdi, E. M. E.; Guerraoui, R.; and Rouault, S. 2018. The hidden vulnerability of distributed learning in Byzantium. In International Conference on Machine Learning, 3518–3527.

Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Nakamoto, S. 2008. Bitcoin: A peer-to-peer electronic cash system.

Neumann, J. v., and Morgenstern, O. 1944. Theory of Games and Economic Behavior. Princeton: Princeton University Press.

Ng, A. Y.; Russell, S. J.; et al. 2000. Algorithms for inverse reinforcement learning. In ICML, 663–670.

Organization, W. H., et al. 2009. Death and DALY estimates for 2004 by cause for WHO member states.

Orseau, L., and Armstrong, M. 2016. Safely interruptible agents. In Uncertainty in Artificial Intelligence: 32nd Conference (UAI 2016), edited by Alexander Ihler and Dominik Janzing, 557–566.

Ricci, F.; Rokach, L.; and Shapira, B. 2015. Recommender systems: introduction and challenges. In Recommender Systems Handbook. Springer. 1–34.

Russell, S.; Dewey, D.; and Tegmark, M. 2015. Research priorities for robust and beneficial artificial intelligence. AI Magazine 36(4):105–114.

Satterthwaite, M. A. 1975. Strategy-proofness and Arrow's conditions: Existence and correspondence theorems for voting procedures and social welfare functions. Journal of Economic Theory 10(2):187–217.

Simpson, E. H. 1951. The interpretation of interaction in contingency tables. Journal of the Royal Statistical Society. Series B (Methodological) 238–241.

Soares, N., and Fallenstein, B. 2017. Agent foundations for aligning machine intelligence with human interests: a technical research agenda. In The Technological Singularity. Springer. 103–125.

Soares, N. 2015. Aligning superintelligence with human interests: An annotated bibliography. Intelligence 17(4):391–444.

Soares, N. 2016. The value learning problem. In Ethics for Artificial Intelligence Workshop at the 25th International Joint Conference on Artificial Intelligence.

Su, J.; Vargas, D. V.; and Kouichi, S. 2017. One pixel attack for fooling deep neural networks. arXiv preprint arXiv:1710.08864.

Tegmark, M. 2017. Life 3.0: Being Human in the Age of Artificial Intelligence. NY: Allen Lane.

Thaler, R., and Sunstein, C. 2009. Nudge: Improving Decisions About Health, Wealth, and Happiness. Penguin Books.

Wängberg, T.; Böörs, M.; Catt, E.; Everitt, T.; and Hutter, M. 2017. A game-theoretic analysis of the off-switch game. In International Conference on Artificial General Intelligence, 167–177. Springer.

Yudkowsky, E. 2004. Coherent extrapolated volition. Singularity Institute for Artificial Intelligence.