Requisite Variety in Ethical Utility Functions for AI Value Alignment

Nadisha-Marie Aliman^1, Leon Kester^2
^1 Utrecht University, Utrecht, Netherlands
^2 TNO Netherlands, The Hague, Netherlands
nadishamarie.aliman@gmail.com

Abstract

Being a complex subject of major importance in AI Safety research, value alignment has been studied from various perspectives in recent years. However, no final consensus on the design of ethical utility functions facilitating AI value alignment has been achieved yet. Given the urgency to identify systematic solutions, we postulate that it might be useful to start with the simple fact that for the utility function of an AI not to violate human ethical intuitions, it trivially has to be a model of these intuitions and reflect their variety – whereby the most accurate models of human entities, being biological organisms equipped with a brain constructing concepts like moral judgements, are scientific models. Thus, in order to better assess the variety of human morality, we perform a transdisciplinary analysis applying a security mindset to the issue and summarizing variety-relevant background knowledge from neuroscience and psychology. We complement this information by linking it to augmented utilitarianism as a suitable ethical framework. Based on that, we propose first practical guidelines for the design of approximate ethical goal functions that might better capture the variety of human moral judgements. Finally, we conclude and address possible future challenges.

1 Introduction

AI value alignment, the attempt to implement systems adhering to human ethical values, has been recognized as a highly relevant subtask of AI Safety at an international level and studied by multiple AI and AI Safety researchers across diverse research subareas [Hadfield-Menell et al., 2016; Soares and Fallenstein, 2017; Yudkowsky, 2016] (a review is provided in [Taylor et al., 2016]). Moreover, the need to investigate value alignment has been included in the Asilomar AI Principles [2018] with worldwide support from researchers in the field. While value alignment has often been tackled using reinforcement learning [Abel et al., 2016] (and also reward modeling [Leike et al., 2018]) or inverse reinforcement learning [Abbeel and Ng, 2004] methods, we focus on the approach to explicitly formulate cardinal ethical utility functions crafted by (a representation of) society and assisted by science and technology, which has been termed ethical goal functions [Aliman and Kester, 2019b; Werkhoven et al., 2018]. In order to be able to formulate utility functions that do not violate the ethical intuitions of most entities in a society, these ethical goal functions will have to be a model of human ethical intuitions. This simple but important insight can be derived from the good regulator theorem in cybernetics [Conant and Ross Ashby, 1970] stating that "every good regulator of a system must be a model of that system". We believe that instead of learning models of human intuitions in their apparent complexity and ambiguity, AI Safety research could also make use of the already available scientific knowledge on the nature of human moral judgements and ethical conceptions as made available e.g. by neuroscience and psychology. The human brain did not evolve to facilitate rational decision-making or the experience of emotions, but instead to fulfill the core task of allostasis (anticipating the needs of the body in an environment before they arise in order to ensure growth, survival and reproduction) [Barrett, 2017a; Kleckner et al., 2017]. Thereby, psychological functions such as cognition, emotion or moral judgements are closely linked to the predictive regulation of the physiological needs of the body [Kleckner et al., 2017], making it indispensable to consider the embodied nature of morality when aspiring to model it for AI value alignment.
For the purpose of facilitating the injection of requisite knowledge reflecting the variety of human morality into ethical goal functions, Section 2 provides information on the following variety-relevant aspects: 1) the essential role of affect and emotion in moral judgements from a modern constructionist neuroscience and cognitive science perspective, followed by 2) dyadic morality as a recent psychological theory on the nature of cognitive templates for moral judgements. In Section 3, we propose first guidelines on how to approximately formulate ethical goal functions using a recently proposed non-normative socio-technological ethical framework grounded in science called augmented utilitarianism [Aliman and Kester, 2019a], which might be useful to better incorporate the requisite variety of human ethical intuitions (especially in comparison to classical utilitarianism). Thereafter, we propose how to possibly validate these functions within a socio-technological feedback-loop [Aliman and Kester, 2019b]. Finally, in Section 4, we conclude and specify open challenges providing incentives for future work.
2 Variety in Embodied Morality

While value alignment is often seen as a safety problem, it is possible to interpret and reformulate it as a related security problem, which might offer a helpful different perspective on the subject emphasizing the need to capture the variety of embodied morality. One possible way to look at AI value alignment is to consider it as an attempt to achieve advanced AI systems exhibiting adversarial robustness against malicious adversaries attempting to lead the system to action(s) or output(s) that are perceived as violating human ethical intuitions. From an abstract point of view, one could distinguish different means by which an adversary might achieve successful attacks: e.g. 1) by fooling the AI at the perception level (in analogy to classical adversarial examples [Goodfellow, 2018], this variant has been denoted ethical adversarial examples [Aliman and Kester, 2019a]), which could lead to unethical behavior even if the utility function had been aligned with human ethical intuitions, or 2) simply by disclosing dangerous (certainly unintended by the designer) unethical implications encoded in its utility function by targeting specific mappings from perception to output or action (this could be understood as ethical adversarial examples on the utility function itself). While the existence of point 1) yields one more argument for the importance of research on adversarial robustness at the perception level for AI Safety reasons [Goodfellow, 2019], and a sophisticated combination of 1) and 2) might be thinkable, our exemplification focuses on adversarial attacks of type 2).
One could consider the explicitly formulated utility function U as representing a separate model [1] that, given a sample, outputs a value determining the perceived ethical desirability of that sample, which should ideally be in line with the society that crafted this utility function. An attacker who has the knowledge on human ethical intuitions at his disposal can attempt targeted misclassifications at the level of a single sample or at the level of an ordering of multiple samples, whereby the ground truth are the ethical intuitions of most people in a society. The Law of Requisite Variety from cybernetics [Ashby, 1961] states that "only variety can destroy variety"; in other words, in order to cope with a certain variety of problems or environmental variety, a system needs to exhibit a suitable and sufficient variety of responses. Figure 1 offers an intuitive explanation of this law. Transferring it to the mentioned utility function U, it is for instance conceivable that if U does not encode affective information that might lead to a difference in ethical evaluations, an attacker can easily craft a sample which U might misclassify as ethical or unethical, or cause U to generate a total ordering of samples that might appear unethical from the perspective of most people. Given that U does not have an influence on the variety of human morality, the only way to respond to the disturbances of the attacker and reduce the variety of possible undesirable outcomes is by increasing the own variety – which can be achieved by encoding more relevant knowledge.

Figure 1: Intuitive illustration for the Law of Requisite Variety. Taken from [Norman and Bar-Yam, 2018].

[1] A conceptually similar separation of objective function model and optimizing agent has recently been performed for reward modeling [Leike et al., 2018].
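As a purely illustrative sketch of the variety argument (our own toy example, not part of the framework; all parameter names and numbers are invented), the following Python snippet contrasts a low-variety utility model that ignores affective context with a richer one: two transitions that most perceivers would evaluate differently collapse to the same utility under the impoverished model, which is exactly the kind of gap an attacker crafting ethical adversarial examples could exploit.

```python
# Illustrative toy example (hypothetical parameters): a utility model that
# omits affect-related parameters cannot distinguish two transitions that most
# human perceivers would evaluate very differently.

# Two candidate transitions, identical in their material outcome but differing
# in the (perceiver-relevant) distress they cause.
transition_a = {"outcome_gain": 1.0, "distress_caused": 0.0}
transition_b = {"outcome_gain": 1.0, "distress_caused": 0.9}

def u_outcome_only(t):
    """Low-variety utility model: evaluates the outcome alone, like U(s')."""
    return t["outcome_gain"]

def u_with_affect(t, distress_weight=2.0):
    """Higher-variety utility model: additionally encodes an affective parameter."""
    return t["outcome_gain"] - distress_weight * t["distress_caused"]

# The impoverished model assigns both transitions the same utility, so an
# attacker can get transition_b selected although most perceivers would
# condemn it; the richer model separates the two cases.
assert u_outcome_only(transition_a) == u_outcome_only(transition_b)
assert u_with_affect(transition_a) > u_with_affect(transition_b)
```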
2.1 Role of Emotion and Affect in Morality

One fundamental and persistent misconception about human biology (which does not only affect the understanding of the nature of moral judgements) is the assumption that the brain incorporates a layered architecture in which a battle between emotion and cognition is given through the very anatomy of the "triune brain" [MacLean, 1990] exhibiting three hierarchical layers: a reptilian brain, on top of which an emotional animalistic paleomammalian limbic system is located, and a final rational neomammalian cognition layer implemented in the neocortex. This flawed view is not in accordance with neuroscientific evidence and understanding [Barrett, 2017a; Miller and Clark, 2018]. In fact, the assumed reactive and animalistic limbic regions in the brain are predictive (e.g. they send top-down predictions to more granular cortical regions), control the body as well as attention mechanisms while being the source of the brain's internal model of the body [Barrett and Simmons, 2015; Barrett, 2017b].

Emotion and cognition do not represent a dichotomy leading to a conflict in moral judgements [Helion and Pizarro, 2015]. Instead, the distinction between the experience of an instance of a concept as belonging to the category of emotions versus the category of cognition is grounded in the focus of attention of the brain [Barrett et al., 2015], whereby "the experience of cognition occurs when the brain foregrounds mental contents and processes" and "the experience of emotion occurs when, in relation to the current situation, the brain foregrounds bodily changes" [Hoemann and Barrett, 2019]. The mental phenomenon of actively and dynamically simulating different alternative scenarios (including anticipatory emotions) has also been termed conceptual consumption [Gilbert and Wilson, 2007] and plays a role in decision-making and moral reasoning. While emotions are discrete constructions of the human brain, core affect allows a low-dimensional experience of interoceptive sensations (the sensory array from within the body) and is a continuous property of consciousness with the dimensions of valence (pleasantness/unpleasantness) and arousal (activation/deactivation) [Kleckner et al., 2017]. It has been argued that core affect provides a basis for moral judgements in which different events are qualitatively compared to each other [Cabanac, 2002]. Like other constructed mental states, moral judgements involve domain-general brain processes which, simply put, combine 1) the interoceptive sensory array, 2) the exteroceptive sensory inputs from the environment and 3) past experience/knowledge for a goal-oriented situated conceptualization (as a tool for allostasis) [Oosterwijk et al., 2012]. From these key constituents of mental constructions one can extract the following: concepts (including morality) are perceiver-dependent and time-dependent. Thereby, affect (but not emotion [Cameron et al., 2015]) is a necessary ingredient of every moral judgement. More fundamentally, "the human brain is anatomically structured so that no decision or action can be free of interoception and affect" [Barrett, 2017a] – this includes any type of thoughts that seem to correspond to the folk terms of "rational" and "cold". Therefore, a utility function without affect-related parameters might not exhibit a sufficient variety and might lead to the violation of human ethical intuitions.

Morality cannot be separated from a model of the body, since the brain constructs the human perception of reality based on what seems of importance to the brain for the purpose of allostasis, which is inherently strongly linked to interoception [Barrett, 2017a]. Interestingly, even the imagination of future, not yet experienced events is facilitated through situated recombinations of sensory-motor and affective nature in a similar way as the simulation of actually experienced events [Addis, 2018]. To sum up, there is no battle between emotion and cognition in moral judgements. Moreover, there is also no specific moral faculty in the brain, since moral judgements are based on domain-general processes within which affect is always involved to a certain degree. One could obtain insufficient variety in dealing with an adversary crafting ethical adversarial examples on a utility model U if one ignores affective parameters. Further crucial parameters for ethical utility functions could be e.g. of cultural, social and socio-geographical nature.
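To make the constituents named above concrete, the following is a minimal, purely illustrative data-structure sketch (ours, not from the constructionist literature) of the ingredients a constructed mental state – and thus a moral judgement – draws on: a low-dimensional core affect (valence, arousal), exteroceptive context and prior experience. Field names are assumptions chosen to match the notation B_x, E_x and P_x introduced in Section 3.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class CoreAffect:
    """Low-dimensional readout of the interoceptive sensory array."""
    valence: float   # pleasantness (+) / unpleasantness (-)
    arousal: float   # activation (+) / deactivation (-)

@dataclass
class MentalStateIngredients:
    """Domain-general constituents combined in a situated conceptualization."""
    interoception: CoreAffect                                       # B_x, accessed via core affect
    exteroception: Dict[str, float] = field(default_factory=dict)   # E_x, environmental inputs
    prior_experience: List[str] = field(default_factory=list)       # P_x, memories/knowledge
```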
2.2 Variety through "Dyadicness"

The psychological theory of dyadic morality [Schein and Gray, 2018] posits that moral judgements are based on a fuzzy cognitive template and related to the perception of an intentional agent (iA) causing damage (d) to a vulnerable patient (vP), denoted iA --d--> vP. More precisely, the theory postulates that the perceived immorality of an act is related to the following three elements: norm violations, negative affect and, importantly, perceived harm. According to a study, the reaction times in describing an act as immoral predict the reaction times in categorizing the same act as harmful [Schein and Gray, 2015]. The combination of these basic constituents is suggested to lead to the emergence of a rich diversity of moral judgements [Gray et al., 2017]. Dyadicness is understood as a continuum predicting the condemnation of moral acts. The more a human entity perceives an intentional agent inflicting damage to a vulnerable patient, the more immoral this human perceives the act. As stated by Schein and Gray, the dyadic harm-based cognitive template "is rooted in innate and evolved processes of the human mind; it is also shaped by cultural learning, therefore allowing cultural pluralism". Importantly, the nature of this cognitive template reveals that moral judgements, besides being perceiver-dependent, might vary across diverse parameters, especially in relation to the perception of agent, act and patient in the outcome of the action. Further, the theory also foresees a possible time-dependency of moral judgements by introducing the concept of a dyadic loop, a feedback cycle resulting in an iterative polarization of moral judgements through social discussion modulating the perception of harm as time goes by. Overall, moral judgements are understood as constructions in the same way visual perception, cognition or emotion are constructed by the human mind. Similarly to the existence of variability in visual perception, variability in morality is the norm, which often leads to moral conflicts [Schein et al., 2016]. However, the understanding that humans share the same harm-based cognitive template for morality has been described as reflecting "cognitive unity in the variety of perceived harm" [Schein and Gray, 2018].

Analyzing the cognitive template of dyadic morality, one can deduce that human moral judgements do not only consider the outcome of an action as prioritized by consequentialist frameworks like classical utilitarianism, nor do they only consider the state of the agent which is in the focus of virtue ethics. Furthermore, as opposed to deontological ethics, the focus is not only on the nature of the performed action. The main implication for the design of utility functions that should ideally be aligned with human ethical values is that they might need to encode information on agent, action and patient as well as on the perceivers – especially with regard to the cultural background. This observation is fundamental as it indicates that one might have to depart from classical utilitarian utility functions U(s') which are formulated as total orders at the abstraction level of outcomes, i.e. states (of affairs) s'. In line with this insight is the context-sensitive and perceiver-dependent type of utility function considering agent, action and outcome which has recently been proposed within a novel ethical framework denoted augmented utilitarianism [Aliman and Kester, 2019a] (abbreviated with AU in the following). Reconsidering the dyadic morality template iA --d--> vP, it seems that in order to better capture the variety of human morality, utility functions – now transferring it to the perspective of AI systems – would need to be at least formulated at the abstraction level of a perceiver-dependent evaluation of a transition s --a--> s' leading from a state s to a state s' via an action a. We encode the required novel type of utility function with U_x(s, a, s'), with x denoting a specific perceiver. This formulation could enable an AI system implemented as utility maximizer to jointly consider parameters specified by a perceiver which are related to its perception of the agent, the action and the consequences of this action on a patient. Since the need to consider time-dependency has been formulated, one would consequently also require to add the time dimension to the arguments of the utility function, leading to U_x((s, a, s'), t).
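As a reading aid (a hedged sketch of ours, not code from the paper), the successively richer abstraction levels discussed above can be written down as function signatures; the class below is a hypothetical skeleton of a perceiver- and time-dependent utility function U_x((s, a, s'), t), with all type names being assumptions for illustration.

```python
from typing import Callable, Tuple

State = dict          # state of affairs s or s', e.g. a set of measurable parameters
Action = str          # action a taken by the agent
Transition = Tuple[State, Action, State]

# Classical utilitarian abstraction level: a total order over outcomes only.
UOutcome = Callable[[State], float]                 # corresponds to U(s')

# Abstraction level suggested by dyadic morality: perceiver-dependent
# evaluation of a transition, additionally indexed by time.
class PerceiverUtility:
    """Hypothetical skeleton for U_x((s, a, s'), t) of a specific perceiver x."""

    def __init__(self, perceiver_id: str):
        self.perceiver_id = perceiver_id

    def __call__(self, transition: Transition, t: float) -> float:
        s, a, s_prime = transition
        # A real instantiation would combine perceiver-specific parameters
        # (affective, dyadic, cultural, ...) as discussed in Section 3.
        raise NotImplementedError
```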
3 Approximating Ethical Goal Functions

While the psychological theory of dyadic morality was useful to estimate the abstraction level at which one would at least have to specify utility functions, the closer analysis of the nature of the construction of mental states performed in Section 2 abstractly provides a superset of primitive relevant parameters that might be critical elements of every moral judgement (being a mental state). Given a perceiver x, the components of this set are the following subsets: 1) parameters encoding the interoceptive sensory array B_x (from within the body) which are accessible to human consciousness via the low-dimensional core affect, 2) the exteroceptive sensory array E_x encoding information from the environment and 3) the prior experience P_x encoding memories. Moreover, these sets of parameters obviously vary in time. However, to simplify, it has been suggested within the mentioned AU framework that ethical goal functions will have to be updated regularly (leading to a so-called socio-technological feedback-loop [Aliman and Kester, 2019b]) in the same way as votes take place at regular intervals in a democracy. One could similarly assume that this regular update will be sufficient to reflect a relevant change in moral opinion and perception.

3.1 Injecting Requisite Variety in Utility

For simplicity, we assume that the sets of parameters B_x, P_x and E_x are invariant during the utility assignment process in which a perceiver x has to specify the ethical desirability of a transition s --a--> s' by mapping it to a cardinal value U_x(s, a, s') obtained by applying a not further specified type of scientifically determined transformation v_x (chosen by x) to the mental state of x. This results in the following naive and simplified mapping, which however adequately reflects the property of mental-state-dependency formulated in the AU framework (the required dependency of ethical utility functions on parameters of the own mental state function m_x in order to avoid perverse instantiation scenarios [Aliman and Kester, 2019a]):

U_x(s, a, s') = v_x(m_x((s, a, s'), B_x, P_x, E_x))    (1)
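To fix ideas, the following hedged Python sketch instantiates the naive mapping in equation (1) under strong simplifying assumptions of our own (a mental state reduced to a weighted feature combination, and v_x chosen as a bounded squashing transformation); it is meant only to show how B_x, P_x and E_x enter the utility assignment, not as the authors' implementation.

```python
import math
from typing import Dict, Tuple

State = Dict[str, float]
Transition = Tuple[State, str, State]

def mental_state_x(transition: Transition,
                   b_x: Dict[str, float],   # interoceptive parameters (core affect)
                   p_x: Dict[str, float],   # prior-experience-derived relevance weights
                   e_x: Dict[str, float]) -> float:
    """Toy m_x: combines transition features with B_x, P_x and E_x into a scalar."""
    s, a, s_prime = transition
    # Assumed feature: the change in each measurable parameter, weighted by how
    # relevant prior experience (p_x) marks it for this perceiver.
    appraisal = sum(p_x.get(k, 0.0) * (s_prime.get(k, 0.0) - s.get(k, 0.0))
                    for k in s_prime)
    # Assumed affective and contextual modulation.
    return appraisal + b_x.get("valence", 0.0) + e_x.get("context_bias", 0.0)

def v_x(m: float) -> float:
    """Toy v_x: squashes the mental-state readout to a cardinal value in (-1, 1)."""
    return math.tanh(m)

def u_x(transition: Transition, b_x, p_x, e_x) -> float:
    """Naive instantiation of equation (1): U_x(s, a, s') = v_x(m_x(...))."""
    return v_x(mental_state_x(transition, b_x, p_x, e_x))
```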
Conversely, the utility function of classical utilitarianism is only defined at the impersonal and context-independent abstraction level of U(s'), which has been argued to lead both to the perverse instantiation problem and to the repugnant conclusion and related impossibility theorems in population ethics for consequentialist frameworks, which do not apply to mental-state-dependent utility functions [Aliman and Kester, 2019a]. The idea to restrict human ethical utility functions to the consideration of outcomes of actions alone – ignoring affective parameters of the own current self – as practiced in classical utilitarianism, while later referring to the resulting total orders with emotionally connoted adjectives such as "repugnant" or "perverse", has been termed the perspectival fallacy of utility assignment [Aliman and Kester, 2019b].

The use of consequentialist utility functions affected by the impossibility theorems of Arrhenius [2000] has been justifiably identified by Eckersley [2018] as a safety risk if used in AI systems without more ado. It seems that the isolated consideration of outcomes of actions (for consequentialism), of actions (for deontological ethics) or of the involved agents (for virtue ethics) does not represent a good model of human ethical intuitions. It is conceivable that if a utility model U is defined as a utility function U(s'), the model cannot possibly exhibit a sufficient variety and might more likely violate human ethical intuitions than if it were implemented as a context-sensitive utility function U_x(s, a, s'). (Beyond that, it has been argued that consequentialism implies the rejection of "dispositions and emotions, such as regret, disappointment, guilt and resentment" from "rational" deliberation [Verbeek, 2001] and should i.a. for this reason be disentangled from the notion of rationality, for which it cannot represent a plausible requirement.)

It is noteworthy that in the context of reinforcement learning (e.g. in robotics) different types of reward functions are usually formulated, ranging from R(s') to R(s, a, s'). For the purpose of ethical utility functions for advanced AI systems in critical application fields, we postulate that one does not have the choice to specify the abstraction level of the utility function, since for instance U(s') might lead to safety risks. Christiano et al. [2017] considered the elicitation of human preferences on trajectory (state-action pair) segments of a reinforcement learning agent, i.a. realized by human feedback on short movies. For the purpose of utility elicitation in an AU framework exemplarily using a naive model as specified in equation (1), people would similarly have to assign utility to a movie representing a transition in the future (either in a mental mode or augmented by technology such as VR or AR [Aliman and Kester, 2019b]). However, it is obvious that this naive utility assignment would not scale in practice. Moreover, it has not yet been specified how to aggregate ethical goal functions at a societal level. In the following Subsection 3.2, we address these issues by proposing a practicable approximation of the utility function in (1) and a possible societal aggregation of this approximate solution.
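Purely as an illustration of the elicitation step described above (our assumption, not a protocol from the paper), a naive loop would present simulated transitions to a perceiver – e.g. as short movies or VR/AR experiences – and record the cardinal utilities they assign; the `present` and `rate` callables are hypothetical interfaces.

```python
from typing import Callable, Dict, List, Tuple

Transition = Tuple[dict, str, dict]

def elicit_utilities(transitions: List[Transition],
                     present: Callable[[Transition], None],
                     rate: Callable[[Transition], float]) -> Dict[int, float]:
    """Naive utility elicitation: show each simulated transition and record the
    perceiver's cardinal rating (does not scale, as noted in the text)."""
    ratings: Dict[int, float] = {}
    for idx, tr in enumerate(transitions):
        present(tr)              # assumed rendering of the transition for the perceiver
        ratings[idx] = rate(tr)  # assumed interface returning a cardinal value
    return ratings
```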
3.2 Approximation, Aggregation and Validation

So far, it has been stated throughout the paper that one has to adequately increase the variety of a utility function meant to be ethical in order to avoid violations of human ethical intuitions and vulnerability to attackers crafting ethical adversarial examples against the model. However, it is important to note that despite the negatively formulated motivation of the approach, the aim is to craft a utility model U which represents a better model of human ethical intuitions in general, thus ranging from samples that are perceived as highly unethical to those that are assigned a high ethical desirability. In order to craft practical solutions that lead to optimal results, it might be advantageous to perform a thought experiment imagining a utopia and from that impose practical constraints on its viability. It might not seem realistic to deliberate a future utopia 1 as a sustainable society which is stable across a very large time interval in which every human being acts according to the ethical intuitions of all humans including the own and every artificial intelligent system fulfills the ethical intuitions of all humans. However, it seems more likely that within a utopia 2, being a stable society in which every human achieves a high level of a scientific definition of well-being (such as e.g. PERMA [Seligman, 2012]) with artificial agents acting as to maximize context-sensitive utility according to which (human or artificial) agents promoting the (measurable) well-being of human patients is regarded as the most utile type of events, the ethical intuitions of humans might tend to get closer to each other. The reason being that the variety of human moral judgements might interestingly decrease, since it is conceivable that they will tend to exhibit more similar prior experiences (all imprinted by well-being) and have more similar environments (full of stable people with a high level of well-being). The main factor drawing differences could be the body – especially biological factors. However, the parameters related to interoception might be closer to each other, since all humans exhibit a high level of well-being, which classically includes frequent positive affect. It is conceivable that with time, such a society could converge towards utopia 1.

In the following, we will denote the mentioned utopia-related ideal cognitive template of a (human or artificial) agent A performing an act w that contributes to the well-being of a human patient P with A --w--> P, in analogy to the cognitive template of dyadic morality. (Thereby, A --w--> P is perceiver-dependent i.a. because psychological measures of well-being include subjective and self-reported elements such as e.g. life satisfaction or furthermore positive emotions [Seligman, 2012].) Augmented utilitarianism foresees the need to at least depict a final goal at the abstraction level of a perceiver-dependent function on a transition as reflected in U_x(s, a, s'). The ideal cognitive template A --w--> P formulated for utopia 2, by which it has been argued that a decrease in the variety of human morality might be achievable in the long term, exhibits an abstraction level that is compatible with U_x(s, a, s').

A thinkable strategy for the design of a utility model U that is robust against ethical adversarial examples and a model of human ethical intuitions is to try to adequately increase its variety using relevant scientific knowledge and to complementarily attempt to decrease the variety of human moral judgements, for instance by considering A --w--> P as high-level final goal such that the described utopia 2 ideally becomes a self-fulfilling prophecy. For it to be realizable in practice, we suggest that the appropriateness of a given aggregated societal ethical goal function could be approximately validated against its quantifiable impact on well-being for society across the time dimension. Since it seems however unfeasible to directly map all important transitions of a domain to their effect on the well-being of human entities, we propose to consider perceiver-specific and domain-specific utility functions indicating combined preferences that each perceiver x considers to be relevant for well-being from the viewpoint of x himself in that specific domain. For these combined utility functions to be grounded in science, they will have to be based on scientifically measurable parameters. We postulate that a possible aggregation at a societal level could be performed by the following steps: 1) agreement on a common validation measure of an ethical goal function (for instance the temporal development of societal satisfaction with AI systems in a certain domain or with future AGI systems, or their aptitude to contribute to sustainable well-being), 2) agreement on a superset O of scientifically measurable and relevant parameters (encoding e.g. affective, dyadic, cultural, social, political, socio-geographical but importantly also law-relevant information) that are considered as important across the whole society, 3) specification of personal utility functions for each member n of a society of N members allowing personalized and tailored combinations of a subset of O, 4) aggregation to a societal ethical goal function U_Total(s, a, s'). Taken together, these considerations lead us to the following possible approximation for an aggregated societal ethical goal function given a domain:

U_Total(s, a, s') = \sum_{n=1}^{N} \sum_{i=1}^{j} w_{in} f_{f_{ni}}(C_i)    (2)

with N standing for the number of participating entities in society, C_i = (p_{i1}, p_{i2}, ..., p_{im}) being a cluster of m >= 1 correlated parameters (whereby independent factors are assigned an own cluster each) and f representing a set of preference functions (form functions). For instance f = {f_1, f_2, ..., f_f}, where f_1 could be a linear transformation, f_2 a concave and f_3 a convex preference function and so on. Each entity n assigns a weight w_{in} to a form function f_{f_{ni}} applied to a cluster of parameters C_i, whereby \sum_{i=1}^{j} w_{in} = 1. We define O = {C_1, C_2, ...} as the superset of all parameters considered in the overall aggregated utility function. Moreover, a ∈ A with A representing the foreseen discrete action space at the disposal of the AI. (It is important to note that while the AI could directly perform actions in the environment, it could also be used for policy-making and provide plans for human agents.) Further, we consider a continuous state space with the states s and s' ∈ S = R^|O|. Other aspects including e.g. legal rules and norms on the action space can be imposed as constraints on the utility function. In a nutshell, the utility aggregation process can be understood as a voting process in which each participating individual n distributes his vote across scientifically measurable clusters of parameters C_i on which he applies a preference function f_{f_{ni}} to which weights w_{in} are assigned, as identified as relevant by n given a to-be-approximated high-level societal goal (such as A --w--> P). In short, people do not have to agree on personal preferences and weightings, but only on a superset of acceptable parameters, an aggregation method and an overall validation measure. (Note that instead of involving society as a whole for each domain, the utility elicitation procedure can as well be approximated by a transdisciplinary set of representative experts (e.g. from the legislative) crafting expert ethical goal functions that attempt to ideally emulate U_Total(s, a, s').)
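The aggregation in equation (2) can be read as a weighted voting scheme. The following hedged Python sketch (our own simplification, with invented parameter clusters, form functions and values) shows the mechanics: each member distributes normalized weights over clusters of measurable parameters, picks a preference (form) function per cluster, and the societal function sums the resulting personal utilities.

```python
import math
from typing import Callable, Dict, List

# A cluster C_i is reduced here to a single scalar score extracted from a
# transition (s, a, s'); in practice it would bundle m >= 1 correlated,
# scientifically measurable parameters.
ClusterScore = Dict[str, float]   # e.g. {"wellbeing": 0.6, "harm_avoidance": 0.2}

# A small set of form (preference) functions f = {f_1, f_2, ...}.
FORM_FUNCTIONS: Dict[str, Callable[[float], float]] = {
    "linear":  lambda p: p,
    "concave": lambda p: math.sqrt(max(p, 0.0)),
    "convex":  lambda p: p * p,
}

class Member:
    """One participating entity n: per-cluster weights w_in (normalized to sum
    to 1) and a chosen form function f_{f_ni} per cluster."""

    def __init__(self, weights: Dict[str, float], forms: Dict[str, str]):
        total = sum(weights.values())
        self.weights = {k: w / total for k, w in weights.items()}  # enforce sum = 1
        self.forms = forms

    def utility(self, clusters: ClusterScore) -> float:
        return sum(w * FORM_FUNCTIONS[self.forms[c]](clusters[c])
                   for c, w in self.weights.items())

def u_total(clusters: ClusterScore, society: List[Member]) -> float:
    """Aggregated societal ethical goal function, in the spirit of equation (2)."""
    return sum(member.utility(clusters) for member in society)

# Usage sketch with invented values for two members and three clusters.
society = [
    Member({"wellbeing": 0.7, "harm_avoidance": 0.3},
           {"wellbeing": "concave", "harm_avoidance": "linear"}),
    Member({"wellbeing": 0.5, "lawfulness": 0.5},
           {"wellbeing": "linear", "lawfulness": "convex"}),
]
example_clusters = {"wellbeing": 0.6, "harm_avoidance": 0.2, "lawfulness": 0.9}
print(u_total(example_clusters, society))
```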
Finally, it is important to note that the societal ethical goal function specified in (2) will need to be updated (and evaluated) at regular intervals due to the mental-state-dependency of utility entailing time-dependency [Aliman and Kester, 2019a]. This leads to the necessity of a socio-technological feedback-loop which might concurrently offer the possibility of a dynamical ethical enhancement [Aliman and Kester, 2019b; Werkhoven et al., 2018]. Pre-deployment, one could in the future attempt a validation via selected preemptive simulations [Aliman and Kester, 2019b] in which (a representation of) society experiences simulations of future events (s, a, s') as movies, immersive audio-stories or later in VR and AR environments. During these experiences, one could approximately measure the temporal profile of the so-called artificially simulated future instant utility [Aliman and Kester, 2019b], denoted U_TotalAS, being a potential constituent of future well-being. Thereby, U_TotalAS refers to the instant utility [Kahneman et al., 1997] experienced during a technology-aided simulation of a future event, whereby instant utility refers to the affective dimension of valence at a certain time t. The temporal integral that a measure of U_TotalAS could approximate is specified as:

U_TotalAS(s, a, s') ≈ \sum_{n=1}^{N} \int_{t_0}^{T} I_n(t) dt    (3)

with t_0 referring to the starting point of experiencing the simulation of the event (s, a, s') augmented by technology (movie, audio-story, AR, VR) and T the end of this experience. I_n(t) represents the valence dimension of core affect experienced by n at time t. Finally, post-deployment, the ethical goal function of an AI system can be validated using the validation measure agreed upon before utility aggregation (such as the temporal development of societal-level satisfaction with an AI system, well-being or even the perception of dyadicness) that has to be a priori determined.
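As a hedged numerical illustration of equation (3) (our own sketch, with invented sampling details), the integral of each participant's valence trace over the simulated experience can be approximated from discretely sampled core-affect readings, e.g. with the trapezoidal rule:

```python
from typing import List, Sequence, Tuple

# One valence trace per participant: (time in seconds, valence in [-1, 1]) samples
# recorded while experiencing the technology-aided simulation of (s, a, s').
ValenceTrace = Sequence[Tuple[float, float]]

def trapezoid(trace: ValenceTrace) -> float:
    """Approximate the integral of I_n(t) from t_0 to T for one participant."""
    total = 0.0
    for (t_prev, v_prev), (t_next, v_next) in zip(trace, trace[1:]):
        total += 0.5 * (v_prev + v_next) * (t_next - t_prev)
    return total

def u_total_as(traces: List[ValenceTrace]) -> float:
    """Sum over participants, in the spirit of equation (3)."""
    return sum(trapezoid(trace) for trace in traces)

# Usage sketch with invented valence samples for two participants.
traces = [
    [(0.0, 0.1), (1.0, 0.4), (2.0, 0.2)],
    [(0.0, -0.2), (1.0, 0.0), (2.0, 0.3)],
]
print(u_total_as(traces))
```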
4 Conclusion and Future Work

In this paper, we motivated the need in AI value alignment to attempt to model utility functions capturing the variety of human moral judgements through the integration of relevant scientific knowledge – especially from neuroscience and psychology – (instead of learning) in order to avoid violations of human ethical intuitions. We reformulated value alignment as a security task and introduced the requirement to increase the variety within classical utility functions, positing that a utility function which does not integrate affective and perceiver-dependent dyadic information does not exhibit sufficient variety and might not exhibit robustness against corresponding adversaries. Using augmented utilitarianism as a suitable non-normative ethical framework, we proposed a methodology to implement and possibly validate societal perceiver-dependent ethical goal functions with the goal to better incorporate the requisite variety for AI value alignment.

In future work, one could extend and refine the discussed methodology, study a more systematic validation approach for ethical goal functions and perform first experimental studies. Moreover, the "security of the utility function itself is essential, due to the possibility of its modification by malevolent actors during the deployment phase" [Aliman and Kester, 2019a]. For this purpose, a blockchain-based solution might be advantageous. In addition, it is important to note that even with utility functions exhibiting a sufficient variety for AI value alignment, it might still be possible for a malicious attacker to craft adversarial examples against a utility maximizer at the perception level which might lead to unethical behavior. Besides that, one might first need to perform policy-by-simulation [Werkhoven et al., 2018] prior to the deployment of advanced AI systems equipped with ethical goal functions for safety reasons. Last but not least, the usage of ethical goal functions might represent an interesting approach to the AI coordination subtask in AI Safety, since an international use of this method might contribute to reducing the AI race to the problem-solving-ability dimension [Aliman and Kester, 2019b].

References

[Abbeel and Ng, 2004] Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, page 1. ACM, 2004.

[Abel et al., 2016] David Abel, James MacGlashan, and Michael L. Littman. Reinforcement learning as a framework for ethical decision making. In Workshops at the Thirtieth AAAI Conference on Artificial Intelligence, 2016.

[Addis, 2018] Donna Rose Addis. Are episodic memories special? On the sameness of remembered and imagined event simulation. Journal of the Royal Society of New Zealand, 48(2-3):64–88, 2018.

[Aliman and Kester, 2019a] Nadisha-Marie Aliman and Leon Kester. Augmented Utilitarianism for AGI Safety. In International Conference on Artificial General Intelligence, to appear. Springer, 2019.

[Aliman and Kester, 2019b] Nadisha-Marie Aliman and Leon Kester. Transformative AI Governance and AI-Empowered Ethical Enhancement Through Preemptive Simulations. Delphi – Interdisciplinary Review of Emerging Technologies, 2(1):23–29, 2019.

[Arrhenius, 2000] Gustaf Arrhenius. An impossibility theorem for welfarist axiologies. Economics & Philosophy, 16(2):247–266, 2000.

[Ashby, 1961] W. Ross Ashby. An Introduction to Cybernetics. Chapman & Hall Ltd, 1961.

[Asilomar, 2018] Asilomar AI Principles. Principles developed in conjunction with the 2017 Asilomar conference (Benevolent AI 2017), 2018.

[Barrett and Simmons, 2015] Lisa Feldman Barrett and W. Kyle Simmons. Interoceptive predictions in the brain. Nature Reviews Neuroscience, 16(7):419, 2015.

[Barrett et al., 2015] Lisa Feldman Barrett, Christine D. Wilson-Mendenhall, and Lawrence W. Barsalou. The conceptual act theory: A roadmap. Pages 83–110, 2015.

[Barrett, 2017a] Lisa Feldman Barrett. How Emotions Are Made: The Secret Life of the Brain. Houghton Mifflin Harcourt, 2017.

[Barrett, 2017b] Lisa Feldman Barrett. The theory of constructed emotion: an active inference account of interoception and categorization. Social Cognitive and Affective Neuroscience, 12(1):1–23, 2017.

[Cabanac, 2002] Michel Cabanac. What is emotion? Behavioural Processes, 60(2):69–83, 2002.

[Cameron et al., 2015] C. Daryl Cameron, Kristen A. Lindquist, and Kurt Gray. A constructionist review of morality and emotions: No evidence for specific links between moral content and discrete emotions. Personality and Social Psychology Review, 19(4):371–394, 2015.
[Christiano et al., 2017] Paul F. Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, pages 4299–4307, 2017.

[Conant and Ross Ashby, 1970] Roger C. Conant and W. Ross Ashby. Every good regulator of a system must be a model of that system. International Journal of Systems Science, 1(2):89–97, 1970.

[Eckersley, 2018] Peter Eckersley. Impossibility and Uncertainty Theorems in AI Value Alignment (or why your AGI should not have a utility function). arXiv preprint arXiv:1901.00064, 2018.

[Gilbert and Wilson, 2007] Daniel T. Gilbert and Timothy D. Wilson. Prospection: Experiencing the future. Science, 317(5843):1351–1354, 2007.

[Goodfellow, 2018] Ian Goodfellow. Defense Against the Dark Arts: An overview of adversarial example security research and future research directions. arXiv preprint arXiv:1806.04169, 2018.

[Goodfellow, 2019] Ian Goodfellow. Adversarial Robustness for AI Safety. https://safeai.webs.upv.es/wp-content/uploads/2019/02/2019-01-27-goodfellow.pdf, 2019.

[Gray et al., 2017] Kurt Gray, Chelsea Schein, and C. Daryl Cameron. How to think about emotion and morality: circles, not arrows. Current Opinion in Psychology, 17:41–46, 2017.

[Hadfield-Menell et al., 2016] Dylan Hadfield-Menell, Stuart J. Russell, Pieter Abbeel, and Anca Dragan. Cooperative inverse reinforcement learning. In Advances in Neural Information Processing Systems, pages 3909–3917, 2016.

[Helion and Pizarro, 2015] Chelsea Helion and David A. Pizarro. Beyond dual-processes: the interplay of reason and emotion in moral judgment. Handbook of Neuroethics, pages 109–125, 2015.

[Hoemann and Barrett, 2019] Katie Hoemann and Lisa Feldman Barrett. Concepts dissolve artificial boundaries in the study of emotion and cognition, uniting body, brain, and mind. Cognition and Emotion, 33(1):67–76, 2019. PMID: 30336722.

[Kahneman et al., 1997] Daniel Kahneman, Peter P. Wakker, and Rakesh Sarin. Back to Bentham? Explorations of experienced utility. The Quarterly Journal of Economics, 112(2):375–406, 1997.

[Kleckner et al., 2017] Ian R. Kleckner, Jiahe Zhang, Alexandra Touroutoglou, Lorena Chanes, Chenjie Xia, W. Kyle Simmons, Karen S. Quigley, Bradford C. Dickerson, and Lisa Feldman Barrett. Evidence for a large-scale brain system supporting allostasis and interoception in humans. Nature Human Behaviour, 1(5):0069, 2017.

[Leike et al., 2018] Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, and Shane Legg. Scalable agent alignment via reward modeling: a research direction. arXiv preprint arXiv:1811.07871, 2018.

[MacLean, 1990] Paul D. MacLean. The Triune Brain in Evolution: Role in Paleocerebral Functions. Springer Science & Business Media, 1990.
[Miller and Clark, 2018] Mark Miller and Andy Clark. Happily entangled: prediction, emotion, and the embodied mind. Synthese, 195(6):2559–2575, 2018.

[Norman and Bar-Yam, 2018] Joseph Norman and Yaneer Bar-Yam. Special Operations Forces: A Global Immune System? In International Conference on Complex Systems, pages 486–498. Springer, 2018.

[Oosterwijk et al., 2012] Suzanne Oosterwijk, Kristen A. Lindquist, Eric Anderson, Rebecca Dautoff, Yoshiya Moriguchi, and Lisa Feldman Barrett. States of mind: Emotions, body feelings, and thoughts share distributed neural networks. NeuroImage, 62(3):2110–2128, 2012.

[Schein and Gray, 2015] Chelsea Schein and Kurt Gray. The unifying moral dyad: Liberals and conservatives share the same harm-based moral template. Personality and Social Psychology Bulletin, 41(8):1147–1163, 2015.

[Schein and Gray, 2018] Chelsea Schein and Kurt Gray. The theory of dyadic morality: Reinventing moral judgment by redefining harm. Personality and Social Psychology Review, 22(1):32–70, 2018.

[Schein et al., 2016] Chelsea Schein, Neil Hester, and Kurt Gray. The visual guide to morality: Vision as an integrative analogy for moral experience, variability and mechanism. Social and Personality Psychology Compass, 10(4):231–251, 2016.

[Seligman, 2012] Martin E. P. Seligman. Flourish: A Visionary New Understanding of Happiness and Well-being. Simon and Schuster, 2012.

[Soares and Fallenstein, 2017] Nate Soares and Benya Fallenstein. Agent foundations for aligning machine intelligence with human interests: a technical research agenda. In The Technological Singularity, pages 103–125. Springer, 2017.

[Taylor et al., 2016] Jessica Taylor, Eliezer Yudkowsky, Patrick LaVictoire, and Andrew Critch. Alignment for advanced machine learning systems. Machine Intelligence Research Institute, 2016.

[Verbeek, 2001] Bruno Verbeek. Consequentialism, rationality and the relevant description of outcomes. Economics & Philosophy, 17(2):181–205, 2001.

[Werkhoven et al., 2018] Peter Werkhoven, Leon Kester, and Mark Neerincx. Telling autonomous systems what to do. In Proceedings of the 36th European Conference on Cognitive Ergonomics, page 2. ACM, 2018.

[Yudkowsky, 2016] Eliezer Yudkowsky. The AI Alignment Problem: Why it is Hard, and Where to Start. Symbolic Systems Distinguished Speaker, 2016.