Requisite Variety in Ethical Utility Functions for AI Value Alignment

Nadisha-Marie Aliman^1, Leon Kester^2
^1 Utrecht University, Utrecht, Netherlands
^2 TNO Netherlands, The Hague, Netherlands
nadishamarie.aliman@gmail.com

Abstract

Being a complex subject of major importance in AI Safety research, value alignment has been studied from various perspectives in recent years. However, no final consensus on the design of ethical utility functions facilitating AI value alignment has been achieved yet. Given the urgency to identify systematic solutions, we postulate that it might be useful to start with the simple fact that for the utility function of an AI not to violate human ethical intuitions, it trivially has to be a model of these intuitions and reflect their variety – whereby the most accurate models of human entities, being biological organisms equipped with a brain constructing concepts like moral judgements, are scientific models. Thus, in order to better assess the variety of human morality, we perform a transdisciplinary analysis applying a security mindset to the issue and summarizing variety-relevant background knowledge from neuroscience and psychology. We complement this information by linking it to augmented utilitarianism as a suitable ethical framework. Based on that, we propose first practical guidelines for the design of approximate ethical goal functions that might better capture the variety of human moral judgements. Finally, we conclude and address possible future challenges.

1 Introduction

AI value alignment, the attempt to implement systems adhering to human ethical values, has been recognized as a highly relevant subtask of AI Safety at an international level and studied by multiple AI and AI Safety researchers across diverse research subareas [Hadfield-Menell et al., 2016; Soares and Fallenstein, 2017; Yudkowsky, 2016] (a review is provided in [Taylor et al., 2016]). Moreover, the need to investigate value alignment has been included in the Asilomar AI Principles [2018] with worldwide support from researchers in the field. While value alignment has often been tackled using reinforcement learning [Abel et al., 2016] (and also reward modeling [Leike et al., 2018]) or inverse reinforcement learning [Abbeel and Ng, 2004] methods, we focus on the approach to explicitly formulate cardinal ethical utility functions crafted by (a representation of) society and assisted by science and technology, which has been termed ethical goal functions [Aliman and Kester, 2019b; Werkhoven et al., 2018]. In order to be able to formulate utility functions that do not violate the ethical intuitions of most entities in a society, these ethical goal functions will have to be a model of human ethical intuitions. This simple but important insight can be derived from the good regulator theorem in cybernetics [Conant and Ross Ashby, 1970] stating that "every good regulator of a system must be a model of that system". We believe that instead of learning models of human intuitions in their apparent complexity and ambiguity, AI Safety research could also make use of the already available scientific knowledge on the nature of human moral judgements and ethical conceptions as made available e.g. by neuroscience and psychology. The human brain did not evolve to facilitate rational decision-making or the experience of emotions, but instead to fulfill the core task of allostasis (anticipating the needs of the body in an environment before they arise in order to ensure growth, survival and reproduction) [Barrett, 2017a; Kleckner et al., 2017]. Thereby, psychological functions such as cognition, emotion or moral judgements are closely linked to the predictive regulation of the physiological needs of the body [Kleckner et al., 2017], making it indispensable to consider the embodied nature of morality when aspiring to model it for AI value alignment.
For the purpose of facilitating the injection of requisite knowledge reflecting the variety of human morality into ethical goal functions, Section 2 provides information on the following variety-relevant aspects: 1) the essential role of affect and emotion in moral judgements from a modern constructionist neuroscience and cognitive science perspective, followed by 2) dyadic morality as a recent psychological theory on the nature of cognitive templates for moral judgements. In Section 3, we propose first guidelines on how to approximately formulate ethical goal functions using a recently proposed non-normative socio-technological ethical framework grounded in science called augmented utilitarianism [Aliman and Kester, 2019a], which might be useful to better incorporate the requisite variety of human ethical intuitions (especially in comparison to classical utilitarianism). Thereafter, we propose how to possibly validate these functions within a socio-technological feedback-loop [Aliman and Kester, 2019b]. Finally, in Section 4, we conclude and specify open challenges providing incentives for future work.
2 Variety in Embodied Morality

While value alignment is often seen as a safety problem, it is possible to interpret and reformulate it as a related security problem, which might offer a helpful different perspective on the subject emphasizing the need to capture the variety of embodied morality. One possible way to look at AI value alignment is to consider it as an attempt to achieve advanced AI systems exhibiting adversarial robustness against malicious adversaries attempting to lead the system to action(s) or output(s) that are perceived as violating human ethical intuitions. From an abstract point of view, one could distinguish different means by which an adversary might achieve successful attacks: e.g. 1) by fooling the AI at the perception level (in analogy to classical adversarial examples [Goodfellow, 2018], this variant has been denoted ethical adversarial examples [Aliman and Kester, 2019a]), which could lead to unethical behavior even if the utility function had been aligned with human ethical intuitions, or 2) simply by disclosing dangerous (certainly unintended by the designer) unethical implications encoded in its utility function by targeting specific mappings from perception to output or action (this could be understood as ethical adversarial examples on the utility function itself). While the existence of point 1) yields one more argument for the importance of research on adversarial robustness at the perception level for AI Safety reasons [Goodfellow, 2019], and a sophisticated combination of 1) and 2) might be thinkable, our exemplification focuses on adversarial attacks of type 2).
One could consider the explicitly formulated utility function U as representing a separate model [1] that, given a sample, outputs a value determining the perceived ethical desirability of that sample, which should ideally be in line with the society that crafted this utility function. An attacker who has the knowledge on human ethical intuitions at his disposal can attempt targeted misclassifications at the level of a single sample or at the level of an ordering of multiple samples, whereby the ground truth are the ethical intuitions of most people in a society. The Law of Requisite Variety from cybernetics [Ashby, 1961] states that "only variety can destroy variety"; in other words, in order to cope with a certain variety of problems or environmental variety, a system needs to exhibit a suitable and sufficient variety of responses. Figure 1 offers an intuitive explanation of this law. Transferring it to the mentioned utility function U, it is for instance conceivable that if U does not encode affective information that might lead to a difference in ethical evaluations, an attacker can easily craft a sample which U might misclassify as ethical or unethical, or cause U to generate a total ordering of samples that might appear unethical from the perspective of most people. Given that U does not have an influence on the variety of human morality, the only way to respond to the disturbances of the attacker and reduce the variety of possible undesirable outcomes is by increasing the own variety – which can be achieved by encoding more relevant knowledge.

Figure 1: Intuitive illustration for the Law of Requisite Variety. Taken from [Norman and Bar-Yam, 2018].

[1] A conceptually similar separation of objective function model and optimizing agent has recently been performed for reward modeling [Leike et al., 2018].
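As a purely illustrative sketch of the variety argument (our own toy example, not part of the framework; all parameter names and numbers are invented), the following Python snippet contrasts a low-variety utility model that ignores affective context with a richer one: two transitions that most perceivers would evaluate differently collapse to the same utility under the impoverished model, which is exactly the kind of gap an attacker crafting ethical adversarial examples could exploit.

```python
# Illustrative toy example (hypothetical parameters): a utility model that
# omits affect-related parameters cannot distinguish two transitions that most
# human perceivers would evaluate very differently.

# Two candidate transitions, identical in their material outcome but differing
# in the (perceiver-relevant) distress they cause.
transition_a = {"outcome_gain": 1.0, "distress_caused": 0.0}
transition_b = {"outcome_gain": 1.0, "distress_caused": 0.9}

def u_outcome_only(t):
    """Low-variety utility model: evaluates the outcome alone, like U(s')."""
    return t["outcome_gain"]

def u_with_affect(t, distress_weight=2.0):
    """Higher-variety utility model: additionally encodes an affective parameter."""
    return t["outcome_gain"] - distress_weight * t["distress_caused"]

# The impoverished model assigns both transitions the same utility, so an
# attacker can get transition_b selected although most perceivers would
# condemn it; the richer model separates the two cases.
assert u_outcome_only(transition_a) == u_outcome_only(transition_b)
assert u_with_affect(transition_a) > u_with_affect(transition_b)
```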
2.1 Role of Emotion and Affect in Morality

One fundamental and persistent misconception about human biology (which does not only affect the understanding of the nature of moral judgements) is the assumption that the brain incorporates a layered architecture in which a battle between emotion and cognition is given through the very anatomy of the "triune brain" [MacLean, 1990] exhibiting three hierarchical layers: a reptilian brain, on top of which an emotional animalistic paleomammalian limbic system is located, and a final rational neomammalian cognition layer implemented in the neocortex. This flawed view is not in accordance with neuroscientific evidence and understanding [Barrett, 2017a; Miller and Clark, 2018]. In fact, the assumed reactive and animalistic limbic regions in the brain are predictive (e.g. they send top-down predictions to more granular cortical regions), control the body as well as attention mechanisms while being the source of the brain's internal model of the body [Barrett and Simmons, 2015; Barrett, 2017b].

Emotion and cognition do not represent a dichotomy leading to a conflict in moral judgements [Helion and Pizarro, 2015]. Instead, the distinction between the experience of an instance of a concept as belonging to the category of emotions versus the category of cognition is grounded in the focus of attention of the brain [Barrett et al., 2015], whereby "the experience of cognition occurs when the brain foregrounds mental contents and processes" and "the experience of emotion occurs when, in relation to the current situation, the brain foregrounds bodily changes" [Hoemann and Barrett, 2019]. The mental phenomenon of actively and dynamically simulating different alternative scenarios (including anticipatory emotions) has also been termed conceptual consumption [Gilbert and Wilson, 2007] and plays a role in decision-making and moral reasoning. While emotions are discrete constructions of the human brain, core affect allows a low-dimensional experience of interoceptive sensations (the sensory array from within the body) and is a continuous property of consciousness with the dimensions of valence (pleasantness/unpleasantness) and arousal (activation/deactivation) [Kleckner et al., 2017]. It has been argued that core affect provides a basis for moral judgements in which different events are qualitatively compared to each other [Cabanac, 2002]. Like other constructed mental states, moral judgements involve domain-general brain processes which, simply put, combine 1) the interoceptive sensory array, 2) the exteroceptive sensory inputs from the environment and 3) past experience/knowledge for a goal-oriented situated conceptualization (as a tool for allostasis) [Oosterwijk et al., 2012]. From these key constituents of mental constructions one can extract the following: concepts (including morality) are perceiver-dependent and time-dependent. Thereby, affect (but not emotion [Cameron et al., 2015]) is a necessary ingredient of every moral judgement. More fundamentally, "the human brain is anatomically structured so that no decision or action can be free of interoception and affect" [Barrett, 2017a] – this includes any type of thoughts that seem to correspond to the folk terms of "rational" and "cold". Therefore, a utility function without affect-related parameters might not exhibit a sufficient variety and might lead to the violation of human ethical intuitions.

Morality cannot be separated from a model of the body, since the brain constructs the human perception of reality based on what seems of importance to the brain for the purpose of allostasis, which is inherently strongly linked to interoception [Barrett, 2017a]. Interestingly, even the imagination of future, not yet experienced events is facilitated through situated recombinations of sensory-motor and affective nature in a similar way as the simulation of actually experienced events [Addis, 2018]. To sum up, there is no battle between emotion and cognition in moral judgements. Moreover, there is also no specific moral faculty in the brain, since moral judgements are based on domain-general processes within which affect is always involved to a certain degree. One could obtain insufficient variety in dealing with an adversary crafting ethical adversarial examples on a utility model U if one ignores affective parameters. Further crucial parameters for ethical utility functions could be e.g. of cultural, social and socio-geographical nature.
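To make the constituents named above concrete, the following is a minimal, purely illustrative data-structure sketch (ours, not from the constructionist literature) of the ingredients a constructed mental state – and thus a moral judgement – draws on: a low-dimensional core affect (valence, arousal), exteroceptive context and prior experience. Field names are assumptions chosen to match the notation B_x, E_x and P_x introduced in Section 3.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class CoreAffect:
    """Low-dimensional readout of the interoceptive sensory array."""
    valence: float   # pleasantness (+) / unpleasantness (-)
    arousal: float   # activation (+) / deactivation (-)

@dataclass
class MentalStateIngredients:
    """Domain-general constituents combined in a situated conceptualization."""
    interoception: CoreAffect                                       # B_x, accessed via core affect
    exteroception: Dict[str, float] = field(default_factory=dict)   # E_x, environmental inputs
    prior_experience: List[str] = field(default_factory=list)       # P_x, memories/knowledge
```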
2.2 Variety through "Dyadicness"

The psychological theory of dyadic morality [Schein and Gray, 2018] posits that moral judgements are based on a fuzzy cognitive template and related to the perception of an intentional agent (iA) causing damage (d) to a vulnerable patient (vP), denoted iA --d--> vP. More precisely, the theory postulates that the perceived immorality of an act is related to the following three elements: norm violations, negative affect and, importantly, perceived harm. According to a study, the reaction times in describing an act as immoral predict the reaction times in categorizing the same act as harmful [Schein and Gray, 2015]. The combination of these basic constituents is suggested to lead to the emergence of a rich diversity of moral judgements [Gray et al., 2017]. Dyadicness is understood as a continuum predicting the condemnation of moral acts. The more a human entity perceives an intentional agent inflicting damage to a vulnerable patient, the more immoral this human perceives the act. As stated by Schein and Gray, the dyadic harm-based cognitive template "is rooted in innate and evolved processes of the human mind; it is also shaped by cultural learning, therefore allowing cultural pluralism". Importantly, the nature of this cognitive template reveals that moral judgements, besides being perceiver-dependent, might vary across diverse parameters, especially in relation to the perception of agent, act and patient in the outcome of the action. Further, the theory also foresees a possible time-dependency of moral judgements by introducing the concept of a dyadic loop, a feedback cycle resulting in an iterative polarization of moral judgements through social discussion modulating the perception of harm as time goes by. Overall, moral judgements are understood as constructions in the same way visual perception, cognition or emotion are constructed by the human mind. Similarly to the existence of variability in visual perception, variability in morality is the norm, which often leads to moral conflicts [Schein et al., 2016]. However, the understanding that humans share the same harm-based cognitive template for morality has been described as reflecting "cognitive unity in the variety of perceived harm" [Schein and Gray, 2018].

Analyzing the cognitive template of dyadic morality, one can deduce that human moral judgements do not only consider the outcome of an action as prioritized by consequentialist frameworks like classical utilitarianism, nor do they only consider the state of the agent which is in the focus of virtue ethics. Furthermore, as opposed to deontological ethics, the focus is not only on the nature of the performed action. The main implication for the design of utility functions that should ideally be aligned with human ethical values is that they might need to encode information on agent, action and patient as well as on the perceivers – especially with regard to the cultural background. This observation is fundamental as it indicates that one might have to depart from classical utilitarian utility functions U(s') which are formulated as total orders at the abstraction level of outcomes, i.e. states (of affairs) s'. In line with this insight is the context-sensitive and perceiver-dependent type of utility function considering agent, action and outcome which has recently been proposed within a novel ethical framework denoted augmented utilitarianism [Aliman and Kester, 2019a] (abbreviated with AU in the following). Reconsidering the dyadic morality template iA --d--> vP, it seems that in order to better capture the variety of human morality, utility functions – now transferring it to the perspective of AI systems – would need to be at least formulated at the abstraction level of a perceiver-dependent evaluation of a transition s --a--> s' leading from a state s to a state s' via an action a. We encode the required novel type of utility function with U_x(s, a, s'), with x denoting a specific perceiver. This formulation could enable an AI system implemented as utility maximizer to jointly consider parameters specified by a perceiver which are related to its perception of the agent, the action and the consequences of this action on a patient. Since the need to consider time-dependency has been formulated, one would consequently also require to add the time dimension to the arguments of the utility function, leading to U_x((s, a, s'), t).
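As a reading aid (a hedged sketch of ours, not code from the paper), the successively richer abstraction levels discussed above can be written down as function signatures; the class below is a hypothetical skeleton of a perceiver- and time-dependent utility function U_x((s, a, s'), t), with all type names being assumptions for illustration.

```python
from typing import Callable, Tuple

State = dict          # state of affairs s or s', e.g. a set of measurable parameters
Action = str          # action a taken by the agent
Transition = Tuple[State, Action, State]

# Classical utilitarian abstraction level: a total order over outcomes only.
UOutcome = Callable[[State], float]                 # corresponds to U(s')

# Abstraction level suggested by dyadic morality: perceiver-dependent
# evaluation of a transition, additionally indexed by time.
class PerceiverUtility:
    """Hypothetical skeleton for U_x((s, a, s'), t) of a specific perceiver x."""

    def __init__(self, perceiver_id: str):
        self.perceiver_id = perceiver_id

    def __call__(self, transition: Transition, t: float) -> float:
        s, a, s_prime = transition
        # A real instantiation would combine perceiver-specific parameters
        # (affective, dyadic, cultural, ...) as discussed in Section 3.
        raise NotImplementedError
```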
3 Approximating Ethical Goal Functions

While the psychological theory of dyadic morality was useful to estimate the abstraction level at which one would at least have to specify utility functions, the closer analysis of the nature of the construction of mental states performed in Section 2 abstractly provides a superset of primitive relevant parameters that might be critical elements of every moral judgement (being a mental state). Given a perceiver x, the components of this set are the following subsets: 1) parameters encoding the interoceptive sensory array B_x (from within the body) which are accessible to human consciousness via the low-dimensional core affect, 2) the exteroceptive sensory array E_x encoding information from the environment and 3) the prior experience P_x encoding memories. Moreover, these sets of parameters obviously vary in time. However, to simplify, it has been suggested within the mentioned AU framework that ethical goal functions will have to be updated regularly (leading to a so-called socio-technological feedback-loop [Aliman and Kester, 2019b]) in the same way as votes take place at regular intervals in a democracy. One could similarly assume that this regular update will be sufficient to reflect a relevant change in moral opinion and perception.

3.1 Injecting Requisite Variety in Utility

For simplicity, we assume that the sets of parameters B_x, P_x and E_x are invariant during the utility assignment process in which a perceiver x has to specify the ethical desirability of a transition s --a--> s' by mapping it to a cardinal value U_x(s, a, s') obtained by applying a not further specified type of scientifically determined transformation v_x (chosen by x) to the mental state of x. This results in the following naive and simplified mapping, which however adequately reflects the property of mental-state-dependency formulated in the AU framework (the required dependency of ethical utility functions on parameters of the own mental state function m_x in order to avoid perverse instantiation scenarios [Aliman and Kester, 2019a]):

U_x(s, a, s') = v_x(m_x((s, a, s'), B_x, P_x, E_x))    (1)
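To fix ideas, the following hedged Python sketch instantiates the naive mapping in equation (1) under strong simplifying assumptions of our own (a mental state reduced to a weighted feature combination, and v_x chosen as a bounded squashing transformation); it is meant only to show how B_x, P_x and E_x enter the utility assignment, not as the authors' implementation.

```python
import math
from typing import Dict, Tuple

State = Dict[str, float]
Transition = Tuple[State, str, State]

def mental_state_x(transition: Transition,
                   b_x: Dict[str, float],   # interoceptive parameters (core affect)
                   p_x: Dict[str, float],   # prior-experience-derived relevance weights
                   e_x: Dict[str, float]) -> float:
    """Toy m_x: combines transition features with B_x, P_x and E_x into a scalar."""
    s, a, s_prime = transition
    # Assumed feature: the change in each measurable parameter, weighted by how
    # relevant prior experience (p_x) marks it for this perceiver.
    appraisal = sum(p_x.get(k, 0.0) * (s_prime.get(k, 0.0) - s.get(k, 0.0))
                    for k in s_prime)
    # Assumed affective and contextual modulation.
    return appraisal + b_x.get("valence", 0.0) + e_x.get("context_bias", 0.0)

def v_x(m: float) -> float:
    """Toy v_x: squashes the mental-state readout to a cardinal value in (-1, 1)."""
    return math.tanh(m)

def u_x(transition: Transition, b_x, p_x, e_x) -> float:
    """Naive instantiation of equation (1): U_x(s, a, s') = v_x(m_x(...))."""
    return v_x(mental_state_x(transition, b_x, p_x, e_x))
```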
Conversely, the utility function of classical utilitarianism is only defined at the impersonal and context-independent abstraction level of U(s'), which has been argued to lead both to the perverse instantiation problem and to the repugnant conclusion and related impossibility theorems in population ethics for consequentialist frameworks, which do not apply to mental-state-dependent utility functions [Aliman and Kester, 2019a]. The idea to restrict human ethical utility functions to the consideration of outcomes of actions alone – ignoring affective parameters of the own current self – as practiced in classical utilitarianism, while later referring to the resulting total orders with emotionally connoted adjectives such as "repugnant" or "perverse", has been termed the perspectival fallacy of utility assignment [Aliman and Kester, 2019b].

The use of consequentialist utility functions affected by the impossibility theorems of Arrhenius [2000] has been justifiably identified by Eckersley [2018] as a safety risk if used in AI systems without more ado. It seems that the isolated consideration of outcomes of actions (for consequentialism), of actions (for deontological ethics) or of the involved agents (for virtue ethics) does not represent a good model of human ethical intuitions. It is conceivable that if a utility model U is defined as a utility function U(s'), the model cannot possibly exhibit a sufficient variety and might more likely violate human ethical intuitions than if it were implemented as a context-sensitive utility function U_x(s, a, s'). (Beyond that, it has been argued that consequentialism implies the rejection of "dispositions and emotions, such as regret, disappointment, guilt and resentment" from "rational" deliberation [Verbeek, 2001] and should i.a. for this reason be disentangled from the notion of rationality, for which it cannot represent a plausible requirement.)

It is noteworthy that in the context of reinforcement learning (e.g. in robotics) different types of reward functions are usually formulated, ranging from R(s') to R(s, a, s'). For the purpose of ethical utility functions for advanced AI systems in critical application fields, we postulate that one does not have the choice to specify the abstraction level of the utility function, since for instance U(s') might lead to safety risks. Christiano et al. [2017] considered the elicitation of human preferences on trajectory (state-action pair) segments of a reinforcement learning agent, i.a. realized by human feedback on short movies. For the purpose of utility elicitation in an AU framework exemplarily using a naive model as specified in equation (1), people would similarly have to assign utility to a movie representing a transition in the future (either in a mental mode or augmented by technology such as VR or AR [Aliman and Kester, 2019b]). However, it is obvious that this naive utility assignment would not scale in practice. Moreover, it has not yet been specified how to aggregate ethical goal functions at a societal level. In the following Subsection 3.2, we address these issues by proposing a practicable approximation of the utility function in (1) and a possible societal aggregation of this approximate solution.
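Purely as an illustration of the elicitation step described above (our assumption, not a protocol from the paper), a naive loop would present simulated transitions to a perceiver – e.g. as short movies or VR/AR experiences – and record the cardinal utilities they assign; the `present` and `rate` callables are hypothetical interfaces.

```python
from typing import Callable, Dict, List, Tuple

Transition = Tuple[dict, str, dict]

def elicit_utilities(transitions: List[Transition],
                     present: Callable[[Transition], None],
                     rate: Callable[[Transition], float]) -> Dict[int, float]:
    """Naive utility elicitation: show each simulated transition and record the
    perceiver's cardinal rating (does not scale, as noted in the text)."""
    ratings: Dict[int, float] = {}
    for idx, tr in enumerate(transitions):
        present(tr)              # assumed rendering of the transition for the perceiver
        ratings[idx] = rate(tr)  # assumed interface returning a cardinal value
    return ratings
```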
3.2 Approximation, Aggregation and Validation

So far, it has been stated throughout the paper that one has to adequately increase the variety of a utility function meant to be ethical in order to avoid violations of human ethical intuitions and vulnerability to attackers crafting ethical adversarial examples against the model. However, it is important to note that despite the negatively formulated motivation of the approach, the aim is to craft a utility model U which represents a better model of human ethical intuitions in general, thus ranging from samples that are perceived as highly unethical to those that are assigned a high ethical desirability. In order to craft practical solutions that lead to optimal results, it might be advantageous to perform a thought experiment imagining a utopia and from that impose practical constraints on its viability. It might not seem realistic to deliberate a future utopia 1 as a sustainable society which is stable across a very large time interval in which every human being acts according to the ethical intuitions of all humans including the own and every artificial intelligent system fulfills the ethical intuitions of all humans. However, it seems more likely that within a utopia 2, being a stable society in which every human achieves a high level of a scientific definition of well-being (such as e.g. PERMA [Seligman, 2012]) with artificial agents acting as to maximize context-sensitive utility according to which (human or artificial) agents promoting the (measurable) well-being of human patients is regarded as the most utile type of events, the ethical intuitions of humans might tend to get closer to each other. The reason being that the variety of human moral judgements might interestingly decrease, since it is conceivable that they will tend to exhibit more similar prior experiences (all imprinted by well-being) and have more similar environments (full of stable people with a high level of well-being). The main factor drawing differences could be the body – especially biological factors. However, the parameters related to interoception might be closer to each other, since all humans exhibit a high level of well-being, which classically includes frequent positive affect. It is conceivable that with time, such a society could converge towards utopia 1.

In the following, we will denote the mentioned utopia-related ideal cognitive template of a (human or artificial) agent A performing an act w that contributes to the well-being of a human patient P with A --w--> P, in analogy to the cognitive template of dyadic morality. (Thereby, A --w--> P is perceiver-dependent i.a. because psychological measures of well-being include subjective and self-reported elements such as e.g. life satisfaction or furthermore positive emotions [Seligman, 2012].) Augmented utilitarianism foresees the need to at least depict a final goal at the abstraction level of a perceiver-dependent function on a transition as reflected in U_x(s, a, s'). The ideal cognitive template A --w--> P formulated for utopia 2, by which it has been argued that a decrease in the variety of human morality might be achievable in the long term, exhibits an abstraction level that is compatible with U_x(s, a, s').

A thinkable strategy for the design of a utility model U that is robust against ethical adversarial examples and a model of human ethical intuitions is to try to adequately increase its variety using relevant scientific knowledge and to complementarily attempt to decrease the variety of human moral judgements, for instance by considering A --w--> P as high-level final goal such that the described utopia 2 ideally becomes a self-fulfilling prophecy. For it to be realizable in practice, we suggest that the appropriateness of a given aggregated societal ethical goal function could be approximately validated against its quantifiable impact on well-being for society across the time dimension. Since it seems however unfeasible to directly map all important transitions of a domain to their effect on the well-being of human entities, we propose to consider perceiver-specific and domain-specific utility functions indicating combined preferences that each perceiver x considers to be relevant for well-being from the viewpoint of x himself in that specific domain. For these combined utility functions to be grounded in science, they will have to be based on scientifically measurable parameters. We postulate that a possible aggregation at a societal level could be performed by the following steps: 1) agreement on a common validation measure of an ethical goal function (for instance the temporal development of societal satisfaction with AI systems in a certain domain or with future AGI systems, or their aptitude to contribute to sustainable well-being), 2) agreement on a superset O of scientifically measurable and relevant parameters (encoding e.g. affective, dyadic, cultural, social, political, socio-geographical but importantly also law-relevant information) that are considered as important across the whole society, 3) specification of personal utility functions for each member n of a society of N members allowing personalized and tailored combinations of a subset of O, 4) aggregation to a societal ethical goal function U_Total(s, a, s'). Taken together, these considerations lead us to the following possible approximation for an aggregated societal ethical goal function given a domain:

U_Total(s, a, s') = \sum_{n=1}^{N} \sum_{i=1}^{j} w_{in} f_{f_{ni}}(C_i)    (2)

with N standing for the number of participating entities in society, C_i = (p_{i1}, p_{i2}, ..., p_{im}) being a cluster of m >= 1 correlated parameters (whereby independent factors are assigned an own cluster each) and f representing a set of preference functions (form functions). For instance f = {f_1, f_2, ..., f_f}, where f_1 could be a linear transformation, f_2 a concave and f_3 a convex preference function and so on. Each entity n assigns a weight w_{in} to a form function f_{f_{ni}} applied to a cluster of parameters C_i, whereby \sum_{i=1}^{j} w_{in} = 1. We define O = {C_1, C_2, ...} as the superset of all parameters considered in the overall aggregated utility function. Moreover, a ∈ A with A representing the foreseen discrete action space at the disposal of the AI. (It is important to note that while the AI could directly perform actions in the environment, it could also be used for policy-making and provide plans for human agents.) Further, we consider a continuous state space with the states s and s' ∈ S = R^|O|. Other aspects including e.g. legal rules and norms on the action space can be imposed as constraints on the utility function. In a nutshell, the utility aggregation process can be understood as a voting process in which each participating individual n distributes his vote across scientifically measurable clusters of parameters C_i on which he applies a preference function f_{f_{ni}} to which weights w_{in} are assigned, as identified as relevant by n given a to-be-approximated high-level societal goal (such as A --w--> P). In short, people do not have to agree on personal preferences and weightings, but only on a superset of acceptable parameters, an aggregation method and an overall validation measure. (Note that instead of involving society as a whole for each domain, the utility elicitation procedure can as well be approximated by a transdisciplinary set of representative experts (e.g. from the legislative) crafting expert ethical goal functions that attempt to ideally emulate U_Total(s, a, s').)
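The aggregation in equation (2) can be read as a weighted voting scheme. The following hedged Python sketch (our own simplification, with invented parameter clusters, form functions and values) shows the mechanics: each member distributes normalized weights over clusters of measurable parameters, picks a preference (form) function per cluster, and the societal function sums the resulting personal utilities.

```python
import math
from typing import Callable, Dict, List

# A cluster C_i is reduced here to a single scalar score extracted from a
# transition (s, a, s'); in practice it would bundle m >= 1 correlated,
# scientifically measurable parameters.
ClusterScore = Dict[str, float]   # e.g. {"wellbeing": 0.6, "harm_avoidance": 0.2}

# A small set of form (preference) functions f = {f_1, f_2, ...}.
FORM_FUNCTIONS: Dict[str, Callable[[float], float]] = {
    "linear":  lambda p: p,
    "concave": lambda p: math.sqrt(max(p, 0.0)),
    "convex":  lambda p: p * p,
}

class Member:
    """One participating entity n: per-cluster weights w_in (normalized to sum
    to 1) and a chosen form function f_{f_ni} per cluster."""

    def __init__(self, weights: Dict[str, float], forms: Dict[str, str]):
        total = sum(weights.values())
        self.weights = {k: w / total for k, w in weights.items()}  # enforce sum = 1
        self.forms = forms

    def utility(self, clusters: ClusterScore) -> float:
        return sum(w * FORM_FUNCTIONS[self.forms[c]](clusters[c])
                   for c, w in self.weights.items())

def u_total(clusters: ClusterScore, society: List[Member]) -> float:
    """Aggregated societal ethical goal function, in the spirit of equation (2)."""
    return sum(member.utility(clusters) for member in society)

# Usage sketch with invented values for two members and three clusters.
society = [
    Member({"wellbeing": 0.7, "harm_avoidance": 0.3},
           {"wellbeing": "concave", "harm_avoidance": "linear"}),
    Member({"wellbeing": 0.5, "lawfulness": 0.5},
           {"wellbeing": "linear", "lawfulness": "convex"}),
]
example_clusters = {"wellbeing": 0.6, "harm_avoidance": 0.2, "lawfulness": 0.9}
print(u_total(example_clusters, society))
```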
Finally, it is important to note that the societal ethical goal function specified in (2) will need to be updated (and evaluated) at regular intervals due to the mental-state-dependency of utility entailing time-dependency [Aliman and Kester, 2019a]. This leads to the necessity of a socio-technological feedback-loop which might concurrently offer the possibility of a dynamical ethical enhancement [Aliman and Kester, 2019b; Werkhoven et al., 2018]. Pre-deployment, one could in the future attempt a validation via selected preemptive simulations [Aliman and Kester, 2019b] in which (a representation of) society experiences simulations of future events (s, a, s') as movies, immersive audio-stories or later in VR and AR environments. During these experiences, one could approximately measure the temporal profile of the so-called artificially simulated future instant utility [Aliman and Kester, 2019b], denoted U_TotalAS, being a potential constituent of future well-being. Thereby, U_TotalAS refers to the instant utility [Kahneman et al., 1997] experienced during a technology-aided simulation of a future event, whereby instant utility refers to the affective dimension of valence at a certain time t. The temporal integral that a measure of U_TotalAS could approximate is specified as:

U_TotalAS(s, a, s') ≈ \sum_{n=1}^{N} \int_{t_0}^{T} I_n(t) dt    (3)

with t_0 referring to the starting point of experiencing the simulation of the event (s, a, s') augmented by technology (movie, audio-story, AR, VR) and T the end of this experience. I_n(t) represents the valence dimension of core affect experienced by n at time t. Finally, post-deployment, the ethical goal function of an AI system can be validated using the validation measure agreed upon before utility aggregation (such as the temporal development of societal-level satisfaction with an AI system, well-being or even the perception of dyadicness) that has to be a priori determined.
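As a hedged numerical illustration of equation (3) (our own sketch, with invented sampling details), the integral of each participant's valence trace over the simulated experience can be approximated from discretely sampled core-affect readings, e.g. with the trapezoidal rule:

```python
from typing import List, Sequence, Tuple

# One valence trace per participant: (time in seconds, valence in [-1, 1]) samples
# recorded while experiencing the technology-aided simulation of (s, a, s').
ValenceTrace = Sequence[Tuple[float, float]]

def trapezoid(trace: ValenceTrace) -> float:
    """Approximate the integral of I_n(t) from t_0 to T for one participant."""
    total = 0.0
    for (t_prev, v_prev), (t_next, v_next) in zip(trace, trace[1:]):
        total += 0.5 * (v_prev + v_next) * (t_next - t_prev)
    return total

def u_total_as(traces: List[ValenceTrace]) -> float:
    """Sum over participants, in the spirit of equation (3)."""
    return sum(trapezoid(trace) for trace in traces)

# Usage sketch with invented valence samples for two participants.
traces = [
    [(0.0, 0.1), (1.0, 0.4), (2.0, 0.2)],
    [(0.0, -0.2), (1.0, 0.0), (2.0, 0.3)],
]
print(u_total_as(traces))
```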
4 Conclusion and Future Work

In this paper, we motivated the need in AI value alignment to attempt to model utility functions capturing the variety of human moral judgements through the integration of relevant scientific knowledge – especially from neuroscience and psychology – (instead of learning) in order to avoid violations of human ethical intuitions. We reformulated value alignment as a security task and introduced the requirement to increase the variety within classical utility functions, positing that a utility function which does not integrate affective and perceiver-dependent dyadic information does not exhibit sufficient variety and might not exhibit robustness against corresponding adversaries. Using augmented utilitarianism as a suitable non-normative ethical framework, we proposed a methodology to implement and possibly validate societal perceiver-dependent ethical goal functions with the goal to better incorporate the requisite variety for AI value alignment.

In future work, one could extend and refine the discussed methodology, study a more systematic validation approach for ethical goal functions and perform first experimental studies. Moreover, the "security of the utility function itself is essential, due to the possibility of its modification by malevolent actors during the deployment phase" [Aliman and Kester, 2019a]. For this purpose, a blockchain-based solution might be advantageous. In addition, it is important to note that even with utility functions exhibiting a sufficient variety for AI value alignment, it might still be possible for a malicious attacker to craft adversarial examples against a utility maximizer at the perception level which might lead to unethical behavior. Besides that, one might first need to perform policy-by-simulation [Werkhoven et al., 2018] prior to the deployment of advanced AI systems equipped with ethical goal functions for safety reasons. Last but not least, the usage of ethical goal functions might represent an interesting approach to the AI coordination subtask in AI Safety, since an international use of this method might contribute to reducing the AI race to the problem-solving-ability dimension [Aliman and Kester, 2019b].

References

[Abbeel and Ng, 2004] Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, page 1. ACM, 2004.

[Abel et al., 2016] David Abel, James MacGlashan, and Michael L. Littman. Reinforcement learning as a framework for ethical decision making. In Workshops at the Thirtieth AAAI Conference on Artificial Intelligence, 2016.

[Addis, 2018] Donna Rose Addis. Are episodic memories special? On the sameness of remembered and imagined event simulation. Journal of the Royal Society of New Zealand, 48(2-3):64–88, 2018.

[Aliman and Kester, 2019a] Nadisha-Marie Aliman and Leon Kester. Augmented Utilitarianism for AGI Safety. In International Conference on Artificial General Intelligence, to appear. Springer, 2019.

[Aliman and Kester, 2019b] Nadisha-Marie Aliman and Leon Kester. Transformative AI Governance and AI-Empowered Ethical Enhancement Through Preemptive Simulations. Delphi – Interdisciplinary Review of Emerging Technologies, 2(1):23–29, 2019.

[Arrhenius, 2000] Gustaf Arrhenius. An impossibility theorem for welfarist axiologies. Economics & Philosophy, 16(2):247–266, 2000.

[Ashby, 1961] W. Ross Ashby. An Introduction to Cybernetics. Chapman & Hall Ltd, 1961.

[Asilomar, 2018] Asilomar AI Principles. Principles developed in conjunction with the 2017 Asilomar conference (Benevolent AI 2017), 2018.

[Barrett and Simmons, 2015] Lisa Feldman Barrett and W. Kyle Simmons. Interoceptive predictions in the brain. Nature Reviews Neuroscience, 16(7):419, 2015.

[Barrett et al., 2015] Lisa Feldman Barrett, Christine D. Wilson-Mendenhall, and Lawrence W. Barsalou. The conceptual act theory: A roadmap. Pages 83–110, 2015.

[Barrett, 2017a] Lisa Feldman Barrett. How Emotions Are Made: The Secret Life of the Brain. Houghton Mifflin Harcourt, 2017.

[Barrett, 2017b] Lisa Feldman Barrett. The theory of constructed emotion: an active inference account of interoception and categorization. Social Cognitive and Affective Neuroscience, 12(1):1–23, 2017.

[Cabanac, 2002] Michel Cabanac. What is emotion? Behavioural Processes, 60(2):69–83, 2002.

[Cameron et al., 2015] C. Daryl Cameron, Kristen A. Lindquist, and Kurt Gray. A constructionist review of morality and emotions: No evidence for specific links between moral content and discrete emotions. Personality and Social Psychology Review, 19(4):371–394, 2015.
[Christiano et al., 2017] Paul F. Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, pages 4299–4307, 2017.

[Conant and Ross Ashby, 1970] Roger C. Conant and W. Ross Ashby. Every good regulator of a system must be a model of that system. International Journal of Systems Science, 1(2):89–97, 1970.

[Eckersley, 2018] Peter Eckersley. Impossibility and Uncertainty Theorems in AI Value Alignment (or why your AGI should not have a utility function). arXiv preprint arXiv:1901.00064, 2018.

[Gilbert and Wilson, 2007] Daniel T. Gilbert and Timothy D. Wilson. Prospection: Experiencing the future. Science, 317(5843):1351–1354, 2007.

[Goodfellow, 2018] Ian Goodfellow. Defense Against the Dark Arts: An overview of adversarial example security research and future research directions. arXiv preprint arXiv:1806.04169, 2018.

[Goodfellow, 2019] Ian Goodfellow. Adversarial Robustness for AI Safety. https://safeai.webs.upv.es/wp-content/uploads/2019/02/2019-01-27-goodfellow.pdf, 2019.

[Gray et al., 2017] Kurt Gray, Chelsea Schein, and C. Daryl Cameron. How to think about emotion and morality: circles, not arrows. Current Opinion in Psychology, 17:41–46, 2017.

[Hadfield-Menell et al., 2016] Dylan Hadfield-Menell, Stuart J. Russell, Pieter Abbeel, and Anca Dragan. Cooperative inverse reinforcement learning. In Advances in Neural Information Processing Systems, pages 3909–3917, 2016.

[Helion and Pizarro, 2015] Chelsea Helion and David A. Pizarro. Beyond dual-processes: the interplay of reason and emotion in moral judgment. Handbook of Neuroethics, pages 109–125, 2015.

[Hoemann and Barrett, 2019] Katie Hoemann and Lisa Feldman Barrett. Concepts dissolve artificial boundaries in the study of emotion and cognition, uniting body, brain, and mind. Cognition and Emotion, 33(1):67–76, 2019. PMID: 30336722.

[Kahneman et al., 1997] Daniel Kahneman, Peter P. Wakker, and Rakesh Sarin. Back to Bentham? Explorations of experienced utility. The Quarterly Journal of Economics, 112(2):375–406, 1997.

[Kleckner et al., 2017] Ian R. Kleckner, Jiahe Zhang, Alexandra Touroutoglou, Lorena Chanes, Chenjie Xia, W. Kyle Simmons, Karen S. Quigley, Bradford C. Dickerson, and Lisa Feldman Barrett. Evidence for a large-scale brain system supporting allostasis and interoception in humans. Nature Human Behaviour, 1(5):0069, 2017.

[Leike et al., 2018] Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, and Shane Legg. Scalable agent alignment via reward modeling: a research direction. arXiv preprint arXiv:1811.07871, 2018.

[MacLean, 1990] Paul D. MacLean. The Triune Brain in Evolution: Role in Paleocerebral Functions. Springer Science & Business Media, 1990.
[Miller and Clark, 2018] Mark Miller and Andy Clark. Happily entangled: prediction, emotion, and the embodied mind. Synthese, 195(6):2559–2575, 2018.

[Norman and Bar-Yam, 2018] Joseph Norman and Yaneer Bar-Yam. Special Operations Forces: A Global Immune System? In International Conference on Complex Systems, pages 486–498. Springer, 2018.

[Oosterwijk et al., 2012] Suzanne Oosterwijk, Kristen A. Lindquist, Eric Anderson, Rebecca Dautoff, Yoshiya Moriguchi, and Lisa Feldman Barrett. States of mind: Emotions, body feelings, and thoughts share distributed neural networks. NeuroImage, 62(3):2110–2128, 2012.

[Schein and Gray, 2015] Chelsea Schein and Kurt Gray. The unifying moral dyad: Liberals and conservatives share the same harm-based moral template. Personality and Social Psychology Bulletin, 41(8):1147–1163, 2015.

[Schein and Gray, 2018] Chelsea Schein and Kurt Gray. The theory of dyadic morality: Reinventing moral judgment by redefining harm. Personality and Social Psychology Review, 22(1):32–70, 2018.

[Schein et al., 2016] Chelsea Schein, Neil Hester, and Kurt Gray. The visual guide to morality: Vision as an integrative analogy for moral experience, variability and mechanism. Social and Personality Psychology Compass, 10(4):231–251, 2016.

[Seligman, 2012] Martin E. P. Seligman. Flourish: A Visionary New Understanding of Happiness and Well-being. Simon and Schuster, 2012.

[Soares and Fallenstein, 2017] Nate Soares and Benya Fallenstein. Agent foundations for aligning machine intelligence with human interests: a technical research agenda. In The Technological Singularity, pages 103–125. Springer, 2017.

[Taylor et al., 2016] Jessica Taylor, Eliezer Yudkowsky, Patrick LaVictoire, and Andrew Critch. Alignment for advanced machine learning systems. Machine Intelligence Research Institute, 2016.

[Verbeek, 2001] Bruno Verbeek. Consequentialism, rationality and the relevant description of outcomes. Economics & Philosophy, 17(2):181–205, 2001.

[Werkhoven et al., 2018] Peter Werkhoven, Leon Kester, and Mark Neerincx. Telling autonomous systems what to do. In Proceedings of the 36th European Conference on Cognitive Ergonomics, page 2. ACM, 2018.

[Yudkowsky, 2016] Eliezer Yudkowsky. The AI Alignment Problem: Why it is Hard, and Where to Start. Symbolic Systems Distinguished Speaker, 2016.