CEUR Workshop Proceedings Vol-2282, paper EXAG_122: Learning to Generate Natural Language Rationales for Game Playing Agents. Upol Ehsan, Pradyumna Tambwekar, Larry Chan, Brent Harrison, Mark O. Riedl. https://ceur-ws.org/Vol-2282/EXAG_122.pdf (dblp: https://dblp.org/rec/conf/aiide/EhsanTCHR18)
    Learning to Generate Natural Language Rationales for Game Playing Agents

                              Upol Ehsan∗‡‡ , Pradyumna Tambwekar∗† , Larry Chan† ,
                                      Brent Harrison‡ , and Mark O. Riedl†
                                          ‡‡
                                             Department of Information Science, Cornell University
                                  †
                                      School of Interactive Computing, Georgia Institute of Technology
                                        ‡
                                          Department of Computer Science, University of Kentucky


   ∗ Denotes equal contribution.


                           Abstract

  Many computer games feature non-player character (NPC) teammates and
  companions; however, playing with or against NPCs can be frustrating
  when they perform unexpectedly. These frustrations can be avoided if
  the NPC has the ability to explain its actions and motivations. When
  NPC behavior is controlled by a black-box AI system, it can be hard
  to generate the necessary explanations. In this paper, we present a
  system that generates human-like, natural language
  explanations—called rationales—of an agent's actions in a game
  environment, regardless of how the decisions are made by a black-box
  AI. We outline a robust data collection and neural network training
  pipeline that can be used to gather think-aloud data and train a
  rationale generation model for any similar sequential, turn-based
  decision-making task. A human-subject study shows that our technique
  produces believable rationales for an agent playing the game
  Frogger. We conclude with insights about how people perceive
  automatically generated rationales.


                         Introduction

Non-player characters (NPCs) are interactive, autonomous agents that
play critical roles in most modern video games, and are often seen as
one crucial component of an engaging player experience. As NPCs are
given more autonomy to make decisions, the likelihood that they
perform in an unexpected manner increases. These situations risk
interrupting a player's engagement in the game world as they attempt
to justify the reasoning behind the unexpected NPC behavior. One
method to address this side effect of increased autonomy is to
construct NPCs that have the ability to explain their own actions and
motivations for acting.
   The generation of natural language explanations for autonomous
agents is challenging when the agent is a black-box AI, meaning that
one doesn't have access to the agent's decision-making process. Even
if access were possible, the mapping between inputs and decisions
could be difficult for people to interpret. Work by Ehsan et al. [8]
showed that machine learning models can be trained to provide relevant
and satisfactory rationales for their actions using examples of human
behavior and human-provided explanations. This is a potentially
powerful tool that could be used to create NPCs that can provide
human-understandable explanations for their own actions, without
changing the underlying decision-making algorithms. This in turn could
give users more confidence in NPCs and game playing agents and make
NPCs and agents more understandable and relatable.
   In the work by Ehsan et al., however, the rationale generation
model was trained on a semi-synthetic dataset, produced by developing
a grammar that could generate variations of actual human explanations.
While their results were promising, creating the grammar necessary to
construct the requisite training examples is a costly endeavor in
terms of authorial effort. We build on this work by developing a
pipeline to automatically acquire a corpus of human explanations that
can be used to train a rationale generation model to explain the
actions of NPCs and game playing agents. In this paper, we describe
our automated explanation corpus collection technique and neural
rationale generation model, and present the results of a
human-subjects study of human perceptions of generated rationales in
the game Frogger.


                         Related Work

Adaptive teammate/adversary cooperation in games has often been
explored through the lens of decision making [2]. Researchers have
looked to incorporate adaptive difficulty in games (cf. [3, 16]) as
well as to build NPCs which evolve by learning a player's profile, as
ways to improve the player's experience [7, 15]. What is missing from
this analysis is the conversational engagement that comes with
collaborating with another human player.
   NPCs that can communicate in natural language have previously been
explored using classical machine learning techniques. These methods
often undertake a rule-based or probabilistic modeling approach. Buede
et al. combine natural language processing with dynamic probabilistic
models to maximize rapport between two conversing agents [6]. Prior
work has also shown the capacity to use a rule-based system to create
a conversational character generator [12]. Both of these methods,
however, involve a high degree of hand-authoring. Our work can
generate NPCs with similar communicative capabilities with minimal
hand-authoring.
Figure 1: End-to-end pipeline for training a system that can generate
explanations.


   Explainable AI has attracted interest from researchers across
various domains. The authors of [1] conduct a comprehensive survey on
burgeoning trends in explainable and intelligible systems research.
Certain intelligible systems researchers look to use model-agnostic
methods to add transparency to the latent technology [13, 17]. Other
researchers use visual representations to interpret the
decision-making process of a machine learning system [9]. We situate
our system as an agent that unpacks the thought process of a human
player, if they were to play the game.
   Evaluation of explainable AI systems can be difficult because the
appropriateness of an explanation is subjective. One approach to
evaluating such systems was proposed in [5]. They presented
participants with different fictionalized explanations for the same
decision and measured perceived levels of justice among their
participants. We adopt a similar procedure to measure the quality of
generated rationales versus alternate baseline rationales.


               Learning to Generate Rationales

We define a rationale as an explanation that justifies an action based
on how a human would think. These rationales do not reveal the true
decision-making process of an agent, but still provide insights about
why an agent made a decision in a form that is easy for non-experts to
understand.
   Rationale generation requires translating events in the game
environment into natural language outputs. Our approach to rationale
generation involves two steps: (1) collect a corpus of think-aloud
data from players who explained their actions in a game environment;
and (2) use this corpus to train an encoder-decoder network to
generate plausible rationales for any action taken by an agent (see
Figure 1).

Data Collection Interface

There is no readily available dataset for the task of learning to
generate explanations. Thus, we developed a methodology to collect
live "think-aloud" data from players as they played through a game.
This section covers the two objectives of our data collection
endeavor:

1. Create a think-aloud protocol in which players provide natural
   rationales for their actions.

2. Design an intuitive player experience that facilitates accurate
   matching of the participants' utterances to the appropriate state
   in the environment.

   To train an agent to generate rationales, we need data linking game
states and actions to their corresponding natural language
explanations. To achieve this goal, we built a modified version of
Frogger in which players simultaneously play the game and explain each
of their actions. The entire process is divided into three phases:
(1) a guided tutorial, (2) rationale collection, and (3) transcribed
explanation review.
   During the guided tutorial, our interface provides instruction on
how to play through the game, how to provide natural language
explanations, and how to review/modify any explanations they have
given. This helps ensure that users are familiar with the interface
and its use before they begin providing explanations.
   During explanation collection, users play through the game while
explaining their actions out loud. Figure 2 shows the game embedded
into the explanation collection interface. To help couple explanations
with actions, the game pauses for 10 seconds after an action is taken.
During this time, the player's microphone automatically turns on and
the player is asked to explain their most recent action while a
speech-to-text library transcribes the explanation.

Figure 2: Players take an action and verbalize their rationale for
that action. (1) After taking each action, the game pauses for 10
seconds. (2) Speech-to-text transcribes the participant's rationale
for the action. (3) Participants can view their transcribed rationales
in near real time and edit them, if needed.

   Participants can view their transcribed text and edit it if
necessary. During preliminary testing, we observed that players often
repeat a move for which the explanation is the same. For ease,
participants can indicate that the explanation
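The collection loop described above (pause after each action, open the microphone, transcribe, and allow a "same as my last move" shortcut) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class and method names are invented, and the speech-to-text backend is passed in as a stubbed callback.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class RationaleRecord:
    state: str       # serialized game state at the moment of the action
    action: str      # the move the player just made
    rationale: str   # transcribed think-aloud explanation

@dataclass
class ThinkAloudCollector:
    transcribe: Callable[[], str]   # speech-to-text backend (stubbed here)
    pause_seconds: float = 10.0     # game pause after each action
    records: List[RationaleRecord] = field(default_factory=list)

    def on_action(self, state: str, action: str,
                  reuse_last: bool = False) -> RationaleRecord:
        """Pause the game, capture the spoken rationale, and log the pair."""
        time.sleep(self.pause_seconds)  # the game freezes while the mic is open
        if reuse_last and self.records:
            # player flagged "same explanation as my last move"
            text = self.records[-1].rationale
        else:
            text = self.transcribe()
        record = RationaleRecord(state, action, text)
        self.records.append(record)
        return record
```

With a stub such as `ThinkAloudCollector(transcribe=lambda: "hopping forward to cross the road", pause_seconds=0)`, the collector accumulates (state, action, rationale) triples ready to serve as a parallel training corpus.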
accompanying their most recent action is the same as that of the last
action performed.
   During transcribed explanation review, users are given one final
opportunity to review and edit the explanations given during gameplay
(see Figure 3). Players can step through all of the actions they
performed in the game and see their accompanying transcribed
explanations, so they can see the game context in which their
explanations were given.

Figure 3: Players can step through each of their action-rationale
pairs and edit if necessary. (1) Players can watch a replay of their
actions while editing their rationales. (2) Players use these buttons
to control the flow of their step-through. (3) The rationale for the
current action gets highlighted for review.

   The interface is designed so that no manual hand-authoring/editing
of our data was required before pushing it into our machine learning
model. Throughout the game, players were given the opportunity to
organically edit their own data without impeding their workflow. This
added layer of frictionless editing was crucial in ensuring that we
could directly input the collected data into the network with zero
manual cleaning.
   One core strength that facilitates transferability is that our
pipeline is environment and domain agnostic. While we use Frogger as a
test environment in our experiments, a similar user experience can be
designed using other turn-based environments with minimal effort.

Neural Translation Model

We use an encoder-decoder network to learn to generate relevant
natural language explanations for any given action. These kinds of
networks are commonly used for machine translation and dialogue
generation tasks, and their ability to capture sequential dependencies
between the input and the output makes them suitable for our task. Our
encoder-decoder architecture is similar to that used in [8]. The
network learns how to translate the input game state representation
X = x1, x2, ..., xn, comprised of the sprite representation of the
game combined with other influencing factors, into an output
explanation as a sequence of words Y = y1, y2, ..., ym, where yi is a
word. The input X has a fixed size of 261 tokens encompassing the game
state representation, lives left, and the location of the frog. The
vocabulary sizes for the encoder and the decoder are 491 and 1104,
respectively. Thus our network learns to translate game state and
action information into natural language rationales.
   The encoder and decoder are both recurrent neural networks (RNNs)
comprised of GRU cells. The decoder network uses an additional
attention mechanism [11] to learn to weight the importance of
different components of the input with regard to their effect on the
output.
   To simplify the learning process, the state of the game environment
is converted into a sequence of symbols where each symbol represents a
type of sprite. To this, we append information concerning Frogger's
position, the most recent action taken, and the number of lives the
player has left to create the input representation X. On top of this
network structure, we vary the input configurations with the intention
of producing varying styles of rationales. These two configurations
are titled the focused-view configuration and the complete-view
configuration and are used throughout the experiments presented in
this paper.

Focused-view Configuration In this configuration we used a windowed
representation of the grid, i.e., only a 7 × 7 window around the frog
was used in the input. Both playing an optimal game of Frogger and
generating relevant explanations for the current action typically
require only this much local context. Therefore, providing the agent
with only the window around Frogger helps the agent produce
explanations grounded in its neighborhood. In this configuration we
prioritized rationales focused on short-term awareness over long-term
planning.

Complete-view Configuration The complete-view configuration is an
alternate setup that provides the entire game board as context for
rationale generation. There are two differences between this
configuration and the focused-view configuration. First, instead of
showing the network only a window of the game, we use the entire game
screen as part of the input. The agent now has the opportunity to
learn which other long-term factors in the game may influence its
rationale. Second, we added noise to each game state to force the
network to generalize when learning to generate rationales and to give
the model equal opportunity to consider factors from all sectors of
the game screen. In this case, noise was introduced by replacing input
grid values with dummy values. For each grid element, there was a 20%
chance that it would be replaced with a dummy value.


             Human Perception of Rationales Study

In this section, we attempt to assess whether the rationales generated
by our system outperform baselines. We further attempt to understand
the underlying components that influence differences in the
perceptions of the generated rationales along four dimensions of human
factors: confidence, human-likeness, adequate justification, and
understandability. Frogger is a good candidate for our experimental
design of a rationale generation pipeline for general sequential
decision-making tasks because it is a simple Markovian environment;
that is, the reasons for each action can be easily separated, making
it an ideal stepping stone towards a real-world environment.
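As a concrete illustration of the two input configurations, the sketch below builds the symbol-sequence input from a sprite grid. The padding symbol, dummy token, and exact token layout are assumptions for illustration; the paper specifies only the 7 × 7 window, the appended position/action/lives information, and the 20% noise rate.

```python
import random

DUMMY = "?"   # dummy symbol used for noise injection (assumed token)
PAD = "#"     # out-of-bounds padding symbol (assumed token)

def focused_view(grid, frog_pos, window=7):
    """Crop a window x window neighborhood (7 x 7 in the paper) around the frog,
    padding with PAD where the window falls off the board."""
    r0, c0 = frog_pos
    half = window // 2
    return [[grid[r][c] if 0 <= r < len(grid) and 0 <= c < len(grid[0]) else PAD
             for c in range(c0 - half, c0 + half + 1)]
            for r in range(r0 - half, r0 + half + 1)]

def complete_view(grid, noise_rate=0.2, rng=random):
    """Full board with each cell independently replaced by a dummy symbol
    with probability noise_rate (20% in the paper)."""
    return [[DUMMY if rng.random() < noise_rate else cell for cell in row]
            for row in grid]

def encode_state(view, frog_pos, action, lives):
    """Flatten the sprite grid and append position, last action, and lives left,
    mirroring the fixed-length token sequence X fed to the encoder."""
    tokens = [cell for row in view for cell in row]
    tokens += [f"x={frog_pos[1]}", f"y={frog_pos[0]}",
               f"act={action}", f"lives={lives}"]
    return tokens
```

For a 7 × 7 focused view, `encode_state` produces 49 sprite tokens plus the appended metadata; the paper's full 261-token input would correspond to the complete game screen plus the same appended fields.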
Figure 4: Screenshot from the user study (setup 2) depicting the
action taken and the rationales: P = Random, Q = Exemplary,
R = Candidate.

   To gather the training set of game state annotations, we deployed
our data collection pipeline on TurkPrime [10]. From 60 participants
we collected over 2000 samples of human explanations corresponding to
images of the game at the moments the explanations were made. This
parallel corpus of collected game state images and natural language
rationales was used to train the encoder-decoder rationale generation
network. Each RNN in the encoder and the decoder was parameterized
with GRU cells with a hidden vector size of 256. The entire
encoder-decoder network was trained for 100 epochs.
   We recruited an additional 128 participants for our study through
TurkPrime [10], split into two experimental groups: Group 1 (age range
= 23-68, M = 37.4, SD = 9.92) and Group 2 (age range = 24-59, M =
35.8, SD = 7.67). Forty-six percent of our participants were women,
and only two countries, the United States and India, were reported
when participants were asked which country they were from; 93% of all
128 participants reported that they resided in the United States.

Procedure

Participants watched a series of five videos, each containing an
action taken by an agent playing Frogger. In each video, the action
was accompanied by three rationales generated by three different
techniques (see Figure 4):

• The exemplary rationale is the rationale from our corpus that 3
  researchers unanimously agreed on as the best one for a particular
  action. Researchers independently selected rationales they deemed
  best and iterated till consensus was reached. This is provided as an
  upper bound for contrast with the next two techniques.

• The candidate rationale is the rationale produced by our network, in
  either the focused-view or complete-view configuration.

• The random rationale is a randomly chosen rationale from our corpus.

For each rationale, participants used a 5-point Likert scale to rate
their endorsement of each of the following four statements, which
correspond to four dimensions of interest:

D1. Confidence: This rationale makes me confident in the character's
    ability to perform its task.

D2. Human-likeness: This rationale looks like it was made by a human.

D3. Adequate justification: This rationale adequately justifies the
    action taken.

D4. Understandability: This rationale helped me understand why the
    agent behaved as it did.

Response options on the Likert scale ranged from "strongly disagree"
to "strongly agree." In a free-text field, participants explained why
the ratings they gave for a particular set of three rationales were
similar or different. After answering these questions, they provided
demographic information.

Quantitative Results and Analysis

We used a multi-level model to analyze both between-subjects and
within-subjects variables. There were significant main effects of
rationale style (χ2(2) = 594.80, p < .001) and dimension (χ2(2) =
66.86, p < .001) on the ratings. The main effect of experimental group
was not significant (χ2(1) = 0.070, p = 0.79). Figure 5 shows the
average responses to each question for the two different experimental
groups. Our results support our hypothesis that rationales generated
with the focused-view generator and the complete-view generator were
judged significantly better across all dimensions than the random
baseline (b = 1.90, t(252) = 8.09, p < .001). Our results also show
that rationales generated by the candidate techniques were judged
significantly lower than the exemplary rationales.
   The difference between the focused-view candidate rationales and
the exemplary rationales was significantly greater than the difference
between the complete-view candidate rationales and the exemplary
rationales (p = .005). Surprisingly, this was because the exemplary
rationales were rated lower in the presence of complete-view candidate
rationales (t(1530) = −32.12, p < .001). Since three rationales were
presented simultaneously in each video, it is likely that participants
were rating the rationales relative to each other. We also observe
that the complete-view candidate rationales received overall higher
ratings than the focused-view candidate rationales (t(1530) = 8.33,
p < .001).
   In summary, we established that both the focused-view and
complete-view configurations produce believable rationales that
perform significantly better than the random baseline along four
human-factors dimensions. While the complete-view candidate rationales
were judged preferable overall to the focused-view candidate
rationales, we did not compare them directly to each other because one
style may be better suited than the other depending on the task and/or
game. Our between-subjects study methodology is suggestive but cannot
be used to prove any claims comparing the two experimental conditions.

Qualitative Analysis

In this section, we look at the open-ended responses
provided by our participants to better understand the
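For readers who want to reproduce this style of analysis, the toy helpers below show the likelihood-ratio comparison behind χ² statistics of the kind reported above, together with a simple per-style/per-dimension Likert average. The log-likelihood values in the usage note are made up for illustration; a real analysis would fit the nested multi-level models with a statistics package rather than hand-computed values.

```python
from collections import defaultdict

def mean_ratings(responses):
    """Average 5-point Likert ratings grouped by (rationale style, dimension).

    responses: iterable of (style, dimension, rating) tuples."""
    sums = defaultdict(lambda: [0.0, 0])
    for style, dim, rating in responses:
        cell = sums[(style, dim)]
        cell[0] += rating
        cell[1] += 1
    return {key: total / n for key, (total, n) in sums.items()}

def likelihood_ratio_chi2(ll_reduced, ll_full):
    """Statistic 2 * (LL_full - LL_reduced) for comparing nested models;
    under the null it is chi-square distributed with df equal to the
    difference in the number of parameters."""
    return 2.0 * (ll_full - ll_reduced)

# alpha = .05 critical values for small degrees of freedom
CHI2_CRIT = {1: 3.841, 2: 5.991}
```

For example, `likelihood_ratio_chi2(-1500.0, -1400.0)` gives 200.0, well above `CHI2_CRIT[2]`, so with 2 degrees of freedom the effect would be significant at α = .05 (the two log-likelihoods here are hypothetical).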
                                                               Table 1: Descriptions for the emergent components
                                                               underlying the human-factor dimensions of the generated
                                                               rationales.

                                                                   Component            Description
                                                                Contextual Accuracy     Accurately describes pertinent events
                                                                                        in the context of the environment.
                                                                    Intelligibility     Typically error-free and is coherent in
                                                                                        terms of both grammar and sentence
                                                                                        structure.
                                                                     Awareness          Depicts and adequate understanding of
                                                                                        the rules of the environment.
                                                                     Relatability       Expresses the justification of the
                                                                                        action in a relatable manner and style.
                  (a) Focus-View condition.                        Strategic Detail     Exhibits strategic thinking, foresight,
                                                                                        and planning.



                                                               Confidence (D1) This dimension gauges the participant’s
                                                               faith in the agent’s ability to successfully complete it’s
                                                               task and has contextual accuracy, awareness, strategic
                                                               detail, and intelligibility as relevant components. With
                                                               respect to contextual accuracy, rationales that displayed
                                                               “. . . recognition of the environmental conditions and
                                                               [adaptation] to the conditions” (P22) were a positive
                                                               influence on confidence ratings, while redundant
                                                               information such as “just stating the obvious” (P42)
                                                               hindered confidence ratings.
                (b) Complete-View condition.                       Rationales that showed awareness “. . . of upcoming
                                                               dangers and what the best moves to make . . . [and] a
            Figure 5: Human judgment results.                  good way to plan” (P17) inspired confidence from the
                                                               participants. In terms of strategic detail, rationales that
                                                               showed ”. . . long-term planning and ability to analyze
criteria that participants used when making judgments about the confidence, human-likeness, adequate justification, and understandability of generated rationales. These situated insights augment our understanding of rationale-generating systems, enabling us to design better ones in the future.
   We analyzed the open-ended justifications participants provided using a combination of thematic analysis [4] and grounded theory [14]. We developed codes that addressed the different types of reasoning behind the ratings of the four dimensions under investigation. Next, the research team clustered the codes under emergent themes, which form the underlying components of the dimensions. Iterating until consensus was reached, the researchers settled on the five most relevant components: (1) Contextual Accuracy, (2) Intelligibility, (3) Awareness, (4) Relatability, and (5) Strategic Detail (see Table 1). To varying degrees, multiple components influence more than one dimension; that is, there is not a mutually exclusive one-to-one relationship between components and dimensions.
   The remainder of this section shares our conclusions about how these components influence the dimensions of the human factors under investigation. When providing examples of our participants’ responses, we use the following notation: P1 corresponds to participant 1, P2 corresponds to participant 2, etc.

information” (P28) yielded higher confidence ratings, whereas those that were “...short-sighted and unable to think ahead” (P14) led to lower perceptions of confidence.
   Intelligibility alone, without awareness or strategic detail, was not enough to yield high confidence in rationales. Moreover, rationales that were unintelligible or incoherent had a negative impact on participants’ confidence:

   The [random and focused-view rationales] include major mischaracterizations of the environment by referring to an object not present or wrong time sequence, so I had very low confidence. (P66)

Human-likeness (D2) Intelligibility, relatability, and strategic detail are the components that influenced participants’ perception of the extent to which the rationales were made by a human. Notably, intelligibility had mixed influences on the human-likeness of the rationales, depending on what participants thought “being human” entailed. Some perceived humans to be fallible and rated rationales with errors as more human-like because rationales “...with typos or spelling errors...seem even more likely to have been generated by a human” (P19). Conversely, some thought error-free rationales must come from a human, citing that a “computer just does not have the knowledge to understand what is going on” (P24).
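Taken together, the relationships between the five components and the four dimensions discussed in this section are many-to-many. As a purely illustrative sketch (the assignments below are compiled from the findings reported here, not from any artifact of the study), this structure can be written as a simple mapping:

```python
# Illustrative only: the five components from the qualitative analysis,
# mapped to the dimensions each was observed to influence. Note the
# mapping is many-to-many, not one-to-one.
COMPONENT_TO_DIMENSIONS = {
    "contextual accuracy": {"adequate justification (D3)", "understandability (D4)"},
    "intelligibility":     {"confidence (D1)", "human-likeness (D2)"},
    "awareness":           {"confidence (D1)", "adequate justification (D3)"},
    "relatability":        {"human-likeness (D2)", "understandability (D4)"},
    "strategic detail":    {"confidence (D1)", "human-likeness (D2)"},
}

def components_for(dimension):
    """Components observed to influence the given dimension."""
    return {c for c, dims in COMPONENT_TO_DIMENSIONS.items() if dimension in dims}
```

For example, `components_for("human-likeness (D2)")` yields intelligibility, relatability, and strategic detail, matching the discussion of that dimension above.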
   With respect to relatability, rationales were often perceived as more human-like when participants felt that “it mirrored [their] thoughts” (P49), and “...[layed] things out in a way that [they] would have” (P58). Affective rationales had high relatability because they “express human emotions including hope and doubt” (P11).
   Like intelligibility, strategic planning had a mixed impact on human-likeness, as it also depended on participants’ perceptions of critical thinking and logical planning. Some participants associated “...critical thinking [and ability to] predict future situations” (P6) with human-likeness, whereas others associated logical planning with a non-human-like, computer-like rigid and algorithmic thinking process.

Adequate Justification (D3) This dimension unpacks the extent to which participants think the rationale adequately justifies the action taken; it is influenced by contextual accuracy and awareness. Participants downgraded rationales containing low levels of contextual accuracy, such as irrelevant details. As P11 puts it:

   The [random and exemplary rationales] don’t pertain to this situation. [The Complete View] does, and is clearly the best justification for the action that Frogger took because it moves him towards his end goal.

   Beyond contextual accuracy, rationales that showcase awareness of surroundings rated high on the adequate justification dimension. For instance, P11 rated the random rationale low because it showed “no awareness of the surroundings”. For the same action, P11 gave high ratings to the exemplary and focused-view rationales because each made the participant “...believe in the character’s ability to judge their surroundings.”

Understandability (D4) For this dimension, components such as contextual accuracy and relatability influence participants’ perceptions of how much the rationales helped them understand the motivation behind the agent’s actions. Contextually accurate rationales had a strong influence on understandability. In fact, many participants expressed that contextual accuracy, not the length of the rationale, was what mattered for understandability. While comparing the exemplary and focused-view rationales for understandability, P41 made a notable observation:

   The [exemplary and focused-view rationale] both described the activities/objects in the immediate vicinity of the frog. However, [exemplary] was not as strong as [focused-view] given the frog did not have to move just because of the car in front of him. [Focused-view] does a better job of providing understanding of the action

   Participants put themselves in the agent’s shoes and evaluated the understandability of the rationales based on how relatable they were. In essence, some asked, “Are these the same reasons I would [give] for this action?” (P43). The more relatable the rationale was, the higher it scored for understandability.

Design Implications
Understanding the components and dimensions can help us design better autonomous agents from a human-factors perspective. These insights can also enable tweaking the network configuration and reverse-engineering it to maximize the likelihood of producing rationale styles that meet the needs of the task, game, or agent persona.
   For instance, given the nature of the inputs, choosing a network configuration similar to the focused-view can afford the generation of contextually accurate rationales. On the other hand, the complete-view network configuration can produce rationales with a higher degree of strategic detail, which can be beneficial in contexts where detail is important, such as an explainable oracle. Moreover, an in-game tutorial or a companion agent can be designed using a network configuration that generates relatable outputs to keep the player entertained and engaged.

Future Work
We can extend our current work to other domains of Explainable AI, exploring applications for other sequential decision-making tasks. We also plan to deploy our rationale generator with a collaborative NPC in an interactive game to investigate how the perception of a collaborative agent changes when players interact with it longitudinally (over an extended period of time). This longitudinal approach can help us understand the novelty effects of rationale-generating agents. Beyond NPCs, our techniques can improve teaching and collaboration in games, especially around improvisation and co-creative collaboration in game-level design.
   Our data collection pipeline is currently designed to work with discrete-action games that have natural break points where the player can be asked for explanations with less disruption than in continuous-time and continuous-action games. The next challenge is to extend and test our approach in more continuous spaces, where states aren’t as well defined and rationales are harder to capture from moment to moment.

Conclusions
In this paper, we explore how human justifications for their actions in a video game can be used to train a system to generate explanations for the actions of autonomous game-playing agents. We introduce a pipeline for automatically gathering a parallel corpus of game states annotated with human explanations and show how this corpus can be used to train encoder-decoder networks. The resultant model translates the state of the game and the action performed by the agent into natural language, which we call a rationale. The rationales generated by our technique are judged better than those of a random baseline and close to matching the upper bound of human rationales. By enabling autonomous agents to communicate the motivations for their actions, we hope to provide users with greater confidence in the agents while increasing perceptions of understanding and relatability.
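The encoder-decoder translation summarized above can be illustrated with a minimal, self-contained sketch. Everything below is a toy stand-in assumed for illustration: the vocabularies, the single-layer recurrent cells, and the dot-product attention (in the spirit of Luong et al. [11]) use random, untrained weights. It shows only the data flow from game-state/action tokens to a generated word sequence, not the authors’ trained model.

```python
import math
import random

random.seed(0)
H = 8  # toy hidden size

def rand_vec(n):
    return [random.uniform(-1.0, 1.0) for _ in range(n)]

def rand_mat(rows, cols):
    return [rand_vec(cols) for _ in range(rows)]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def tanh_vec(v):
    return [math.tanh(x) for x in v]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

class TinyRationaleGenerator:
    """Toy encoder-decoder: game-state/action tokens in, rationale words out."""

    def __init__(self, in_vocab, out_vocab):
        self.in_index = {w: i for i, w in enumerate(in_vocab)}
        self.out_vocab = list(out_vocab)
        self.embed = rand_mat(len(in_vocab), H)       # input embeddings
        self.w_enc = rand_mat(H, 2 * H)               # encoder cell reads [x; h]
        self.w_dec = rand_mat(H, 2 * H)               # decoder cell reads [h; context]
        self.w_out = rand_mat(len(out_vocab), 2 * H)  # output layer reads [h; context]

    def encode(self, tokens):
        # Simple recurrent encoder: h' = tanh(W [x; h]).
        h, states = [0.0] * H, []
        for tok in tokens:
            x = self.embed[self.in_index[tok]]
            h = tanh_vec(matvec(self.w_enc, x + h))  # list + list concatenates
            states.append(h)
        return states

    def decode_step(self, h, states):
        # Dot-product attention over the encoder states.
        weights = softmax([sum(a * b for a, b in zip(h, s)) for s in states])
        context = [sum(w * s[i] for w, s in zip(weights, states)) for i in range(H)]
        logits = matvec(self.w_out, h + context)
        word = self.out_vocab[max(range(len(logits)), key=logits.__getitem__)]
        return word, context

    def rationale(self, state_tokens, max_len=5):
        # Greedily emit one rationale word per step, attending over the input.
        states = self.encode(state_tokens)
        h, words = states[-1], []
        for _ in range(max_len):
            word, context = self.decode_step(h, states)
            words.append(word)
            h = tanh_vec(matvec(self.w_dec, h + context))
        return words
```

For example, constructing `TinyRationaleGenerator(["car_left", "log_right", "move_up"], ["I", "moved", "up", "to", "avoid", "the", "car"])` and calling `rationale(["car_left", "move_up"])` returns a five-word sequence drawn from the output vocabulary; a trained model would instead learn these weights from the annotated corpus.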
                        References
 [1]   Ashraf Abdul et al. “Trends and trajectories for explainable, accountable and intelligible systems: An HCI research agenda”. In: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. ACM. 2018, p. 582.
 [2]   Aswin Thomas Abraham and Kevin McGee. “AI for dynamic team-mate adaptation in games”. In: Computational Intelligence and Games (CIG), 2010 IEEE Symposium on. IEEE. 2010, pp. 419–426.
 [3]   Maria-Virginia Aponte, Guillaume Levieux, and Stéphane Natkin. “Scaling the level of difficulty in single player video games”. In: International Conference on Entertainment Computing. Springer. 2009, pp. 24–35.
 [4]   J. Aronson. “A pragmatic view of thematic analysis”. In: The Qualitative Report 2.1 (1994).
 [5]   Reuben Binns et al. “’It’s Reducing a Human Being to a Percentage’: Perceptions of Justice in Algorithmic Decisions”. In: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. ACM. 2018, p. 377.
 [6]   Dennis M. Buede, Paul J. Sticha, and Elise T. Axelrad. “Conversational Non-Player Characters for Virtual Training”. In: Social, Cultural, and Behavioral Modeling. Ed. by Kevin S. Xu et al. Cham: Springer International Publishing, 2016, pp. 389–399. ISBN: 978-3-319-39931-7.
 [7]   Silvia Coradeschi and Lars Karlsson. “A role-based decision-mechanism for teams of reactive and coordinating agents”. In: Robot Soccer World Cup. Springer. 1997, pp. 112–122.
 [8]   Upol Ehsan et al. “Rationalization: A Neural Machine Translation Approach to Generating Natural Language Explanations”. In: Proceedings of the AAAI Conference on Artificial Intelligence, Ethics, and Society. Feb. 2018.
 [9]   Josua Krause, Adam Perer, and Kenney Ng. “Interacting with predictions: Visual inspection of black-box machine learning models”. In: Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems. ACM. 2016, pp. 5686–5697.
[10]   Leib Litman, Jonathan Robinson, and Tzvi Abberbock. “TurkPrime.com: A versatile crowdsourcing data acquisition platform for the behavioral sciences”. In: Behavior Research Methods 49.2 (2017), pp. 433–442.
[11]   Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. “Effective approaches to attention-based neural machine translation”. In: arXiv preprint arXiv:1508.04025 (2015).
[12]   Grant Pickett, Foaad Khosmood, and Allan Fowler. “Automated generation of conversational non player characters”. In: Eleventh Artificial Intelligence and Interactive Digital Entertainment Conference. Vol. 362. 2015.
[13]   Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. “Why should I trust you?: Explaining the predictions of any classifier”. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM. 2016, pp. 1135–1144.
[14]   Anselm Strauss and Juliet Corbin. “Grounded theory methodology”. In: Handbook of Qualitative Research 17 (1994), pp. 273–285.
[15]   Chek Tien Tan and Ho-lun Cheng. “Personality-based Adaptation for Teamwork in Game Agents”. In: AIIDE. 2007, pp. 37–42.
[16]   Sang-Won Um, Tae-Yong Kim, and Jong-Soo Choi. “Dynamic difficulty controlling game system”. In: IEEE Transactions on Consumer Electronics 53.2 (2007).
[17]   Jason Yosinski et al. “Understanding neural networks through deep visualization”. In: arXiv preprint arXiv:1506.06579 (2015).