<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Learning to Generate Natural Language Rationales for Game Playing Agents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Upol Ehsan</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pradyumna Tambwekar</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Larry Chan</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Brent Harrison</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mark O. Riedl</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>Many computer games feature non-player character (NPC) teammates and companions; however, playing with or against NPCs can be frustrating when they behave unexpectedly. These frustrations can be avoided if the NPC can explain its actions and motivations. When NPC behavior is controlled by a black-box AI system, it can be hard to generate the necessary explanations. In this paper, we present a system that generates human-like, natural language explanations, called rationales, of an agent's actions in a game environment regardless of how the black-box AI makes its decisions. We outline a robust data collection and neural network training pipeline that can be used to gather think-aloud data and train a rationale generation model for any similar sequential, turn-based decision-making task. A human-subject study shows that our technique produces believable rationales for an agent playing the game Frogger. We conclude with insights about how people perceive automatically generated rationales.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Non-player characters (NPCs) are interactive, autonomous
agents that play critical roles in most modern video games,
and are often seen as one crucial component of an engaging
player experience. As NPCs are given more autonomy
to make decisions, the likelihood that they perform in
an unexpected manner increases. These situations risk
interrupting a player’s engagement in the game world as they
attempt to justify the reasoning behind the unexpected NPC
behavior. One method to address this side-effect of increased
autonomy is to construct NPCs that have the ability to
explain their own actions and motivations for acting.</p>
      <p>
        The generation of natural language explanations for
autonomous agents is challenging when the agent is a
black-box AI, meaning that one doesn’t have access to
the agent’s decision-making process. Even if access were
possible, the mapping between inputs and decisions could
be difficult for people to interpret. Work by Ehsan et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
showed that machine learning models can be trained to
provide relevant and satisfactory rationales for their actions
using examples of human behavior and human-provided
explanations. This is a potentially powerful tool that
could be used to create NPCs that provide
human-understandable explanations for their own actions without
changing the underlying decision-making algorithms. This
in turn could give users more confidence in NPCs and
game-playing agents and make those agents more
understandable and relatable.
      </p>
      <p>[Author footnote: Denotes equal contribution.]</p>
      <p>In the work by Ehsan et al., however, the rationale
generation model was trained using a semi-synthetic dataset
by developing a grammar that could generate variations
of actual human explanations to train their machine.
While their results were promising, creating the grammar
necessary to construct the requisite training examples is a
costly endeavor in terms of authorial effort. We build on this
work by developing a pipeline to automatically acquire a
corpus of human explanations that can be used to train a
rationale generation model to explain the actions of NPCs
and game-playing agents. In this paper, we describe our
automated explanation corpus collection technique and neural
rationale generation model, and present the results of a
human-subjects study of human perceptions of generated
rationales in the game Frogger.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Adaptive team-mate/adversary cooperation in games has
often been explored through the lens of decision making [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
Researchers have looked to incorporate adaptive difficulty
in games (cf. [
        <xref ref-type="bibr" rid="ref16 ref3">3, 16</xref>
        ]) as well as build NPCs which evolve
by learning a player’s profile as ways to improve the player’s
experience [
        <xref ref-type="bibr" rid="ref15 ref7">7, 15</xref>
        ]. What is missing from this analysis is the
conversational engagement that comes with collaborating
with another human player.
      </p>
      <p>
        NPCs that can communicate in natural language have
previously been explored using classical machine learning
techniques. These methods often take a rule-based
or probabilistic modeling approach. Buede et al. combine
natural language processing with dynamic probabilistic
models to maximize rapport between two conversing agents
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Prior work has also shown the capacity to use a
rule-based system to create a conversational character
generator [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Both of these methods, however, require
a high degree of hand-authoring to build
their models. Our work can generate NPCs with similar
communicative capabilities with minimal hand-authoring.
      </p>
      <p>
        Explainable AI has attracted interest from researchers
across various domains. The authors of [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] conduct a
comprehensive survey on burgeoning trends in explainable
and intelligible systems research. Certain intelligible
systems researchers look to use model-agnostic methods
to add transparency to the latent technology [
        <xref ref-type="bibr" rid="ref13 ref17">13, 17</xref>
        ].
Other researchers use visual representations to interpret the
decision-making process of a machine learning system [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
We situate our system as an agent that unpacks the thought
process of a human player, if they were to play the game.
      </p>
      <p>
        Evaluation of explainable AI systems can be difficult
because the appropriateness of an explanation is subjective.
One approach to evaluating such systems was proposed in
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. They presented participants with different fictionalized
explanations for the same decision and measured perceived
levels of justice among their participants. We adopt a similar
procedure to measure the quality of generated rationales
versus alternate baseline rationales.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Learning to Generate Rationales</title>
      <p>We define a rationale as an explanation that justifies an
action based on how a human would think. These rationales
do not reveal the true decision-making process of an agent,
but still provide insights about why an agent made a decision
in a form that is easy for non-experts to understand.</p>
      <p>Rationale generation requires translating events in the
game environment into natural language outputs. Our
approach to rationale generation involves two steps:
(1) collect a corpus of think-aloud data from players who
explained their actions in a game environment; and (2) use
this corpus to train an encoder-decoder network to generate
plausible rationales for any action taken by an agent (see
Figure 1).</p>
      <sec id="sec-3-1">
        <title>Data Collection Interface</title>
        <p>There is no readily available dataset for the task of
learning to generate explanations. Thus, we developed a
methodology to collect live “think-aloud” data from players
as they played through a game. This section covers the two
objectives of our data collection endeavor:
<list list-type="order">
<list-item><p>Create a think-aloud protocol in which players provide
natural rationales for their actions.</p></list-item>
<list-item><p>Design an intuitive player experience that facilitates
accurate matching of the participants’ utterances to the
appropriate state in the environment.</p></list-item>
</list></p>
        <p>To train an agent to generate rationales we need data
linking game states and actions to their corresponding
natural language explanations. To achieve this goal, we
built a modified version of Frogger in which players
simultaneously play the game and also explain each of their
actions. The entire process is divided into three phases: (1) a
guided tutorial, (2) explanation collection, and (3) transcribed
explanation review.</p>
        <p>During the guided tutorial, our interface instructs
users on how to play through the game, how to provide
natural language explanations, and how to review and modify
any explanations they have given. This helps ensure that
users are familiar with the interface and its use before they
begin providing explanations.</p>
        <p>During explanation collection, users play through the
game while explaining their actions out loud. Figure 2 shows
the game embedded into the explanation collection interface.
To help couple explanations with actions, the game pauses
for 10 seconds after an action is taken. During this time,
the player’s microphone automatically turns on and the
player is asked to explain their most recent action while a
speech-to-text library transcribes the explanation.</p>
        <p>Participants can view their transcribed text and edit it
if necessary. During preliminary testing, we observed that
players often repeat a move for the same reason. For
convenience, participants can indicate that the explanation
accompanying their most recent action is the same as
that of the previous action.</p>
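        <p>The coupling of actions and explanations described above can be sketched as follows. This is a minimal illustration, not the interface's actual implementation: the names <monospace>Annotation</monospace> and <monospace>record_step</monospace>, the action labels, and the sample explanation are all hypothetical.</p>
        <preformat>
```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Annotation:
    """One (state, action, explanation) record; all field names are hypothetical."""
    state_id: int      # index of the game state in which the action was taken
    action: str        # illustrative labels such as "up" or "left"
    explanation: str   # transcribed (and possibly player-edited) text


def record_step(log: List[Annotation], state_id: int, action: str,
                transcript: Optional[str],
                same_as_previous: bool = False) -> Annotation:
    """Append one annotated step, reusing the previous explanation if requested."""
    if same_as_previous:
        if not log:
            raise ValueError("no previous explanation to reuse")
        text = log[-1].explanation
    else:
        if not transcript or not transcript.strip():
            raise ValueError("an explanation is required for a new action")
        text = transcript.strip()
    entry = Annotation(state_id, action, text)
    log.append(entry)
    return entry


# Illustrative usage: the second move repeats the first explanation.
log: List[Annotation] = []
record_step(log, 0, "up", "  The road ahead is clear so I move forward. ")
record_step(log, 1, "up", None, same_as_previous=True)
```
        </preformat>
        <p>Storing the reused text on every record, rather than a pointer to the earlier record, keeps each training pair self-contained.</p>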
        <p>During transcribed explanation review, users are given
one final opportunity to review and edit the explanations
given during gameplay (see Figure 3). Players can step
through all of the actions they performed in the game and
see their accompanying transcribed explanations so they can
see the game context in which their explanations were given.</p>
        <p>The interface is designed so that no manual
hand-authoring or editing of the data is required before
feeding it into our machine learning model. Throughout
the game, players are given the opportunity to
edit their own data without impeding their workflow. This
added layer of low-friction editing was crucial in ensuring
that we could directly input the collected data into the network
with no manual cleaning.</p>
        <p>One core strength that facilitates transferability is that our
pipeline is environment and domain agnostic. While we use
Frogger as a test environment in our experiments, a similar
user experience can be designed using other turn-based
environments with minimal effort.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Neural Translation Model</title>
        <p>
          We train an encoder-decoder network
to generate relevant natural language explanations for any
given action. These kinds of networks are commonly used
for machine translation tasks or dialogue generation, but
their ability to capture sequential dependencies between
the input and the output makes them suitable for our task. Our
encoder-decoder architecture is similar to that used in [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
The network learns how to translate the input game state
representation <inline-formula><tex-math>X = x_1, x_2, \ldots, x_n</tex-math></inline-formula>, composed of the sprite
representation of the game combined with other influencing
factors, into an output explanation as a sequence of words
<inline-formula><tex-math>Y = y_1, y_2, \ldots, y_m</tex-math></inline-formula>, where <inline-formula><tex-math>y_i</tex-math></inline-formula> is a word. The input <inline-formula><tex-math>X</tex-math></inline-formula> has
a fixed size of 261 tokens encompassing the game state
representation, the number of lives left, and the location of the frog. The
vocabulary sizes for the encoder and the decoder are 491 and
1104, respectively. Thus our network learns to translate game
state and action information into natural language rationales.
        </p>
        <p>
          The encoder and decoder are both recurrent neural
networks (RNNs) composed of GRU cells. The decoder
network uses an additional attention mechanism [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] to
learn to weight the importance of different components of
the input with regard to their effect on the output.
        </p>
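        <p>To illustrate what weighting the input components means, the following pure-Python sketch computes attention weights and a context vector for a single decoding step. Dot-product scoring is an assumption made here for brevity; the text does not specify which scoring function the network uses. The function names and toy vectors are illustrative only.</p>
        <preformat>
```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state: Vector,
           encoder_states: List[Vector]) -> Tuple[List[float], Vector]:
    """Return attention weights and the weighted sum of encoder states.

    Dot-product scoring is an assumption made for this sketch.
    """
    scores = [sum(d * h for d, h in zip(decoder_state, state))
              for state in encoder_states]
    weights = softmax(scores)
    size = len(encoder_states[0])
    context = [sum(w * state[i] for w, state in zip(weights, encoder_states))
               for i in range(size)]
    return weights, context


# Three toy encoder states (one per input token) and one decoder state.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend([2.0, 0.0], encoder_states)
```
        </preformat>
        <p>In this toy case the first and third input tokens align with the decoder state and therefore receive larger weights than the second.</p>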
        <p>To simplify the learning process, the state of the game
environment is converted into a sequence of symbols where
each symbol represents a type of sprite. To this, we append
information concerning Frogger’s position, the most recent
action taken, and the number of lives the player has left to
create the input representation X. On top of this network
structure, we vary the input configuration with the intention
of producing varying styles of rationales. The two
configurations, called the focused-view configuration and
the complete-view configuration, are used throughout the
experiments presented in this paper.</p>
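        <p>A minimal sketch of how such an input sequence might be assembled is given below. The token names, the ordering of the appended fields, and the sprite symbols are assumptions for illustration; the text states only which pieces of information are included.</p>
        <preformat>
```python
from typing import List, Sequence, Tuple


def encode_state(grid: Sequence[Sequence[str]], frog_pos: Tuple[int, int],
                 action: str, lives: int) -> List[str]:
    """Flatten the grid row by row, then append frog position, action, and lives."""
    tokens = [symbol for row in grid for symbol in row]
    row, col = frog_pos
    tokens += [f"row_{row}", f"col_{col}", f"act_{action}", f"lives_{lives}"]
    return tokens


# A toy 3 x 3 board; "C" = car, "F" = frog, "." = empty (hypothetical symbols).
grid = [
    [".", "C", "."],
    [".", ".", "."],
    [".", "F", "."],
]
x = encode_state(grid, frog_pos=(2, 1), action="up", lives=3)
```
        </preformat>
        <p>Because the board dimensions are fixed, every encoded state has the same length, which is consistent with a fixed-size input.</p>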
        <p><bold>Focused-view Configuration.</bold> In this configuration we
used a windowed representation of the grid, i.e., only a
7 × 7 window around the frog was used in the input. Both
playing an optimal game of Frogger and generating relevant
explanations based on the current action typically
require only this much local context. Therefore, providing the
agent with only the window around Frogger helps the agent
produce explanations grounded in its neighborhood. In this
configuration we prioritized rationales focused on short-term
awareness over long-term planning.</p>
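        <p>The windowing step can be sketched as below. How cells beyond the edge of the board are handled is not stated in the text; padding them with a dedicated symbol is an assumption of this sketch, as are the names <monospace>focused_view</monospace> and <monospace>PAD</monospace>.</p>
        <preformat>
```python
from typing import List, Sequence, Tuple

PAD = "#"  # hypothetical symbol for cells that fall outside the board


def focused_view(grid: Sequence[Sequence[str]], frog_pos: Tuple[int, int],
                 size: int = 7) -> List[List[str]]:
    """Return the size x size window centered on the frog, padded at the edges."""
    half = size // 2
    frog_row, frog_col = frog_pos
    rows, cols = len(grid), len(grid[0])
    window = []
    for r in range(frog_row - half, frog_row + half + 1):
        line = []
        for c in range(frog_col - half, frog_col + half + 1):
            inside = r in range(rows) and c in range(cols)
            line.append(grid[r][c] if inside else PAD)
        window.append(line)
    return window


# Toy 9 x 9 board with the frog in the bottom-left corner and one car nearby.
board = [["."] * 9 for _ in range(9)]
board[8][0] = "F"
board[6][2] = "C"
view = focused_view(board, frog_pos=(8, 0))
```
        </preformat>
        <p>The frog always sits at the center of the returned window, so the network sees its surroundings in a position-independent frame.</p>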
        <p><bold>Complete-view Configuration.</bold> The complete-view
configuration is an alternate setup that provides the entire
game board as context for the rationale generation. There
are two differences between this configuration and the
focused-view configuration. First, instead of showing the
network only a window of the game, we use the entire
game screen as a part of the input. The agent now has the
opportunity to learn which other long-term factors in the
game may influence its rationale. Second, we added noise
to each game state to force the network to generalize when
learning to generate rationales and give the model equal
opportunity to consider factors from all sectors of the game
screen. In this case noise was introduced by replacing input
grid values with dummy values. For each grid element,
there was a 20% chance that it would get replaced with a
dummy value.</p>
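        <p>The noise step can be sketched as follows, assuming that each cell is replaced independently and that a single placeholder symbol serves as the dummy value; the names <monospace>add_noise</monospace> and <monospace>DUMMY</monospace> are hypothetical.</p>
        <preformat>
```python
import random
from typing import List, Optional, Sequence

DUMMY = "?"  # hypothetical placeholder symbol


def add_noise(grid: Sequence[Sequence[str]], rate: float = 0.2,
              rng: Optional[random.Random] = None) -> List[List[str]]:
    """Return a copy of the grid with each cell replaced by DUMMY with probability rate."""
    if rng is None:
        rng = random.Random()
    return [[DUMMY if rate > rng.random() else cell for cell in row]
            for row in grid]


# Toy 20 x 20 empty board; a seeded generator keeps the example repeatable.
board = [["."] * 20 for _ in range(20)]
noisy = add_noise(board, rate=0.2, rng=random.Random(0))
replaced = sum(cell == DUMMY for row in noisy for cell in row)
```
        </preformat>
        <p>The function returns a new grid and leaves the original untouched, so the same game state can be resampled with different noise.</p>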
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Human Perception of Rationales Study</title>
      <p>In this section, we attempt to assess whether the
rationales generated by our system outperform baselines.
We further attempt to understand the underlying components
that influence the difference in the perceptions of the
generated rationales along four dimensions of human
factors: confidence, human-likeness, adequate justification,
and understandability. Frogger is a good candidate for our
experimental design of a rationale generation pipeline for
general sequential decision-making tasks because it is a
simple Markovian environment; that is, the reasons for each
action can be easily separated, making it an ideal stepping
stone towards a real-world environment.</p>
      <p>
        To gather the training set of game state annotations, we
deployed our data collection pipeline on TurkPrime [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. From 60 participants we collected over 2000
samples of human explanations, each paired with an image of
the game at the time the explanation was made. This parallel
corpus of game state images and natural language rationales
was used to train the encoder-decoder rationale generation
network. Each RNN in the encoder and the decoder used
GRU cells with a hidden vector size of 256. The entire
encoder-decoder network was trained for 100 epochs.
      </p>
      <p>
        We recruited an additional 128 participants through TurkPrime [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and split them into two experimental groups:
Group 1 (age range = 23-68, M = 37.4, SD = 9.92) and Group
2 (age range = 24-59, M = 35.8, SD = 7.67). Forty-six percent
of our participants were women, and only two countries, the
United States and India, were reported when participants
were asked which country they were from. Of
all 128 participants, 93% reported that they resided in the United
States.
      </p>
      <sec id="sec-4-1">
        <title>Procedure</title>
        <p>Participants watched a series of five videos, each containing
an action taken by an agent playing Frogger. In each video,
the action was accompanied by three rationales generated by
three different techniques (see Figure 5):
<list list-type="bullet">
<list-item><p>The exemplary rationale is the rationale from our corpus
that three researchers unanimously agreed on as the best
one for a particular action. Researchers independently
selected the rationales they deemed best and iterated until
consensus was reached. This is provided as an upper bound for
contrast with the next two techniques.</p></list-item>
<list-item><p>The candidate rationale is the rationale produced by
our network, in either the focused-view or the complete-view
configuration.</p></list-item>
<list-item><p>The random rationale is a randomly chosen rationale
from our corpus.</p></list-item>
</list></p>
        <p>For each rationale, participants used a 5-point Likert scale to
rate their endorsement of each of the following four statements,
which correspond to four dimensions of interest:
<list list-type="simple">
<list-item><p>D1. Confidence: This rationale makes me confident in the
character’s ability to perform its task.</p></list-item>
<list-item><p>D2. Human-likeness: This rationale looks like it was made
by a human.</p></list-item>
<list-item><p>D3. Adequate justification: This rationale adequately
justifies the action taken.</p></list-item>
<list-item><p>D4. Understandability: This rationale helped me
understand why the agent behaved as it did.</p></list-item>
</list></p>
        <p>Response options on the Likert scale ranged from “strongly
disagree” to “strongly agree.” In a free-text field, participants
explained why the ratings they gave for a particular set
of three rationales were similar or different. After answering
these questions, they provided demographic information.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Quantitative Results and Analysis</title>
        <p>We used a multi-level model to analyze both
between-subjects and within-subjects variables.
There were significant main effects of rationale
style (<inline-formula><tex-math>\chi^2(2) = 594.80, p &lt; .001</tex-math></inline-formula>) and dimension
(<inline-formula><tex-math>\chi^2(2) = 66.86, p &lt; .001</tex-math></inline-formula>) on the ratings. The
main effect of experimental group was not significant
(<inline-formula><tex-math>\chi^2(1) = 0.070, p = 0.79</tex-math></inline-formula>). Figure 5 shows the
average responses to each question for the two different
experimental groups. Our results support our hypothesis
that rationales generated with the focused-view generator
and the complete-view generator were judged significantly
better across all dimensions than the random baseline
(<inline-formula><tex-math>b = 1.90, t(252) = 8.09, p &lt; .001</tex-math></inline-formula>). Our results also show
that rationales generated by the candidate techniques were
judged significantly lower than the exemplary rationale.</p>
        <p>The difference between the focused-view candidate
rationales and exemplary rationales was significantly
greater than the difference between complete-view
candidate rationales and exemplary rationales (<inline-formula><tex-math>p = .005</tex-math></inline-formula>).
Surprisingly, this was because the exemplary rationales
were rated lower in the presence of complete-view
candidate rationales (<inline-formula><tex-math>t(1530) = 32.12, p &lt; .001</tex-math></inline-formula>).
Since three rationales were presented simultaneously in
each video, it is likely that participants were rating the
rationales relative to each other. We also observe that
the complete-view candidate rationales received overall
higher ratings than the focused-view candidate rationales
(<inline-formula><tex-math>t(1530) = 8.33, p &lt; .001</tex-math></inline-formula>).</p>
        <p>In summary, we established that both the focused-view
and complete-view configurations produce believable
rationales that perform significantly better than the random
baseline along four human factors dimensions. While the
complete-view candidate rationales were judged to be
preferable overall to focused-view candidate rationales, we
did not compare them directly to each other because,
stylistically, one technique may be better suited than the other depending on the
task and/or game. The results of our between-subjects study
are suggestive but cannot be used to prove any claims
about differences between the two experimental conditions.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Qualitative Analysis</title>
        <p>[Figure panels: (a) Focus-View condition. (b) Complete-View condition.]</p>
        <p>In this section, we look at the open-ended responses
provided by our participants to better understand the
criteria that participants used when making judgments about
the confidence, human-likeness, adequate justification, and
understandability of generated rationales. These situated
insights augment our understanding of rationale generating
systems, enabling us to design better ones in the future.</p>
        <p>
          We analyzed the open-ended justifications participants
provided using a combination of thematic analysis [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] and
grounded theory [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. We developed codes that addressed
different types of reasoning behind the ratings of the four
dimensions under investigation. Next, the research team
clustered the codes under emergent themes, which form
the underlying components of the dimensions. Iterating
until consensus was reached, researchers settled on the
five most relevant components: (1) Contextual Accuracy,
(2) Intelligibility, (3) Awareness, (4) Relatability, and
(5) Strategic Detail (see Table 1). To varying degrees,
multiple components influence more than one dimension;
that is, there is no mutually exclusive one-to-one
relationship between components and dimensions.
        </p>
        <p>The remainder of this section will share our conclusions
about how these components influence the dimensions of
the human factors under investigation. When providing
examples of our participants’ responses, we will refer
to them using the following notation: P1 corresponds to
participant 1, P2 corresponds to participant 2, etc.</p>
        <p><bold>Confidence (D1).</bold> This dimension gauges the participant’s
faith in the agent’s ability to successfully complete its
task and has contextual accuracy, awareness, strategic
detail, and intelligibility as relevant components. With
respect to contextual accuracy, rationales that displayed
“. . . recognition of the environmental conditions and
[adaptation] to the conditions” (P22) were a positive
influence on confidence ratings, while redundant
information such as “just stating the obvious” (P42)
hindered confidence ratings.</p>
        <p>Rationales that showed awareness “. . . of upcoming
dangers and what the best moves to make . . . [and] a
good way to plan” (P17) inspired confidence from the
participants. In terms of strategic detail, rationales that
showed “. . . long-term planning and ability to analyze
information” (P28) yielded higher confidence ratings,
whereas those that were “. . . short-sighted and unable to
think ahead” (P14) led to lower perceptions of confidence.</p>
        <p>Intelligibility alone, without awareness or strategic detail,
was not enough to yield high confidence in rationales.
However, rationales that were unintelligible
or incoherent had a negative impact on participants’
confidence:</p>
        <disp-quote><p>The [random and focused-view rationales] include
major mischaracterizations of the environment by
referring to an object not present or wrong time
sequence, so I had very low confidence. (P66)</p></disp-quote>
        <p><bold>Human-likeness (D2).</bold> Intelligibility, relatability, and
strategic detail are components that influenced participants’
perception of the extent to which the rationales were made
by a human. Notably, intelligibility had mixed influences
on the human-likeness of the rationales depending on
what participants thought “being human” entailed. Some
perceived humans to be fallible and rated rationales with
errors more human-like because rationales “. . . with typos
or spelling errors . . . seem even more likely to have been
generated by a human” (P19). Conversely, some thought
error-free rationales must come from a human, citing that a
“computer just does not have the knowledge to understand
what is going on” (P24).</p>
        <p>With respect to relatability, rationales were often
perceived as more human-like when participants felt that “it
mirrored [their] thoughts” (P49), and “. . . [laid] things out
in a way that [they] would have” (P58). Affective rationales
had high relatability because they “express human emotions
including hope and doubt” (P11).</p>
        <p>Like intelligibility, strategic detail had a mixed impact on
human-likeness, as it also depended on participants’
perception of critical thinking and logical planning. Some
participants associated “. . . critical thinking [and ability to]
predict future situations” (P6) with human-likeness, whereas
others associated logical planning with rigid, algorithmic,
computer-like thinking rather than human-like thinking.</p>
      </sec>
      <sec id="sec-4-4">
        <title>Adequate Justification (D3)</title>
        <p>This dimension unpacks the extent to which participants think the rationale
adequately justifies the action taken and is influenced
by contextual accuracy and awareness. Participants
downgraded rationales containing low levels of contextual
accuracy such as irrelevant details. As P11 puts it:</p>
        <disp-quote><p>The [random and exemplary rationales] don’t pertain to
this situation. [The Complete View] does, and is clearly
the best justification for the action that Frogger took
because it moves him towards his end goal.</p></disp-quote>
        <p>Beyond contextual accuracy, rationales that showcase
awareness of surroundings rate high on the adequate
justification dimension. For instance, P11 rated the random
rationale low because it showed “no awareness of the
surroundings”. For the same action, P11 gave high ratings
for the exemplary and focused-view rationales because each
made the participant “. . . believe in the character’s ability to
judge their surroundings.”</p>
        <p><bold>Understandability (D4).</bold> For this dimension, components
such as contextual accuracy and relatability influence
participants’ perceptions of how much the rationales
helped them understand the motivation behind the agent’s
actions. Contextually accurate rationales were found to
have a strong influence on understandability. In fact, many
participants expressed that contextual accuracy, not the length of
the rationale, mattered when it came to understandability.
While comparing the exemplary and focused-view rationales
for understandability, P41 made a notable observation:</p>
        <disp-quote><p>The [exemplary and focused-view rationale] both
described the activities/objects in the immediate
vicinity of the frog. However, [exemplary] was not
as strong as [focused-view] given the frog did not
have to move just because of the car in front of
him. [Focused-view] does a better job of providing
understanding of the action</p></disp-quote>
        <p>Participants put themselves in the agent’s shoes and
evaluated the understandability of the rationales based on
how relatable they were. In essence, some asked “Are these
the same reasons I would [give] for this action?” (P43). The
more relatable the rationale was, the higher it scored for
understandability.</p>
        <p>Understanding these components and dimensions can
help us design better autonomous agents from a human
factors perspective. These insights can also enable tweaking
and reverse-engineering of the network configuration to
maximize the likelihood of producing rationale styles that
meet the needs of the task, game, or agent persona.</p>
        <p>For instance, given the nature of the inputs, choosing a
network configuration similar to the focused-view can afford
the generation of contextually accurate rationales. On the
other hand, the complete-view network configuration can
produce rationales with a higher degree of strategic detail
that can be beneficial in contexts where detail is important,
such as an explainable oracle. Moreover, an in-game tutorial
or a companion agent can be designed using a network
configuration that generates relatable outputs to keep the
player entertained and engaged.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Future Work</title>
      <p>We can extend our current work to other domains of
Explainable AI, exploring applications for other sequential
decision-making tasks. We also plan to deploy our rationale
generator with a collaborative NPC in an interactive
game to investigate how the perception of a collaborative
agent changes when players interact longitudinally (over an
extended period of time). This longitudinal approach can
help us understand novelty effects of rationale generating
agents. Besides NPCs, our techniques can improve teaching
and collaboration in games, especially around improvisation
and co-creative collaboration in game-level designs.</p>
      <p>Our data collection pipeline is currently designed to work
with discrete-action games that have natural break points
where the player can be asked for explanations, which makes
data collection less disruptive than it would be in continuous-time,
continuous-action games. The
next challenge is to extend and test our approach with more
continuous spaces where states aren’t as well defined and
rationales are harder to capture from moment to moment.</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusions</title>
      <p>In this paper, we explore how human justifications for
their actions in a video game can be used to train
a system to generate explanations for the actions of
autonomous game-playing agents. We introduce a pipeline
for automatically gathering a parallel corpus of game states
annotated with human explanations and show how this
corpus can be used to train encoder-decoder networks. The
resultant model thus translates the state of the game and
the action performed by the agent into natural language,
which we call a rationale. The rationales generated by our
technique are judged better than those of a random baseline
and close to matching the upper bound of human rationales.
By enabling autonomous agents to communicate about the
motivations for their actions, we hope to provide users with
greater confidence in the agents while increasing perceptions
of understanding and relatability.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Ashraf</given-names>
            <surname>Abdul</surname>
          </string-name>
          et al. “
          <article-title>Trends and trajectories for explainable, accountable and intelligible systems: An HCI research agenda”</article-title>
          .
          <source>In: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. ACM</source>
          .
          <year>2018</year>
          , p.
          <fpage>582</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Aswin Thomas</given-names>
            <surname>Abraham</surname>
          </string-name>
          and
          <string-name>
            <given-names>Kevin</given-names>
            <surname>McGee</surname>
          </string-name>
          . “
          <article-title>AI for dynamic team-mate adaptation in games”</article-title>
          .
          <source>In: Computational Intelligence and Games (CIG), 2010 IEEE Symposium on. IEEE</source>
          .
          <year>2010</year>
          , pp.
          <fpage>419</fpage>
          -
          <lpage>426</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Maria-Virginia</given-names>
            <surname>Aponte</surname>
          </string-name>
          , Guillaume Levieux, and Stéphane Natkin.
          <article-title>“Scaling the level of difficulty in single player video games”</article-title>
          .
          <source>In: International Conference on Entertainment Computing</source>
          . Springer.
          <year>2009</year>
          , pp.
          <fpage>24</fpage>
          -
          <lpage>35</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J</given-names>
            <surname>Aronson</surname>
          </string-name>
          .
          <article-title>“A pragmatic view of thematic analysis”</article-title>
          .
          <source>In: The Qualitative Report 2.1</source>
          (Spring
          <year>1994</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Reuben</given-names>
            <surname>Binns</surname>
          </string-name>
          et al. “
          <article-title>'It's Reducing a Human Being to a Percentage': Perceptions of Justice in Algorithmic Decisions”</article-title>
          .
          <source>In: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. ACM</source>
          .
          <year>2018</year>
          , p.
          <fpage>377</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Dennis M.</given-names>
            <surname>Buede</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Paul J.</given-names>
            <surname>Sticha</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Elise T.</given-names>
            <surname>Axelrad</surname>
          </string-name>
          . “
          <article-title>Conversational Non-Player Characters for Virtual Training”</article-title>
          . In: Social, Cultural, and Behavioral Modeling. Ed. by Kevin S. Xu et al. Cham: Springer International Publishing,
          <year>2016</year>
          , pp.
          <fpage>389</fpage>
          -
          <lpage>399</lpage>
          . ISBN: 978-3-319-39931-7.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Silvia</given-names>
            <surname>Coradeschi</surname>
          </string-name>
          and
          <string-name>
            <given-names>Lars</given-names>
            <surname>Karlsson</surname>
          </string-name>
          . “
          <article-title>A role-based decision-mechanism for teams of reactive and coordinating agents”</article-title>
          . In: Robot Soccer World Cup. Springer.
          <year>1997</year>
          , pp.
          <fpage>112</fpage>
          -
          <lpage>122</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Upol</given-names>
            <surname>Ehsan</surname>
          </string-name>
          et al. “
          <article-title>Rationalization: A Neural Machine Translation Approach to Generating Natural Language Explanations”</article-title>
          .
          <source>In: Proceedings of the AAAI Conference on Artificial Intelligence, Ethics, and Society</source>
          . Feb.
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Josua</given-names>
            <surname>Krause</surname>
          </string-name>
          , Adam Perer, and Kenney Ng. “
          <article-title>Interacting with predictions: Visual inspection of black-box machine learning models”</article-title>
          .
          <source>In: Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems. ACM</source>
          .
          <year>2016</year>
          , pp.
          <fpage>5686</fpage>
          -
          <lpage>5697</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Leib</given-names>
            <surname>Litman</surname>
          </string-name>
          , Jonathan Robinson, and Tzvi Abberbock.
          <article-title>“TurkPrime.com: A versatile crowdsourcing data acquisition platform for the behavioral sciences”</article-title>
          .
          <source>In: Behavior research methods 49.2</source>
          (
          <year>2017</year>
          ), pp.
          <fpage>433</fpage>
          -
          <lpage>442</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Minh-Thang</given-names>
            <surname>Luong</surname>
          </string-name>
          , Hieu Pham, and
          <string-name>
            <given-names>Christopher D</given-names>
            <surname>Manning</surname>
          </string-name>
          . “
          <article-title>Effective approaches to attention-based neural machine translation”</article-title>
          .
          <source>In: arXiv preprint arXiv:1508.04025</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Grant</given-names>
            <surname>Pickett</surname>
          </string-name>
          , Foaad Khosmood, and Allan Fowler. “
          <article-title>Automated generation of conversational non player characters”</article-title>
          .
          <source>In: Eleventh Artificial Intelligence and Interactive Digital Entertainment Conference</source>
          . Vol.
          <volume>362</volume>
          .
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Marco Tulio</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Sameer</given-names>
            <surname>Singh</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Carlos</given-names>
            <surname>Guestrin</surname>
          </string-name>
          . “
          <article-title>Why should I trust you?: Explaining the predictions of any classifier”</article-title>
          .
          <source>In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. ACM</source>
          .
          <year>2016</year>
          , pp.
          <fpage>1135</fpage>
          -
          <lpage>1144</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Anselm</given-names>
            <surname>Strauss</surname>
          </string-name>
          and
          <string-name>
            <given-names>Juliet</given-names>
            <surname>Corbin</surname>
          </string-name>
          . “
          <article-title>Grounded theory methodology”</article-title>
          .
          <source>In: Handbook of qualitative research 17</source>
          (
          <year>1994</year>
          ), pp.
          <fpage>273</fpage>
          -
          <lpage>85</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Chek Tien</given-names>
            <surname>Tan</surname>
          </string-name>
          and Ho-lun Cheng. “
          <article-title>Personality-based Adaptation for Teamwork in Game Agents</article-title>
          .” In: AIIDE.
          <year>2007</year>
          , pp.
          <fpage>37</fpage>
          -
          <lpage>42</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Sang-Won</given-names>
            <surname>Um</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Tae-Yong</given-names>
            <surname>Kim</surname>
          </string-name>
          , and Jong-Soo Choi. “
          <article-title>Dynamic difficulty controlling game system”</article-title>
          .
          <source>In: IEEE Transactions on Consumer Electronics 53.2</source>
          (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Jason</given-names>
            <surname>Yosinski</surname>
          </string-name>
          et al. “
          <article-title>Understanding neural networks through deep visualization”</article-title>
          .
          <source>In: arXiv preprint arXiv:1506.06579</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>