CEUR Workshop Proceedings Vol-2282, paper EXAG_122: Learning to Generate Natural Language Rationales for Game Playing Agents. Upol Ehsan, Pradyumna Tambwekar, Larry Chan, Brent Harrison, Mark O. Riedl. https://ceur-ws.org/Vol-2282/EXAG_122.pdf (dblp: https://dblp.org/rec/conf/aiide/EhsanTCHR18)
    Learning to Generate Natural Language Rationales for Game Playing Agents

                              Upol Ehsan∗‡‡ , Pradyumna Tambwekar∗† , Larry Chan† ,
                                      Brent Harrison‡ , and Mark O. Riedl†
                                          ‡‡
                                             Department of Information Science, Cornell University
                                  †
                                      School of Interactive Computing, Georgia Institute of Technology
                                        ‡
                                          Department of Computer Science, University of Kentucky


   ∗ Denotes equal contribution.


                           Abstract

  Many computer games feature non-player character (NPC) teammates and
  companions; however, playing with or against NPCs can be frustrating
  when they perform unexpectedly. These frustrations can be avoided if
  the NPC has the ability to explain its actions and motivations. When
  NPC behavior is controlled by a black-box AI system, it can be hard
  to generate the necessary explanations. In this paper, we present a
  system that generates human-like, natural language
  explanations—called rationales—of an agent's actions in a game
  environment, regardless of how the decisions are made by a black-box
  AI. We outline a robust data collection and neural network training
  pipeline that can be used to gather think-aloud data and train a
  rationale generation model for any similar sequential, turn-based
  decision-making task. A human-subject study shows that our technique
  produces believable rationales for an agent playing the game
  Frogger. We conclude with insights about how people perceive
  automatically generated rationales.


                         Introduction

Non-player characters (NPCs) are interactive, autonomous agents that
play critical roles in most modern video games, and are often seen as
one crucial component of an engaging player experience. As NPCs are
given more autonomy to make decisions, the likelihood that they
perform in an unexpected manner increases. These situations risk
interrupting a player's engagement in the game world as they attempt
to justify the reasoning behind the unexpected NPC behavior. One
method to address this side effect of increased autonomy is to
construct NPCs that have the ability to explain their own actions and
motivations for acting.
   The generation of natural language explanations for autonomous
agents is challenging when the agent is a black-box AI, meaning that
one doesn't have access to the agent's decision-making process. Even
if access were possible, the mapping between inputs and decisions
could be difficult for people to interpret. Work by Ehsan et al. [8]
showed that machine learning models can be trained to provide relevant
and satisfactory rationales for their actions using examples of human
behavior and human-provided explanations. This is a potentially
powerful tool that could be used to create NPCs that can provide
human-understandable explanations for their own actions, without
changing the underlying decision-making algorithms. This in turn could
give users more confidence in NPCs and game playing agents and make
NPCs and agents more understandable and relatable.
   In the work by Ehsan et al., however, the rationale generation
model was trained on a semi-synthetic dataset, produced by developing
a grammar that could generate variations of actual human explanations.
While their results were promising, creating the grammar necessary to
construct the requisite training examples is a costly endeavor in
terms of authorial effort. We build on this work by developing a
pipeline to automatically acquire a corpus of human explanations that
can be used to train a rationale generation model to explain the
actions of NPCs and game playing agents. In this paper, we describe
our automated explanation corpus collection technique and neural
rationale generation model, and present the results of a
human-subjects study of human perceptions of generated rationales in
the game Frogger.


                         Related Work

Adaptive teammate/adversary cooperation in games has often been
explored through the lens of decision making [2]. Researchers have
looked to incorporate adaptive difficulty in games (cf. [3, 16]) as
well as to build NPCs which evolve by learning a player's profile, as
ways to improve the player's experience [7, 15]. What is missing from
this analysis is the conversational engagement that comes with
collaborating with another human player.
   NPCs that can communicate in natural language have previously been
explored using classical machine learning techniques. These methods
often undertake a rule-based or probabilistic modeling approach. Buede
et al. combine natural language processing with dynamic probabilistic
models to maximize rapport between two conversing agents [6]. Prior
work has also shown the capacity to use a rule-based system to create
a conversational character generator [12]. Both of these methods,
however, involve a high degree of hand-authoring. Our work can
generate NPCs with similar communicative capabilities with minimal
hand-authoring.
Figure 1: End-to-end pipeline for training a system that can generate
explanations.


   Explainable AI has attracted interest from researchers across
various domains. The authors of [1] conduct a comprehensive survey on
burgeoning trends in explainable and intelligible systems research.
Certain intelligible systems researchers look to use model-agnostic
methods to add transparency to the latent technology [13, 17]. Other
researchers use visual representations to interpret the
decision-making process of a machine learning system [9]. We situate
our system as an agent that unpacks the thought process of a human
player, if they were to play the game.
   Evaluation of explainable AI systems can be difficult because the
appropriateness of an explanation is subjective. One approach to
evaluating such systems was proposed in [5]. They presented
participants with different fictionalized explanations for the same
decision and measured perceived levels of justice among their
participants. We adopt a similar procedure to measure the quality of
generated rationales versus alternate baseline rationales.


               Learning to Generate Rationales

We define a rationale as an explanation that justifies an action based
on how a human would think. These rationales do not reveal the true
decision-making process of an agent, but still provide insights about
why an agent made a decision in a form that is easy for non-experts to
understand.
   Rationale generation requires translating events in the game
environment into natural language outputs. Our approach to rationale
generation involves two steps: (1) collect a corpus of think-aloud
data from players who explained their actions in a game environment;
and (2) use this corpus to train an encoder-decoder network to
generate plausible rationales for any action taken by an agent (see
Figure 1).

Data Collection Interface

There is no readily available dataset for the task of learning to
generate explanations. Thus, we developed a methodology to collect
live "think-aloud" data from players as they played through a game.
This section covers the two objectives of our data collection
endeavor:

1. Create a think-aloud protocol in which players provide natural
   rationales for their actions.

2. Design an intuitive player experience that facilitates accurate
   matching of the participants' utterances to the appropriate state
   in the environment.

   To train an agent to generate rationales, we need data linking game
states and actions to their corresponding natural language
explanations. To achieve this goal, we built a modified version of
Frogger in which players simultaneously play the game and explain each
of their actions. The entire process is divided into three phases:
(1) a guided tutorial, (2) rationale collection, and (3) transcribed
explanation review.
   During the guided tutorial, our interface provides instruction on
how to play through the game, how to provide natural language
explanations, and how to review/modify any explanations they have
given. This helps ensure that users are familiar with the interface
and its use before they begin providing explanations.
   During explanation collection, users play through the game while
explaining their actions out loud. Figure 2 shows the game embedded
into the explanation collection interface. To help couple explanations
with actions, the game pauses for 10 seconds after an action is taken.
During this time, the player's microphone automatically turns on and
the player is asked to explain their most recent action while a
speech-to-text library transcribes the explanation.

Figure 2: Players take an action and verbalize their rationale for
that action. (1) After taking each action, the game pauses for 10
seconds. (2) Speech-to-text transcribes the participant's rationale
for the action. (3) Participants can view their transcribed rationales
in near real time and edit them, if needed.

   Participants can view their transcribed text and edit it if
necessary. During preliminary testing, we observed that players often
repeat a move for which the explanation is the same. For ease,
participants can indicate that the explanation
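The collection loop described above (pause after each action, open the microphone, transcribe, and allow a "same as my last move" shortcut) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class and method names are invented, and the speech-to-text backend is passed in as a stubbed callback.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class RationaleRecord:
    state: str       # serialized game state at the moment of the action
    action: str      # the move the player just made
    rationale: str   # transcribed think-aloud explanation

@dataclass
class ThinkAloudCollector:
    transcribe: Callable[[], str]   # speech-to-text backend (stubbed here)
    pause_seconds: float = 10.0     # game pause after each action
    records: List[RationaleRecord] = field(default_factory=list)

    def on_action(self, state: str, action: str,
                  reuse_last: bool = False) -> RationaleRecord:
        """Pause the game, capture the spoken rationale, and log the pair."""
        time.sleep(self.pause_seconds)  # the game freezes while the mic is open
        if reuse_last and self.records:
            # player flagged "same explanation as my last move"
            text = self.records[-1].rationale
        else:
            text = self.transcribe()
        record = RationaleRecord(state, action, text)
        self.records.append(record)
        return record
```

With a stub such as `ThinkAloudCollector(transcribe=lambda: "hopping forward to cross the road", pause_seconds=0)`, the collector accumulates (state, action, rationale) triples ready to serve as a parallel training corpus.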
accompanying their most recent action is the same as that of the last
action performed.
   During transcribed explanation review, users are given one final
opportunity to review and edit the explanations given during gameplay
(see Figure 3). Players can step through all of the actions they
performed in the game and see their accompanying transcribed
explanations, so they can see the game context in which their
explanations were given.

Figure 3: Players can step through each of their action-rationale
pairs and edit if necessary. (1) Players can watch a replay of their
actions while editing their rationales. (2) Players use these buttons
to control the flow of their step-through. (3) The rationale for the
current action gets highlighted for review.

   The interface is designed so that no manual hand-authoring/editing
of our data was required before pushing it into our machine learning
model. Throughout the game, players were given the opportunity to
organically edit their own data without impeding their workflow. This
added layer of frictionless editing was crucial in ensuring that we
could directly input the collected data into the network with zero
manual cleaning.
   One core strength that facilitates transferability is that our
pipeline is environment and domain agnostic. While we use Frogger as a
test environment in our experiments, a similar user experience can be
designed using other turn-based environments with minimal effort.

Neural Translation Model

We use an encoder-decoder network to learn to generate relevant
natural language explanations for any given action. These kinds of
networks are commonly used for machine translation and dialogue
generation tasks, and their ability to capture sequential dependencies
between the input and the output makes them suitable for our task. Our
encoder-decoder architecture is similar to that used in [8]. The
network learns how to translate the input game state representation
X = x1, x2, ..., xn, comprised of the sprite representation of the
game combined with other influencing factors, into an output
explanation as a sequence of words Y = y1, y2, ..., ym, where yi is a
word. The input X has a fixed size of 261 tokens encompassing the game
state representation, lives left, and the location of the frog. The
vocabulary sizes for the encoder and the decoder are 491 and 1104,
respectively. Thus our network learns to translate game state and
action information into natural language rationales.
   The encoder and decoder are both recurrent neural networks (RNNs)
comprised of GRU cells. The decoder network uses an additional
attention mechanism [11] to learn to weight the importance of
different components of the input with regard to their effect on the
output.
   To simplify the learning process, the state of the game environment
is converted into a sequence of symbols where each symbol represents a
type of sprite. To this, we append information concerning Frogger's
position, the most recent action taken, and the number of lives the
player has left to create the input representation X. On top of this
network structure, we vary the input configurations with the intention
of producing varying styles of rationales. These two configurations
are titled the focused-view configuration and the complete-view
configuration and are used throughout the experiments presented in
this paper.

Focused-view Configuration In this configuration we used a windowed
representation of the grid, i.e., only a 7 × 7 window around the frog
was used in the input. Both playing an optimal game of Frogger and
generating relevant explanations for the current action typically
require only this much local context. Therefore, providing the agent
with only the window around Frogger helps the agent produce
explanations grounded in its neighborhood. In this configuration we
prioritized rationales focused on short-term awareness over long-term
planning.

Complete-view Configuration The complete-view configuration is an
alternate setup that provides the entire game board as context for
rationale generation. There are two differences between this
configuration and the focused-view configuration. First, instead of
showing the network only a window of the game, we use the entire game
screen as part of the input. The agent now has the opportunity to
learn which other long-term factors in the game may influence its
rationale. Second, we added noise to each game state to force the
network to generalize when learning to generate rationales and to give
the model equal opportunity to consider factors from all sectors of
the game screen. In this case, noise was introduced by replacing input
grid values with dummy values. For each grid element, there was a 20%
chance that it would be replaced with a dummy value.


             Human Perception of Rationales Study

In this section, we attempt to assess whether the rationales generated
by our system outperform baselines. We further attempt to understand
the underlying components that influence differences in the
perceptions of the generated rationales along four dimensions of human
factors: confidence, human-likeness, adequate justification, and
understandability. Frogger is a good candidate for our experimental
design of a rationale generation pipeline for general sequential
decision-making tasks because it is a simple Markovian environment;
that is, the reasons for each action can be easily separated, making
it an ideal stepping stone towards a real-world environment.
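As a concrete illustration of the two input configurations, the sketch below builds the symbol-sequence input from a sprite grid. The padding symbol, dummy token, and exact token layout are assumptions for illustration; the paper specifies only the 7 × 7 window, the appended position/action/lives information, and the 20% noise rate.

```python
import random

DUMMY = "?"   # dummy symbol used for noise injection (assumed token)
PAD = "#"     # out-of-bounds padding symbol (assumed token)

def focused_view(grid, frog_pos, window=7):
    """Crop a window x window neighborhood (7 x 7 in the paper) around the frog,
    padding with PAD where the window falls off the board."""
    r0, c0 = frog_pos
    half = window // 2
    return [[grid[r][c] if 0 <= r < len(grid) and 0 <= c < len(grid[0]) else PAD
             for c in range(c0 - half, c0 + half + 1)]
            for r in range(r0 - half, r0 + half + 1)]

def complete_view(grid, noise_rate=0.2, rng=random):
    """Full board with each cell independently replaced by a dummy symbol
    with probability noise_rate (20% in the paper)."""
    return [[DUMMY if rng.random() < noise_rate else cell for cell in row]
            for row in grid]

def encode_state(view, frog_pos, action, lives):
    """Flatten the sprite grid and append position, last action, and lives left,
    mirroring the fixed-length token sequence X fed to the encoder."""
    tokens = [cell for row in view for cell in row]
    tokens += [f"x={frog_pos[1]}", f"y={frog_pos[0]}",
               f"act={action}", f"lives={lives}"]
    return tokens
```

For a 7 × 7 focused view, `encode_state` produces 49 sprite tokens plus the appended metadata; the paper's full 261-token input would correspond to the complete game screen plus the same appended fields.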
Figure 4: Screenshot from the user study (setup 2) depicting the
action taken and the rationales: P = Random, Q = Exemplary,
R = Candidate.

   To gather the training set of game state annotations, we deployed
our data collection pipeline on TurkPrime [10]. From 60 participants
we collected over 2000 samples of human explanations corresponding to
images of the game at the moments the explanations were made. This
parallel corpus of collected game state images and natural language
rationales was used to train the encoder-decoder rationale generation
network. Each RNN in the encoder and the decoder was parameterized
with GRU cells with a hidden vector size of 256. The entire
encoder-decoder network was trained for 100 epochs.
   We recruited an additional 128 participants for our study through
TurkPrime [10], split into two experimental groups: Group 1 (age range
= 23-68, M = 37.4, SD = 9.92) and Group 2 (age range = 24-59, M =
35.8, SD = 7.67). Forty-six percent of our participants were women,
and only two countries, the United States and India, were reported
when participants were asked which country they were from; 93% of all
128 participants reported that they resided in the United States.

Procedure

Participants watched a series of five videos, each containing an
action taken by an agent playing Frogger. In each video, the action
was accompanied by three rationales generated by three different
techniques (see Figure 4):

• The exemplary rationale is the rationale from our corpus that 3
  researchers unanimously agreed on as the best one for a particular
  action. Researchers independently selected rationales they deemed
  best and iterated till consensus was reached. This is provided as an
  upper bound for contrast with the next two techniques.

• The candidate rationale is the rationale produced by our network, in
  either the focused-view or complete-view configuration.

• The random rationale is a randomly chosen rationale from our corpus.

For each rationale, participants used a 5-point Likert scale to rate
their endorsement of each of the following four statements, which
correspond to four dimensions of interest:

D1. Confidence: This rationale makes me confident in the character's
    ability to perform its task.

D2. Human-likeness: This rationale looks like it was made by a human.

D3. Adequate justification: This rationale adequately justifies the
    action taken.

D4. Understandability: This rationale helped me understand why the
    agent behaved as it did.

Response options on the Likert scale ranged from "strongly disagree"
to "strongly agree." In a free-text field, participants explained why
the ratings they gave for a particular set of three rationales were
similar or different. After answering these questions, they provided
demographic information.

Quantitative Results and Analysis

We used a multi-level model to analyze both between-subjects and
within-subjects variables. There were significant main effects of
rationale style (χ2(2) = 594.80, p < .001) and dimension (χ2(2) =
66.86, p < .001) on the ratings. The main effect of experimental group
was not significant (χ2(1) = 0.070, p = 0.79). Figure 5 shows the
average responses to each question for the two different experimental
groups. Our results support our hypothesis that rationales generated
with the focused-view generator and the complete-view generator were
judged significantly better across all dimensions than the random
baseline (b = 1.90, t(252) = 8.09, p < .001). Our results also show
that rationales generated by the candidate techniques were judged
significantly lower than the exemplary rationales.
   The difference between the focused-view candidate rationales and
the exemplary rationales was significantly greater than the difference
between the complete-view candidate rationales and the exemplary
rationales (p = .005). Surprisingly, this was because the exemplary
rationales were rated lower in the presence of complete-view candidate
rationales (t(1530) = −32.12, p < .001). Since three rationales were
presented simultaneously in each video, it is likely that participants
were rating the rationales relative to each other. We also observe
that the complete-view candidate rationales received overall higher
ratings than the focused-view candidate rationales (t(1530) = 8.33,
p < .001).
   In summary, we established that both the focused-view and
complete-view configurations produce believable rationales that
perform significantly better than the random baseline along four
human-factors dimensions. While the complete-view candidate rationales
were judged preferable overall to the focused-view candidate
rationales, we did not compare them directly to each other because one
style may be better suited than the other depending on the task and/or
game. Our between-subjects study methodology is suggestive but cannot
be used to prove any claims comparing the two experimental conditions.

Qualitative Analysis

In this section, we look at the open-ended responses
provided by our participants to better understand the
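For readers who want to reproduce this style of analysis, the toy helpers below show the likelihood-ratio comparison behind χ² statistics of the kind reported above, together with a simple per-style/per-dimension Likert average. The log-likelihood values in the usage note are made up for illustration; a real analysis would fit the nested multi-level models with a statistics package rather than hand-computed values.

```python
from collections import defaultdict

def mean_ratings(responses):
    """Average 5-point Likert ratings grouped by (rationale style, dimension).

    responses: iterable of (style, dimension, rating) tuples."""
    sums = defaultdict(lambda: [0.0, 0])
    for style, dim, rating in responses:
        cell = sums[(style, dim)]
        cell[0] += rating
        cell[1] += 1
    return {key: total / n for key, (total, n) in sums.items()}

def likelihood_ratio_chi2(ll_reduced, ll_full):
    """Statistic 2 * (LL_full - LL_reduced) for comparing nested models;
    under the null it is chi-square distributed with df equal to the
    difference in the number of parameters."""
    return 2.0 * (ll_full - ll_reduced)

# alpha = .05 critical values for small degrees of freedom
CHI2_CRIT = {1: 3.841, 2: 5.991}
```

For example, `likelihood_ratio_chi2(-1500.0, -1400.0)` gives 200.0, well above `CHI2_CRIT[2]`, so with 2 degrees of freedom the effect would be significant at α = .05 (the two log-likelihoods here are hypothetical).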
                                                               Table 1: Descriptions for the emergent components
                                                               underlying the human-factor dimensions of the generated
                                                               rationales.

                                                                   Component            Description
                                                                Contextual Accuracy     Accurately describes pertinent events
                                                                                        in the context of the environment.
                                                                    Intelligibility     Typically error-free and is coherent in
                                                                                        terms of both grammar and sentence
                                                                                        structure.
                                                                     Awareness          Depicts and adequate understanding of
                                                                                        the rules of the environment.
                                                                     Relatability       Expresses the justification of the
                                                                                        action in a relatable manner and style.
                  (a) Focus-View condition.                        Strategic Detail     Exhibits strategic thinking, foresight,
                                                                                        and planning.



                                                               Confidence (D1) This dimension gauges the participant’s
                                                               faith in the agent’s ability to successfully complete it’s
                                                               task and has contextual accuracy, awareness, strategic
                                                               detail, and intelligibility as relevant components. With
                                                               respect to contextual accuracy, rationales that displayed
                                                               “. . . recognition of the environmental conditions and
                                                               [adaptation] to the conditions” (P22) were a positive
                                                               influence on confidence ratings, while redundant
                                                               information such as “just stating the obvious” (P42)
                                                               hindered confidence ratings.
                (b) Complete-View condition.                       Rationales that showed awareness “. . . of upcoming
                                                               dangers and what the best moves to make . . . [and] a
            Figure 5: Human judgment results.                  good way to plan” (P17) inspired confidence from the
                                                               participants. In terms of strategic detail, rationales that
                                                               showed ”. . . long-term planning and ability to analyze
criteria that participants used when making judgments about the confidence, human-likeness, adequate justification, and understandability of generated rationales. These situated insights augment our understanding of rationale-generating systems, enabling us to design better ones in the future.
   We analyzed the open-ended justifications participants provided using a combination of thematic analysis [4] and grounded theory [14]. We developed codes that addressed the different types of reasoning behind the ratings of the four dimensions under investigation. Next, the research team clustered the codes under emergent themes, which form the underlying components of the dimensions. Iterating until consensus was reached, the researchers settled on the five most relevant components: (1) Contextual Accuracy, (2) Intelligibility, (3) Awareness, (4) Relatability, and (5) Strategic Detail (see Table 1). To varying degrees, multiple components influence more than one dimension; that is, there is not a mutually exclusive one-to-one relationship between components and dimensions.
   The remainder of this section shares our conclusions about how these components influence the dimensions of the human factors under investigation. When providing examples of our participants’ responses, we use the following notation: P1 corresponds to participant 1, P2 corresponds to participant 2, etc.

information” (P28) yielded higher confidence ratings, whereas those that were “...short-sighted and unable to think ahead” (P14) led to lower perceptions of confidence.
   Intelligibility alone, without awareness or strategic detail, was not enough to yield high confidence in rationales. Moreover, rationales that were unintelligible or incoherent had a negative impact on participants’ confidence:

   The [random and focused-view rationales] include major mischaracterizations of the environment by referring to an object not present or wrong time sequence, so I had very low confidence. (P66)

Human-likeness (D2) Intelligibility, relatability, and strategic detail are the components that influenced participants’ perception of the extent to which the rationales were made by a human. Notably, intelligibility had mixed influences on the human-likeness of the rationales, depending on what participants thought “being human” entailed. Some perceived humans to be fallible and rated rationales with errors as more human-like because rationales “...with typos or spelling errors...seem even more likely to have been generated by a human” (P19). Conversely, some thought error-free rationales must come from a human, citing that a “computer just does not have the knowledge to understand what is going on” (P24).
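Taken together, the relationships between the five components and the four dimensions discussed in this section are many-to-many. As a purely illustrative sketch (the assignments below are compiled from the findings reported here, not from any artifact of the study), this structure can be written as a simple mapping:

```python
# Illustrative only: the five components from the qualitative analysis,
# mapped to the dimensions each was observed to influence. Note the
# mapping is many-to-many, not one-to-one.
COMPONENT_TO_DIMENSIONS = {
    "contextual accuracy": {"adequate justification (D3)", "understandability (D4)"},
    "intelligibility":     {"confidence (D1)", "human-likeness (D2)"},
    "awareness":           {"confidence (D1)", "adequate justification (D3)"},
    "relatability":        {"human-likeness (D2)", "understandability (D4)"},
    "strategic detail":    {"confidence (D1)", "human-likeness (D2)"},
}

def components_for(dimension):
    """Components observed to influence the given dimension."""
    return {c for c, dims in COMPONENT_TO_DIMENSIONS.items() if dimension in dims}
```

For example, `components_for("human-likeness (D2)")` yields intelligibility, relatability, and strategic detail, matching the discussion of that dimension above.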
   With respect to relatability, rationales were often perceived as more human-like when participants felt that “it mirrored [their] thoughts” (P49), and “...[layed] things out in a way that [they] would have” (P58). Affective rationales had high relatability because they “express human emotions including hope and doubt” (P11).
   Like intelligibility, strategic planning had a mixed impact on human-likeness, as it also depended on participants’ perceptions of critical thinking and logical planning. Some participants associated “...critical thinking [and ability to] predict future situations” (P6) with human-likeness, whereas others associated logical planning with a non-human-like, computer-like rigid and algorithmic thinking process.

Adequate Justification (D3) This dimension unpacks the extent to which participants think the rationale adequately justifies the action taken; it is influenced by contextual accuracy and awareness. Participants downgraded rationales containing low levels of contextual accuracy, such as irrelevant details. As P11 puts it:

   The [random and exemplary rationales] don’t pertain to this situation. [The Complete View] does, and is clearly the best justification for the action that Frogger took because it moves him towards his end goal.

   Beyond contextual accuracy, rationales that showcase awareness of surroundings rated high on the adequate justification dimension. For instance, P11 rated the random rationale low because it showed “no awareness of the surroundings”. For the same action, P11 gave high ratings to the exemplary and focused-view rationales because each made the participant “...believe in the character’s ability to judge their surroundings.”

Understandability (D4) For this dimension, components such as contextual accuracy and relatability influence participants’ perceptions of how much the rationales helped them understand the motivation behind the agent’s actions. Contextually accurate rationales had a strong influence on understandability. In fact, many participants expressed that contextual accuracy, not the length of the rationale, was what mattered for understandability. While comparing the exemplary and focused-view rationales for understandability, P41 made a notable observation:

   The [exemplary and focused-view rationale] both described the activities/objects in the immediate vicinity of the frog. However, [exemplary] was not as strong as [focused-view] given the frog did not have to move just because of the car in front of him. [Focused-view] does a better job of providing understanding of the action

   Participants put themselves in the agent’s shoes and evaluated the understandability of the rationales based on how relatable they were. In essence, some asked, “Are these the same reasons I would [give] for this action?” (P43). The more relatable the rationale was, the higher it scored for understandability.

Design Implications
Understanding the components and dimensions can help us design better autonomous agents from a human-factors perspective. These insights can also enable tweaking the network configuration and reverse-engineering it to maximize the likelihood of producing rationale styles that meet the needs of the task, game, or agent persona.
   For instance, given the nature of the inputs, choosing a network configuration similar to the focused-view can afford the generation of contextually accurate rationales. On the other hand, the complete-view network configuration can produce rationales with a higher degree of strategic detail, which can be beneficial in contexts where detail is important, such as an explainable oracle. Moreover, an in-game tutorial or a companion agent can be designed using a network configuration that generates relatable outputs to keep the player entertained and engaged.

Future Work
We can extend our current work to other domains of Explainable AI, exploring applications for other sequential decision-making tasks. We also plan to deploy our rationale generator with a collaborative NPC in an interactive game to investigate how the perception of a collaborative agent changes when players interact with it longitudinally (over an extended period of time). This longitudinal approach can help us understand the novelty effects of rationale-generating agents. Beyond NPCs, our techniques can improve teaching and collaboration in games, especially around improvisation and co-creative collaboration in game-level design.
   Our data collection pipeline is currently designed to work with discrete-action games that have natural break points where the player can be asked for explanations with less disruption than in continuous-time and continuous-action games. The next challenge is to extend and test our approach in more continuous spaces, where states aren’t as well defined and rationales are harder to capture from moment to moment.

Conclusions
In this paper, we explore how human justifications for their actions in a video game can be used to train a system to generate explanations for the actions of autonomous game-playing agents. We introduce a pipeline for automatically gathering a parallel corpus of game states annotated with human explanations and show how this corpus can be used to train encoder-decoder networks. The resultant model translates the state of the game and the action performed by the agent into natural language, which we call a rationale. The rationales generated by our technique are judged better than those of a random baseline and close to matching the upper bound of human rationales. By enabling autonomous agents to communicate the motivations for their actions, we hope to provide users with greater confidence in the agents while increasing perceptions of understanding and relatability.
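The encoder-decoder translation summarized above can be illustrated with a minimal, self-contained sketch. Everything below is a toy stand-in assumed for illustration: the vocabularies, the single-layer recurrent cells, and the dot-product attention (in the spirit of Luong et al. [11]) use random, untrained weights. It shows only the data flow from game-state/action tokens to a generated word sequence, not the authors’ trained model.

```python
import math
import random

random.seed(0)
H = 8  # toy hidden size

def rand_vec(n):
    return [random.uniform(-1.0, 1.0) for _ in range(n)]

def rand_mat(rows, cols):
    return [rand_vec(cols) for _ in range(rows)]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def tanh_vec(v):
    return [math.tanh(x) for x in v]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

class TinyRationaleGenerator:
    """Toy encoder-decoder: game-state/action tokens in, rationale words out."""

    def __init__(self, in_vocab, out_vocab):
        self.in_index = {w: i for i, w in enumerate(in_vocab)}
        self.out_vocab = list(out_vocab)
        self.embed = rand_mat(len(in_vocab), H)       # input embeddings
        self.w_enc = rand_mat(H, 2 * H)               # encoder cell reads [x; h]
        self.w_dec = rand_mat(H, 2 * H)               # decoder cell reads [h; context]
        self.w_out = rand_mat(len(out_vocab), 2 * H)  # output layer reads [h; context]

    def encode(self, tokens):
        # Simple recurrent encoder: h' = tanh(W [x; h]).
        h, states = [0.0] * H, []
        for tok in tokens:
            x = self.embed[self.in_index[tok]]
            h = tanh_vec(matvec(self.w_enc, x + h))  # list + list concatenates
            states.append(h)
        return states

    def decode_step(self, h, states):
        # Dot-product attention over the encoder states.
        weights = softmax([sum(a * b for a, b in zip(h, s)) for s in states])
        context = [sum(w * s[i] for w, s in zip(weights, states)) for i in range(H)]
        logits = matvec(self.w_out, h + context)
        word = self.out_vocab[max(range(len(logits)), key=logits.__getitem__)]
        return word, context

    def rationale(self, state_tokens, max_len=5):
        # Greedily emit one rationale word per step, attending over the input.
        states = self.encode(state_tokens)
        h, words = states[-1], []
        for _ in range(max_len):
            word, context = self.decode_step(h, states)
            words.append(word)
            h = tanh_vec(matvec(self.w_dec, h + context))
        return words
```

For example, constructing `TinyRationaleGenerator(["car_left", "log_right", "move_up"], ["I", "moved", "up", "to", "avoid", "the", "car"])` and calling `rationale(["car_left", "move_up"])` returns a five-word sequence drawn from the output vocabulary; a trained model would instead learn these weights from the annotated corpus.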
                        References
 [1]   Ashraf Abdul et al. “Trends and trajectories for explainable, accountable and intelligible systems: An HCI research agenda”. In: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. ACM. 2018, p. 582.
 [2]   Aswin Thomas Abraham and Kevin McGee. “AI for dynamic team-mate adaptation in games”. In: Computational Intelligence and Games (CIG), 2010 IEEE Symposium on. IEEE. 2010, pp. 419–426.
 [3]   Maria-Virginia Aponte, Guillaume Levieux, and Stéphane Natkin. “Scaling the level of difficulty in single player video games”. In: International Conference on Entertainment Computing. Springer. 2009, pp. 24–35.
 [4]   J. Aronson. “A pragmatic view of thematic analysis”. In: The Qualitative Report 2.1 (1994).
 [5]   Reuben Binns et al. “’It’s Reducing a Human Being to a Percentage’: Perceptions of Justice in Algorithmic Decisions”. In: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. ACM. 2018, p. 377.
 [6]   Dennis M. Buede, Paul J. Sticha, and Elise T. Axelrad. “Conversational Non-Player Characters for Virtual Training”. In: Social, Cultural, and Behavioral Modeling. Ed. by Kevin S. Xu et al. Cham: Springer International Publishing, 2016, pp. 389–399. ISBN: 978-3-319-39931-7.
 [7]   Silvia Coradeschi and Lars Karlsson. “A role-based decision-mechanism for teams of reactive and coordinating agents”. In: Robot Soccer World Cup. Springer. 1997, pp. 112–122.
 [8]   Upol Ehsan et al. “Rationalization: A Neural Machine Translation Approach to Generating Natural Language Explanations”. In: Proceedings of the AAAI Conference on Artificial Intelligence, Ethics, and Society. Feb. 2018.
 [9]   Josua Krause, Adam Perer, and Kenney Ng. “Interacting with predictions: Visual inspection of black-box machine learning models”. In: Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems. ACM. 2016, pp. 5686–5697.
[10]   Leib Litman, Jonathan Robinson, and Tzvi Abberbock. “TurkPrime.com: A versatile crowdsourcing data acquisition platform for the behavioral sciences”. In: Behavior Research Methods 49.2 (2017), pp. 433–442.
[11]   Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. “Effective approaches to attention-based neural machine translation”. In: arXiv preprint arXiv:1508.04025 (2015).
[12]   Grant Pickett, Foaad Khosmood, and Allan Fowler. “Automated generation of conversational non player characters”. In: Eleventh Artificial Intelligence and Interactive Digital Entertainment Conference. Vol. 362. 2015.
[13]   Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. “Why should I trust you?: Explaining the predictions of any classifier”. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM. 2016, pp. 1135–1144.
[14]   Anselm Strauss and Juliet Corbin. “Grounded theory methodology”. In: Handbook of Qualitative Research 17 (1994), pp. 273–285.
[15]   Chek Tien Tan and Ho-lun Cheng. “Personality-based Adaptation for Teamwork in Game Agents”. In: AIIDE. 2007, pp. 37–42.
[16]   Sang-Won Um, Tae-Yong Kim, and Jong-Soo Choi. “Dynamic difficulty controlling game system”. In: IEEE Transactions on Consumer Electronics 53.2 (2007).
[17]   Jason Yosinski et al. “Understanding neural networks through deep visualization”. In: arXiv preprint arXiv:1506.06579 (2015).