Exploring Expertise through Visualizing Agent Policies and Human Strategies in Open-Ended Games

Steven Moore
Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh, PA 15213
StevenJamesMoore@gmail.com

John Stamper
Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh, PA 15213
jstamper@cs.cmu.edu

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
In this research, we explore the problem solving strategies of both humans and AI agents in the open-ended domain of video games. We utilize data collected from several human-level performing AI agents, which follow a given policy, and data from expert human players, who follow a set of strategies, for two Atari 2600 console games. We compare both types of data streams using a visualization technique to gain insights about how each player type, AI or expert human, goes about solving the given games. Analyzing the action sequences of the two, we demonstrate how closely the agent policies resemble the real-world problem solving of a human player, and explore how we might extract human-level strategies from agent policies. We reflect on the benefits of using data from both AI agents and expert humans to instruct learners and model their behaviour, and on how strategies may be more apparent and easier to adopt from human play. Finally, we hypothesize the benefits of combining both types of data for learning these complex tasks within open-ended domains.

Keywords
expertise, strategy, gameplay agent, visualization, t-SNE, deep reinforcement learning

1. INTRODUCTION
The process of building expertise, especially in complex tasks, has been an area of study for some time in education [11]. Issues related to the difficulty of data collection and storage have been an impediment to the educational data mining (EDM) research community exploring many truly open-ended complex tasks. In this research, we are taking steps to better understand how to collect, analyze, and gain a better understanding of complex environments where human expertise in the form of strategies may be used. We have selected a classic video game system environment based on the Atari 2600 console called the Arcade Learning Environment (ALE) [2]. ALE has generated a large amount of interest in recent years in the broader artificial intelligence and machine learning communities as a test bed for game playing agents. While the majority of work with this environment is focused on building general game playing agents, we have found the environment provides a useful test bed for understanding how humans learn and apply strategies, which can also be compared to agents. To this end, we are currently trying to understand how agents that have met or exceeded human-level capabilities at these games encode strategies in their game policies, and how their strategies compare to expert human players.

In the development of these agents, it is the human encoding the strategy into the AI using their knowledge of the game. The majority of game-playing agents, however, make use of deep neural nets to develop their policies, which makes them black box and often difficult for a human to interpret. Recent work has looked at making policies developed this way programmatically interpretable, but much work remains before humans are able to clearly articulate what many of these agents have learned from their training [31]. It is debatable whether these deep reinforcement learning agents make use of explicit strategies as they execute their given policies. A recent approach uses saliency maps to highlight key decision regions for agents in ALE, and found that their agent for the Space Invaders videogame learned a sophisticated aiming strategy [12]. Another way to make policies less black box is to break the policy down into smaller subtasks, each comprised of a few actions, that feed back into the overall policy [20]. These techniques of breaking down policies into smaller interpretable strategies and visually representing the mechanisms of an agent's policy are steps toward having humans learn strategies from agents, without directly encoding any into the agent itself.

While previous work continues to reduce the amount of training data required to develop successful agents via self-learning, others look to use human games to seed agents. One such study found that, by training on human data, they could achieve scores comparable to state-of-the-art reinforcement learning techniques, and could even beat those scores using just the top 50% of their collected data for more complicated games, such as pinball [16]. Combining a method that not only trains agents on expert human data, but also encodes their strategies into the form of an evaluation function, has the potential to yield successful agents that require less computational time while performing at greater levels than comparable agents.
Data from stochastic and adversarial domains remains challenging to mine, interpret, and visualize in a way that improves the understandability of the data. Video game data collected in the ALE is representative of this challenging domain, while also being open ended. Data mining and visualization techniques applied to such data can readily be leveraged for more traditional educational domains, such as solving a stoichiometry problem or completing a task in a physics simulator. One technique to help visualize such game data, in a way that enables us to make comparisons, is t-distributed stochastic neighbor embedding (t-SNE) [19]. This is a technique used to visualize high-dimensional datasets, and it has previously been used a few times to visualize and interpret game data [22, 28]. By applying t-SNE for dimensionality reduction and visualization to the data, similar clusters detailing potential strategies and policy enactment may emerge. Visually representing the mechanisms of an agent's policy can provide a step towards having humans learn strategies from these agents, gaining their expertise.

In complex tasks humans generate strategies which can be applied in many different situations. Combinations of strategies that lead to optimal outcomes can lead to expertise in a domain, although there is still no consensus among researchers as to what makes a person an expert and how expertise is defined. In this research we explore the interactions of policies and strategies, then look at how both relate to expertise in the context of these two games. Our long term goal is to see how humans can help teach agents and agents can help teach humans in a continuous loop, hence the idea of "teachable humans and teachable agents." Specifically, the main contribution of this work is a start toward this goal with a novel comparison of agent policies, generated with two different state-of-the-art techniques on several complex game domains, and strategies generated from human players. We do this by visualizing both types of collected gameplay data, expert human and agent, using t-SNE diagrams of the state spaces as a means to compare the two. We believe this work can help lead to a better understanding of human strategies and expertise, while also contributing to data mining techniques which can further be used in the context of explainable AI for educational systems. The visualization and comparison techniques used can be extended to more traditional educational games, to gain a sense of any strategies being enacted. Additionally, it is beneficial to see if the agents are solving the game in a natural way, using a human-like strategy, as many similar systems are often designed to playtest such games and act as tutors to the users.

2. RELATED WORK
Expertise has been a subject at the crossroads of Psychology and Computer Science for some time. One of the first compiled works, The Nature of Expertise by Glaser et al. [11], explored a wide variety of domains from human typing to sports to ill-defined domains. A key insight from this work is that in the early development of AI systems, expertise was tightly related to the concept of encoding human strategies into machines, such as early work involving chess players and intelligent tutors [4]. As work continued, there seems to have been a drift from the Psychology field into architectures of cognition, defined by ACT-R [1] and Soar [17] as examples. Computer Science moved towards agents and policy creation, focusing early on reinforcement learning [29] and now on advanced techniques built on deep learning [18].

2.1 Human Expertise and Strategies
The question of what exactly defines someone as an expert is still open and has a lot to do with the particular domain being studied. Chase and Simon posited that it takes 10,000 hours of study to become an expert in chess [4]. That number has also been suggested as the rough number of hours needed to become an expert musician [9] and underpins a general theory of expertise [10], although largely due to Simon's chess work.

In the case of learning systems, we often define mastery using some form of knowledge tracing. These systems often set "mastery" as a probabilistic value that a learner knows a particular skill. The value of mastery varies across skills and domains, but often a learner with a value of 90% or 95% is assumed to have achieved mastery [7] (a minimal sketch of such a check is shown below).
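The following sketch illustrates one common way such a threshold is applied, using a Bayesian knowledge tracing update in the spirit of [7]. The slip, guess, and learn parameters, the 0.95 cutoff, and the response sequence are illustrative assumptions rather than values taken from the cited work.

# Minimal sketch of a Bayesian knowledge tracing (BKT) mastery check.
# All parameter values here are illustrative assumptions.

def bkt_update(p_know, correct, p_slip=0.1, p_guess=0.2, p_learn=0.15):
    """Update P(skill known) after observing one response."""
    if correct:
        evidence = p_know * (1 - p_slip)
        marginal = evidence + (1 - p_know) * p_guess
    else:
        evidence = p_know * p_slip
        marginal = evidence + (1 - p_know) * (1 - p_guess)
    posterior = evidence / marginal
    # Account for the chance the skill was learned at this opportunity.
    return posterior + (1 - posterior) * p_learn

p_know = 0.3  # prior estimate that the learner already knows the skill
for response in [True, True, False, True, True, True]:
    p_know = bkt_update(p_know, response)
print(f"P(known) = {p_know:.2f}, mastered: {p_know >= 0.95}")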
Beyond measurements of expertise, it is also important to qualitatively understand the strategies associated with expertise. Understanding the strategies that are used to solve problems has been explored in many domains. Tasks to elicit knowledge from experts, such as cognitive task analysis (CTA), have been used by cognitive scientists to better understand the strategies that experts use but may not explicitly recognize [6]. In a mathematics study on word problems where students were using cognitive math tutors, researchers noted three different strategies that students used to solve problems [8]. These strategies included (1) working backwards from the answer, or unwinding, (2) plugging in values in a hill-climbing fashion, and (3) using equations. With the correct structure of the problems, these strategies could be explicitly identified.

2.2 Agent Expertise and Policies
Artificial intelligence has been used for decades to create agents that mimic human behavior. These agents are generally driven by a policy created by some form of machine learning, such as reinforcement learning [29]. The policy tells the AI agent what to do given a certain set of conditions. This is most often defined as a state-action graph that suggests the best possible next action for an agent in a given state [27].

In education, agents driven by policies have long been a foundational part of data-driven intelligent tutors and adaptive learning. Work has been done on modeling learning as a partially observable Markov decision process (POMDP), and using the generated policy to predict what a student knows and what the next best instructional lesson is for a particular student [24]. Other research has used reinforcement learning with a focus on which pedagogical action would be best for a student when multiple actions are available [5]. Most closely associated with the research we are doing is work on the automatic generation of hints and feedback [25, 26]. This work uses state graphs and reinforcement learning to identify the best path for solving problems, and uses the state features of the next best state to generate a just-in-time hint, like the next optimal move in a game [23]. This type of feedback can lead the student down a better path for learning.
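As a concrete illustration of a policy as a state-action mapping, the following is a minimal sketch of a greedy policy over a toy state-action value table. The states, actions, and values are hypothetical and purely for illustration; they are not taken from the systems cited above.

# Minimal sketch: a policy as a greedy lookup over state-action values.
# Given a state, the policy suggests the next best action.

from typing import Dict, Tuple

QTable = Dict[Tuple[str, str], float]  # (state, action) -> estimated value

q_values: QTable = {
    ("one_column_left", "move_left"): 0.2,
    ("one_column_left", "move_right"): 0.4,
    ("one_column_left", "fire"): 0.9,
}

def greedy_policy(state: str, q: QTable) -> str:
    """Return the highest-valued action available in the given state."""
    actions = {a: v for (s, a), v in q.items() if s == state}
    return max(actions, key=actions.get)

# A hint generator in the style of [23, 25] could surface this action,
# or the state it leads to, as a just-in-time suggestion.
print(greedy_policy("one_column_left", q_values))  # -> "fire"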
2.3 Comparing Human & Agent Gameplay
Visualizations of gameplay data are widely popular, often being used by players to compare their performance against others and to make sense of how they played the game [32]. For instance, heat maps have been used by players to refine gameplay strategies, providing insights into popular areas of a game's environment [15]. In a similar vein, saliency maps have been applied to gameplay from agents, acting as heat maps for the activations in their neural nets [12]. From such visualizations, it became clear that the agent was enacting a form of a strategy around aiming, as a human player would do. Another use of saliency maps, combined with t-SNEs, looked to describe the policies agents were using [34]. This was done not only to make the agents less black-box and more understandable, but also to see if they followed any set strategies.

Many games are making use of such game-playing agents and procedurally generated content methods both to develop the game environment and to play-test the games [13]. Much like the agents developed for the ALE, these game-playing agents play a game in order to find any bugs or areas of improvement. With such large amounts of data coming from even the simplest games, many tools have been developed to assist in the visualization and analysis process [33]. Using visualizations is one way to gain insights into any human-like strategies being enacted by such agents. This is important, as an agent might not be of much use if it plays the game, but not in a way that a human user does. In open-ended games with a massive state space, mimicking human play as closely as possible helps to provide the most accurate data and bug testing from the agent.

3. METHOD
3.1 Environment & Games
The Arcade Learning Environment (ALE) provides a framework consisting of over fifty Atari 2600 games that can be used to evaluate competency in deep reinforcement learning (DRL) agents and other types of AI [2]. Despite having a limited amount of input, a fire button and four directional controls, many of the games consist of complex tasks in open-ended worlds, making them a fitting testbed for DRL agents. Using the ALE, we focused on gameplay from two distinct games for the Atari 2600. The first game is Space Invaders, which is one of the simpler games for the system, consisting of just four non-combinational inputs. The second game is Seaquest, which incorporates all input combinations available for the Atari 2600, making it a much more complex and challenging game for both humans and agents.
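For readers unfamiliar with the ALE interface, the following is a minimal sketch of stepping a game and logging the 128-byte console RAM state at each frame, assuming the ale-py Python bindings and locally available ROM files. The ROM path and the random stand-in policy are placeholders rather than our actual setup; querying the minimal action set also makes the input difference between the two games visible.

# Sketch: step an ALE game and log the RAM state per frame (assumes ale-py).
import random
import numpy as np
from ale_py import ALEInterface

def record_episode(rom_path, max_frames=2500):
    ale = ALEInterface()
    ale.loadROM(rom_path)
    actions = ale.getMinimalActionSet()  # few actions for Space Invaders, many for Seaquest
    ram_trace, score = [], 0
    for _ in range(max_frames):
        if ale.game_over():
            break
        score += ale.act(random.choice(actions))  # stand-in for an agent's or human's input
        ram_trace.append(ale.getRAM().copy())     # one 128-byte row per frame
    return np.array(ram_trace), score

# Hypothetical ROM path; one trace per recorded game play.
ram, score = record_episode("roms/space_invaders.bin")
print(ram.shape, score)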
3.1.1 Space Invaders
In the game Space Invaders, depicted in Figure 1, the player or agent controls a ship at the bottom of the screen that can navigate along a single dimension, left or right. The goal of the game is to destroy all the enemy units above the player's ship, gaining points for each enemy destroyed, while also avoiding any projectiles from them. If the player is struck by an enemy projectile they lose one of their three lives. To destroy these enemy units, the player's ship can fire a projectile that travels directly up, damaging or destroying an enemy unit on contact. Additionally, the player can hide behind three objects at the bottom of the screen to avoid the enemy fire. The only valid controls for this game are left and right, to move the player's ship, and the fire button to shoot.

Figure 1. In Space Invaders the player controls the green ship at the bottom of the screen and must shoot the invaders that proceed left, then down, then right.

3.1.2 Seaquest
In the game Seaquest, depicted in Figure 2, the ultimate goal is to retrieve as many scuba divers from under the water as possible. The player or agent controls a submarine that can navigate in all directions around the screen and faces in the direction of movement, either right or left. The submarine has an oxygen tank gauge that slowly diminishes over time, and the player must surface at the top of the screen to refill it. As they navigate around the screen collecting the divers, they also must dodge enemy ships and sharks that move across the map. If their submarine collides with an enemy unit or the oxygen gauge reaches zero, they lose one of their three lives. To combat these enemies, they are able to shoot a projectile from the front of the submarine, which damages or destroys these enemy units. Killing an enemy results in a point increase, but the main increase in points comes from saving the divers. In order to receive points for the collected divers, the submarine must surface by navigating to the very top of the map. All valid button combinations for the Atari 2600 controller work for this game, such as up-left-fire, right-fire, and down.

Figure 2. In Seaquest the player controls the yellow submarine, collects the scuba divers, and shoots or avoids the enemies.

3.2 Agent Dataset
As the ALE provides a framework for testing DRL agents, we selected two higher performing agents implemented in the environment using value-based DRL algorithms. The first agent utilizes a Deep Q-Network (DQN) and has achieved a level comparable to a human professional in almost fifty games, including the two we investigate [21]. Our second is an agent known as Rainbow, which is built upon a DQN variant and has achieved even greater scores across the same Atari 2600 games [14]. We selected the DQN agent as it is often cited as a baseline for this domain. The Rainbow agent was selected for its high scoring performance, while still mimicking human play when observed. For instance, Rainbow will move the player avatar about the screen in Seaquest, rather than stay at the very bottom of the screen to avoid enemies, as some agents do. The data for both of these agents come from the benchmarks used in the Atari Zoo, an open-source set of trained models for six major DRL algorithms at varying benchmarks, collected from the ALE [28]. Other DRL algorithm agents implemented in the Atari Zoo, such as A2C, perform at lower levels than DQN and Rainbow while not mimicking human gameplay [22]. For this reason, we did not select those agents, as we wanted high performing ones for both games.
Table 1 shows the maximum score achieved per Space Invaders game for the Rainbow agent, the DQN agent, and the expert human. Cells with multiple scores indicate the agent or human lost all lives during that session and restarted play within the limited number of frames recorded. Thus, a single score indicates the agent or human did not lose all of their lives during the recorded play. Table 2 shows similar information, but for the game plays from Seaquest. In particular for the DQN agent, as shown in the second game play, the five low scores indicate the agent lost all lives and had to restart play five times in the allotted steps.

Table 1. The highest score(s) achieved for the two agents and the expert human in the collected Space Invaders data over three different game plays.

Space Invaders Game    DQN          Rainbow      Human
1                      2380         1805, 990    1685
2                      1495, 600    3750         1745
3                      1345, 830    3845         1845

Table 2. The highest score(s) achieved for the two agents and the expert human in the collected Seaquest data over three different game plays.

Seaquest Game    DQN                    Rainbow    Human
1                800, 1400              4960       12590
2                60, 60, 60, 60, 100    5020       14220
3                3900, 500              7840       16880

3.3 Expert Human Dataset
Using the ALE, we collected expert human data for both Space Invaders and Seaquest. To collect the human game play data, we modified the ALE code to record the RAM state at each frame of gameplay, so that it could be compared to the agent data from the Atari Zoo. Using Atari 2600 data collected from the Atari Grand Challenge project as a baseline for Space Invaders, our collected expert human data ranks in the top 1% based on scores [16]. We were unable to use the data collected by the Atari Grand Challenge directly, as we needed the RAM states in order to visualize the data in the Atari Zoo.

3.4 Visualizing
A popular technique for dimensionality reduction and visualization of high-dimensional data, used with large reinforcement learning datasets, is t-SNE [22]. It provides a way to plot the data, from both agents and humans, along varying dimensions, clustering related frames together. Our data, for both agents and humans, consisted of the Atari RAM representation, which is the same across agent algorithms and runs, but distinct between the games. Traditionally, t-SNE embeddings are used for a single high-level representation of an agent. However, since our datasets all use the Atari RAM representation, this enables us to make comparisons between different runs of an agent for the same algorithm and runs from different DRL algorithms. As these datasets were quite large for both games, we pre-processed them using Principal Component Analysis (PCA) to a dimensionality of 50, then followed that with 300 t-SNE iterations with a perplexity of 30 [30]. Note that t-SNE positions the points on a plane such that the pairwise distances between them minimize a certain criterion. As a result, the axes cannot be labeled with a specific unit, due to the high dimensional nature of the data.

Utilizing the code provided with the Atari Zoo [28], we are then able to visualize the processed agent and human data in a t-SNE embedding with associated screenshots. Each point in the resulting t-SNE embeddings represents a separate frame from the agent or human. Points are colored corresponding to their source, and transparency is used to indicate score, with a darker color indicating a higher score. The clustering of the points helps to indicate the distribution of states, corresponding to behaviour, that the agent or human visited. Additionally, a point can be clicked on to view a screenshot of the game at that frame. This provides another means of analyzing the agent-collected data, in addition to providing a means of comparison to our collected human data.
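For reference, a roughly equivalent pipeline can be sketched with scikit-learn and matplotlib instead of the Atari Zoo code. The PCA-to-50 reduction and the perplexity-30, 300-iteration t-SNE settings mirror those described above, while the input arrays, source labels, and color mapping are placeholders (and the exact t-SNE iteration parameter name varies across scikit-learn versions).

# Sketch: PCA to 50 dims, then 2-D t-SNE, colored by source with
# transparency scaled by score (darker = higher score).
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import to_rgba
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def embed_and_plot(frames_by_source, colors={"rainbow": "green", "dqn": "blue", "human": "red"}):
    """frames_by_source: {source: (ram_frames [n, 128], per-frame scores [n])}."""
    labels, frames, scores = [], [], []
    for source, (ram, s) in frames_by_source.items():
        labels += [source] * len(ram)
        frames.append(ram)
        scores.append(s)
    X = np.vstack(frames).astype(float)
    labels, scores = np.array(labels), np.concatenate(scores)

    X50 = PCA(n_components=50).fit_transform(X)
    emb = TSNE(n_components=2, perplexity=30, n_iter=300).fit_transform(X50)

    alphas = 0.2 + 0.8 * scores / scores.max()
    for source, color in colors.items():
        mask = labels == source
        rgba = [to_rgba(color, a) for a in alphas[mask]]
        plt.scatter(emb[mask, 0], emb[mask, 1], color=rgba, s=6, label=source)
    plt.legend()
    plt.xticks([]); plt.yticks([])  # the axes carry no meaningful unit
    plt.show()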
4. RESULTS
4.1 Space Invaders
Plotting the DQN, Rainbow, and expert human data from Space Invaders via t-SNE, we can see both similarities and differences in the clustering. Figure 3 depicts a t-SNE embedding of nine Space Invaders games in total, three from each agent and three from the expert human. The agent data, green depicting Rainbow and blue depicting DQN, overlaps more throughout the graph than the human data points, represented by red. A majority of the human data clusterings are on the bottom half of the t-SNE, where there appears to be only a single Rainbow and DQN cluster. There is an equal separation of high scoring points, depicted by darker shades of each color, for all three parties. High scoring human points of dark red are scattered about, while the dark blue DQN data is grouped toward the upper center. Above that is the dark green Rainbow data, grouped between the 100 and 150 marks of the y-axis. Ultimately, while there is similar clustering of the Rainbow and DQN agents across all three games, this does not hold when the game is coming to an end and a higher score has been achieved. Additionally, regardless of the game's score, the human data does not seem to have much overlap with either agent.

Figure 3. Two-dimensional t-SNE embedding of Space Invaders gameplay collected from three games using the Rainbow agent, depicted in green, three games from the DQN agent, depicted in blue, and three games from the expert human, depicted in red, for nine games in total.

To further identify any interesting clustering of the points, we selected a single game play from each of the two agents and the human data, so the t-SNE would show one from each for a total of three, instead of the aforementioned nine. The resulting t-SNE is shown in Figure 4, along with screenshots that are representative of the major clusters. We included screenshots for six clusters, two from each source, that are darker in color, corresponding to a higher score and being further along in the game. Since this depicts a later point in the game, any key moves or strategies are more visible, since they have had time to be enacted. With a single game for each agent or human depicted, the representative clusters stand out even more.

Figure 4. A t-SNE for a single game of Space Invaders from the green Rainbow agent, blue DQN agent, and red expert human, with screenshots depicting the largest and highest scoring clusters.
4.2 Seaquest
Following the same steps as for the Space Invaders data, we plotted the collected Seaquest gameplay data via t-SNE. Figure 5 depicts the t-SNE embedding of nine Seaquest games in total, three from each agent and three collected from the expert human. Similar to the Space Invaders t-SNE embedding, the Rainbow and DQN agents, represented by green and blue respectively, overlap more with one another than with the expert human data, represented by the red points. However, all three types of points are much less clustered into groups and more spread out throughout the given range, indicating a greater variance of game states between the two agents and the expert human. One notable clustering resulting from all nine games is a grouping in the center, where the Rainbow agent, shown in green, almost perfectly overlaps the DQN agent, shown in blue. For this cluster, the points for both agents are also darker, indicating they are higher scoring states that occur later during the game play.

Figure 5. Two-dimensional t-SNE embedding of Seaquest gameplay collected from three games using the Rainbow agent, depicted in green, three games from the DQN agent, depicted in blue, and three games from the expert human, depicted in red, for nine games in total.

As the resulting data appears to be fairly scattered for all nine games of Seaquest, we selected just the third play through for both agents and the human and displayed it via t-SNE, shown in Figure 6. With just a single game from each source displayed, several clusterings became more apparent. The Rainbow agent has three distinct clusters, two of which overlap with the DQN agent. Screenshots from these two clusters depict the player unit, the yellow submarine, towards the center of the screen with no enemies around. The representative screenshots depicting the expert human data, via the red points, show the ship positioned away from the center and with more enemy units about. This suggests a potential difference in gameplay between the agents and the human, which we elaborate on in the following discussion.

Figure 6. Representative screenshots for the various clusters about the t-SNE embedding for the third play of Seaquest from each agent and human dataset. Green points represent the Rainbow agent, blue points the DQN agent, and red points the human.

5. DISCUSSION
Plotting the game data via t-SNE provides a concise visualization of such high-dimensional data. However, the plots are only beneficial to us if their clusterings detail patterns that might be indicative of a strategy or interesting behaviour. One immediate clustering that caught our eye was in the t-SNE depicting a single game of Space Invaders for the two agents and the human. As Figure 7 shows, there is a clear dark red cluster of expert human data toward the center, indicating there are many similar game states here, and ones with a comparatively high game score. Examining the points on this curving cluster, we noticed the screenshots representing the game states at the time had a clear similarity. The states in this cluster were for when a single enemy ship was left on the map, the point right before the player can advance to the next stage. It became clear that the human player had difficulty hitting the last few enemies, as they move fast and require precise aiming when there are not many left. A nearby cluster from the Rainbow agent, represented in green and also highlighted in Figure 7, depicts a similar set of states. However, there are not as many points for the agent in this set of states as there are for the human, suggesting the agent can more accurately hit the fast moving last remaining enemies.

While this is not a particular strategy, it does provide insights into similar difficulties both agent and human have in the game. It also aligns with the maximum scores both agent and human achieved, as the agent spent less time on this phase and could advance through the game more rapidly, achieving a higher score in the allotted time, which one might equate to expertise in this domain. It is a case where the Rainbow agent reflects a difficulty also encountered by the human. If this were in the context of an educational game, we could use the agent's data to gain insights into where a hint or other feedback might be most useful, as it is a clear point of difficulty. Additionally, the DQN agent did not demonstrate such difficulty. If just the DQN agent's data were used for such playtesting, this area of struggle may have been missed altogether. This insight was provided through a brief visual inspection enabled via t-SNE, and may not be as readily apparent from parsing log data.

Figure 7. A long clustering of red points, representing human data, with screenshots from the top and bottom, showing a pattern of the player attempting to destroy the last remaining enemies. The nearby green points, for the Rainbow agent, represent a similar clustering as the agent finishes the final enemy.
Another analysis of the same t-SNE plot for a game of Space Invaders provided insights into a distinct shooting strategy that the two agents and the human each had. When we inspected clusters and points that represented the game at a halfway point, where half the enemy units on the screen were destroyed and the other half alive, we noticed an interesting pattern in the configuration of the remaining enemies. As Figure 8 shows, the Rainbow and DQN agents target enemies either horizontally across the bottom or in a diagonal pattern. However, the expert human destroys the enemy units starting from the left column and working right. While each of these represents a different shooting strategy for the given player, the expert human's strategy is debatably the most optimal. In Space Invaders, the enemy units move across the screen horizontally, and once they reach the edge of the screen they move down a single row and continue moving in the opposite horizontal direction. This means that if there are few enemy columns remaining, it takes the enemies a greater amount of time to traverse the screen horizontally, allowing the player more time to fire at them.

Figure 8. The DQN and Rainbow agent cluster towards the bottom shows that halfway through the game, they keep the enemies clustered. This contrasts with the human data, where instead of shooting across the bottom or diagonally, the player starts from the leftmost column and works right.

There are trade-offs to this strategy though: if the bottom row of the enemy units comes into contact with the ground, the game is over. This may be the reason why the deep reinforcement learning trained agents shoot in a horizontal or diagonal pattern, so that they keep the bottom row higher up and avoid the game over condition, something they must have encountered quite often during their early training phases. However, this does not translate to an optimal strategy, as the expert human data reveals. Teaching a player the game using such agents could lead to the adoption of this firing strategy, which would be suboptimal compared to that of the human. Such a case could also readily apply to more educational game contexts, as an agent or tutor that learned to play or solve the problem may be doing so in a non-optimal way compared to that of an expert human. Even though the "score" for a given game is greater, ultimately learning the better strategy would have a greater payoff in the long run.
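The timing argument behind the human's column-clearing strategy can be sketched with a back-of-the-envelope calculation. The screen width, column spacing, and formation speed below are hypothetical round numbers, not values measured from the actual game.

# Hypothetical numbers, only to illustrate the traversal-time argument.
SCREEN_WIDTH = 160     # width of the play area, in pixels
COLUMN_SPACING = 16    # horizontal spacing between adjacent enemy columns
SPEED = 2              # pixels the formation shifts per step

def steps_before_next_drop(columns_remaining: int) -> int:
    # Steps the formation takes to cross the free horizontal space
    # before it reaches an edge and drops down a row.
    formation_width = (columns_remaining - 1) * COLUMN_SPACING
    return (SCREEN_WIDTH - formation_width) // SPEED

for cols in (6, 3, 1):
    print(cols, "columns ->", steps_before_next_drop(cols), "steps until the formation descends")
# Fewer remaining columns leave more free horizontal travel, so the
# formation descends less often and the player gains time to aim.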
Examining the t-SNE for a game of Seaquest also revealed insights into the differing navigational strategies used by agent and human. For this t-SNE plot, there was less clustering compared to the Space Invaders ones. However, as we investigated the different points and viewed the screenshots for representative states, a pattern with the player-controlled submarine emerged. For both the DQN and Rainbow agents, the submarine remained towards the bottom of the map and stayed centered on the y-axis, unless it was briefly moving to rescue a diver. However, the human points showed the submarine in a variety of positions that were far from the center or bottom, even without the presence of these scuba divers. As Figure 9 depicts, the human made use of more free-moving navigational behavior, traversing the entire map and getting towards the edges to allow themselves more time to position and fire at enemies. The agents, who presumably had better accuracy from their extensive training, could remain towards the center and only leave the bottom when they had to move upwards to fire at an enemy or surface for oxygen.

Figure 9. The red human data shows that the human moves about the screen more compared to the agents, depicted in blue and green, who most often end up in the center of the y-axis, particularly towards the bottom.

Similarly to the Space Invaders strategy of creating different enemy configurations, the agents for Seaquest had their own strategy that differed from the expert human gameplay. In this game's context, there is not necessarily a clear benefit of one navigational strategy over the other. However, the one used by the agents might be better suited for a player who has better aim and does not need to get their avatar close to the enemy units. If a human user learned from the actions of these agents, they might not move around the map as the expert human gameplay did. While not necessarily impacting the score, it could impact their enjoyment of the game, as they will be making fewer movements and have less control over their avatar, compared to treating it as essentially fixed along the y-axis. This is another consideration when using just an agent to playtest or learn from: even when it might not impact performance, other factors like enjoyment might be affected by the enactment of certain agent-performed strategies.

This research represents our initial exploratory work into understanding expertise in complex tasks in open ended domains using a combination of human and artificial intelligence agents. We have begun by plotting expert human and human-level agent data using t-SNEs to provide a way for us to visualize the data. We can see from the plotted t-SNEs that expert human data does have some overlap with data from high performing DRL agents; however, gaps exist where human clusters are far away from the agent data. Nevertheless, strategies from both human and agent data emerge in the visualizations and allow for some interesting comparisons between the two. There are clear implications of using just an agent's gameplay, as the enacted strategies may be optimal, but limiting to a user's play. The agents also might demonstrate a clear strategy, such as the firing configuration in Space Invaders, yet such a strategy could actually be sub-optimal for a human to enact.

While these two games are not traditional educational ones, the implications of the techniques used and insights gained are still applicable to games in such a context. Eliciting strategies, regardless of whether they come from an AI system or a human, is challenging, and such visualizations provide one way to search for and understand them. At present, agents that use similar mechanisms and reinforcement learning methods to solve problems and then instruct students [3] could benefit from the use of t-SNE visualization of the collected data. Their developers want to ensure the strategies and suggested instruction are optimal, while remaining natural, as a human would act. A suggested strategy is not useful if a human cannot enact it, for instance because the agent had different control during the training process, such as access to frame-by-frame data in the game, giving it greater accuracy.

6. CONCLUSION & FUTURE WORK
Our primary goal in this work is to explore expertise, in this case in the context of games. In such games, prior work often uses the score as a measure of how expert a player, either human or agent, is at the game. We believe that, in addition to the score, the strategies used to solve the game impact how expertise in this domain can be quantified. To gain insights into such strategies, we visualized the gameplay data of a high scoring and long time playing human, deemed an expert, and of high scoring agents via t-SNE. Analysis of the resulting t-SNEs yielded insights into both shared and differing strategies the two parties had.
Even between agents, there existed similar and dissimilar strategies, in addition to their score variance. Taking into account these gameplay differences, and how realistic an enacted strategy might be for a human to learn from or mimic, is important for game and tutor developers to keep in mind when using agents as playtesters or instructors. A strategy might seem beneficial, yet compared to a different one it may not be as optimal nor practical for a player or learner to utilize in their own gameplay.

As we continue this work, we want to extend it to more games other than Space Invaders and Seaquest, particularly ones in the educational space that also have accompanying agent-collected data. Further inspection remains to be done to draw more strategies from the accompanying visualizations. Following this, we will look further into how they cluster, indicating the performance of similar strategies based on their policies. One key area we plan to explore is adding a temporal aspect to the t-SNE graphs. Although not represented in our current visualizations, we do have the screenshots numbered temporally, so we expect that we can connect the paths to show the progression of game play. Additionally, visualizing novice human data, in addition to the expert and agent data, could provide useful strategy comparisons. This could help developers of educational games find where their novice learners seem to struggle the most, from a visual standpoint.

7. ACKNOWLEDGMENTS
The research reported here was supported in part by a training grant from the Institute of Education Sciences (R305B150008). Opinions expressed do not represent the views of the U.S. Department of Education.

8. REFERENCES
[1] Anderson, J.R., Matessa, M. and Lebiere, C. 1997. ACT-R: A theory of higher level cognition and its relation to visual attention. Human-Computer Interaction. 12, 4 (1997), 439–462.
[2] Bellemare, M.G., Naddaf, Y., Veness, J. and Bowling, M. 2013. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research. 47, (2013), 253–279.
[3] Chaplot, D.S., MacLellan, C., Salakhutdinov, R. and Koedinger, K. 2018. Learning Cognitive Models Using Neural Networks. International Conference on Artificial Intelligence in Education (2018), 43–56.
[4] Chase, W.G. and Simon, H.A. 1973. Perception in chess. Cognitive Psychology. 4, 1 (1973), 55–81.
[5] Chi, M., VanLehn, K., Litman, D. and Jordan, P. 2011. An evaluation of pedagogical tutorial tactics for a natural language tutoring system: A reinforcement learning approach. International Journal of Artificial Intelligence in Education. 21, 1–2 (2011), 83–113.
[6] Clark, R.E. and Estes, F. 1996. Cognitive task analysis for training. International Journal of Educational Research. 25, 5 (1996), 403–417.
[7] Corbett, A.T. and Anderson, J.R. 1994. Knowledge tracing: Modeling the acquisition of procedural knowledge. User Modeling and User-Adapted Interaction. 4, 4 (1994), 253–278.
[8] Croteau, E.A., Heffernan, N.T. and Koedinger, K.R. 2004. Why are algebra word problems difficult? Using tutorial log files and the power law of learning to select the best fitting cognitive model. International Conference on Intelligent Tutoring Systems (2004), 240–250.
[9] Ericsson, K.A., Prietula, M.J. and Cokely, E.T. 2007. The making of an expert. Harvard Business Review. 85, 7/8 (2007), 114.
[10] Ericsson, K.A. and Smith, J. 1991. Toward a General Theory of Expertise: Prospects and Limits. Cambridge University Press.
[11] Glaser, R., Chi, M.T. and Farr, M.J. 1985. The Nature of Expertise. National Center for Research in Vocational Education, Columbus, OH.
[12] Greydanus, S., Koul, A., Dodge, J. and Fern, A. 2017. Visualizing and Understanding Atari Agents. arXiv:1711.00138 [cs]. (Oct. 2017).
[13] Guckelsberger, C., Salge, C., Gow, J. and Cairns, P. 2017. Predicting Player Experience Without the Player: An Exploratory Study. Proceedings of the Annual Symposium on Computer-Human Interaction in Play (New York, NY, USA, 2017), 305–315.
[14] Hessel, M., Modayil, J., Van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M. and Silver, D. 2018. Rainbow: Combining improvements in deep reinforcement learning. Thirty-Second AAAI Conference on Artificial Intelligence (2018).
[15] Kriglstein, S., Wallner, G. and Pohl, M. 2014. A User Study of Different Gameplay Visualizations. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (New York, NY, USA, 2014), 361–370.
[16] Kurin, V., Nowozin, S., Hofmann, K., Beyer, L. and Leibe, B. 2017. The Atari grand challenge dataset. arXiv preprint arXiv:1705.10998. (2017).
[17] Laird, J.E., Newell, A. and Rosenbloom, P.S. 1987. Soar: An architecture for general intelligence. Artificial Intelligence. 33, 1 (1987), 1–64.
[18] LeCun, Y., Bengio, Y. and Hinton, G. 2015. Deep learning. Nature. 521, 7553 (2015), 436.
[19] Van der Maaten, L.J.P. and Hinton, G.E. 2008. Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research. 9, (2008), 2579–2605.
[20] Lyu, D., Yang, F., Liu, B. and Gustafson, S. 2018. SDRL: Interpretable and Data-efficient Deep Reinforcement Learning Leveraging Symbolic Planning. arXiv:1811.00090 [cs]. (Oct. 2018).
[21] Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S. and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems (2013), 3111–3119.
[22] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K. and Ostrovski, G. 2015. Human-level control through deep reinforcement learning. Nature. 518, 7540 (2015), 529.
[23] Moore, S. and Stamper, J. 2019. Decision Support for an Adversarial Game Environment Using Automatic Hint Generation. International Conference on Intelligent Tutoring Systems (2019), 82–88.
[24] Rafferty, A.N., Brunskill, E., Griffiths, T.L. and Shafto, P. 2011. Faster teaching by POMDP planning. International Conference on Artificial Intelligence in Education (2011), 280–287.
[25] Stamper, J. and Barnes, T. 2009. Unsupervised MDP Value Selection for Automating ITS Capabilities. International Working Group on Educational Data Mining. (2009).
[26] Stamper, J., Barnes, T. and Croy, M. 2011. Enhancing the automatic generation of hints with expert seeding. International Journal of Artificial Intelligence in Education. 21, 1–2 (2011), 153–167.
[27] Stamper, J. and Moore, S. 2019. Exploring Teachable Humans and Teachable Agents: Human Strategies versus Agent Policies and the Basis of Expertise. International Conference on Artificial Intelligence in Education (2019).
[28] Such, F.P., Madhavan, V., Liu, R., Wang, R., Castro, P.S., Li, Y., Schubert, L., Bellemare, M., Clune, J. and Lehman, J. 2018. An Atari model zoo for analyzing, visualizing, and comparing deep reinforcement learning agents. arXiv preprint arXiv:1812.07069. (2018).
[29] Sutton, R.S. and Barto, A.G. 2018. Reinforcement Learning: An Introduction. MIT Press.
[30] Van Der Maaten, L. 2014. Accelerating t-SNE using tree-based algorithms. The Journal of Machine Learning Research. 15, 1 (2014), 3221–3245.
[31] Verma, A., Murali, V., Singh, R., Kohli, P. and Chaudhuri, S. 2018. Programmatically interpretable reinforcement learning. arXiv preprint arXiv:1804.02477. (2018).
[32] Wallner, G. Play-Graph: A Methodology and Visualization Approach for the Analysis of Gameplay Data. 8.
[33] Wallner, G. and Kriglstein, S. 2012. A Spatiotemporal Visualization Approach for the Analysis of Gameplay Data. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (New York, NY, USA, 2012), 1115–1124.
[34] Zahavy, T., Ben-Zrihem, N. and Mannor, S. 2016. Graying the black box: Understanding DQNs. International Conference on Machine Learning (2016), 1899–1908.