Exploring Expertise through Visualizing Agent Policies and Human Strategies in Open-Ended Games

Steven Moore
Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh, PA 15213
StevenJamesMoore@gmail.com

John Stamper
Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh, PA 15213
jstamper@cs.cmu.edu

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
In this research, we explore the problem solving strategies of both humans and AI agents in the open-ended domain of video games. We utilize data collected from several human-level performing AI agents, which follow a given policy, and data from expert human players, who follow a set of strategies, for two Atari 2600 console games. We compare both types of data streams using a visualization technique to gain insights about how each player type, AI or expert human, goes about solving the given games. Analyzing the action sequences of the two, we demonstrate how closely the agent policies resemble the real-world problem solving of a human player, and explore how we might extract human-level strategies from agent policies. We reflect on the benefits of using data from both AI agents and expert humans to instruct learners and model their behaviour, and on how strategies may be more apparent and easier to adopt from human play. Finally, we hypothesize the benefits of combining both types of data for learning these complex tasks within open-ended domains.

Keywords
expertise, strategy, gameplay agent, visualization, t-SNE, deep reinforcement learning

1. INTRODUCTION
The process of building expertise, especially in complex tasks, has been an area of study for some time in education [11]. Issues related to the difficulty of data collection and storage have been an impediment to the educational data mining (EDM) research community exploring many truly open-ended complex tasks. In this research, we are taking steps to better understand how to collect, analyze, and gain a better understanding of complex environments where human expertise in the form of strategies may be used. We have selected a classic video game system environment based on the Atari 2600 console called the Arcade Learning Environment (ALE) [2]. ALE has generated a large amount of interest in recent years in the broader artificial intelligence and machine learning communities as a test bed for game playing agents. While the majority of work with this environment is focused on building general game playing agents, we have found the environment provides a useful test bed for understanding how humans learn and apply strategies, which can also be compared to agents. To this end, we are currently trying to understand how agents that have met or exceeded human-level capabilities at these games encode strategies in their game policies, and how their strategies compare to expert human players.

In the development of these agents, it is the human encoding the strategy into the AI using their knowledge of the game. The majority of game-playing agents, however, make use of deep neural nets to develop their policies, which makes them black box and often difficult for a human to interpret. Recent work has looked at making policies developed this way programmatically interpretable, but much work remains before humans are able to clearly articulate what many of these agents have learned from their training [31]. It is debatable whether these deep reinforcement learning agents make use of explicit strategies as they execute their given policies. A recent approach uses saliency maps to highlight key decision regions for agents in ALE, and found that their agent for the Space Invaders videogame learned a sophisticated aiming strategy [12]. Another way to make policies less black box is to break the policy down into smaller subtasks, each comprised of a few actions, that feed back into the overall policy [20]. These techniques of breaking down policies into smaller interpretable strategies and visually representing the mechanisms of an agent's policy are steps toward having humans learn strategies from agents, without directly encoding any into the agent itself.

While previous work continues to reduce the amount of training data required to develop successful agents via self-learning, others look to use human games to seed agents. One such study found that, by training on human data, they could achieve scores comparable to state-of-the-art reinforcement learning techniques, and could even beat those scores using just the top 50% of their collected data for more complicated games, such as pinball [16]. Combining a method that not only trains agents on expert human data, but also encodes their strategies into the form of an evaluation function, has the potential to yield successful agents that require less computational time while performing at greater levels than comparable agents.
Data from stochastic and adversarial domains remains challenging to mine, interpret, and visualize in a way that improves the understandability of the data. Video game data collected in the ALE is representative of this challenging domain, while also being open ended. Data mining and visualization techniques applied to such data can readily be leveraged for more traditional educational domains, such as solving a stoichiometry problem or completing a task in a physics simulator. One technique to help visualize such game data, in a way that enables us to make comparisons, is t-distributed stochastic neighbor embedding (t-SNE) [19]. This is a technique used to visualize high-dimensional datasets, and it has previously been used a few times to visualize and interpret game data [22, 28]. By applying t-SNE for dimensionality reduction and visualization to the data, similar clusters detailing potential strategies and policy enactment may emerge. Visually representing the mechanisms of an agent's policy can provide a step towards having humans learn strategies from these agents, gaining their expertise.

In complex tasks humans generate strategies which can be applied in many different situations. Combinations of strategies that lead to optimal outcomes can lead to expertise in a domain, although there is still no consensus among researchers as to what makes a person an expert and how expertise is defined. In this research we explore the interactions of policies and strategies, then look at how both relate to expertise in the context of these two games. Our long term goal is to see how humans can help teach agents and agents can help teach humans in a continuous loop, hence the idea of "teachable humans and teachable agents." Specifically, the main contribution of this work is a start toward this goal with a novel comparison of agent policies, generated with two different state-of-the-art techniques on several complex game domains, and strategies generated from human players. We do this by visualizing both types of collected gameplay data, expert human and agent, using t-SNE diagrams of the state spaces as a means to compare the two. We believe this work can help lead to a better understanding of human strategies and expertise, while also contributing to data mining techniques which can further be used in the context of explainable AI for educational systems. The visualization and comparison techniques used can be extended to more traditional educational games, to gain a sense of any strategies being enacted. Additionally, it is beneficial to see if the agents are solving the game in a natural way, using a human-like strategy, as many similar systems are often designed to playtest such games and act as tutors to the users.

2. RELATED WORK
Expertise has been a subject at the crossroads of Psychology and Computer Science for some time. One of the first compiled works, The Nature of Expertise by Glaser et al. [11], explored a wide variety of domains from human typing to sports to ill-defined domains. A key insight from this work is that in the early development of AI systems, expertise was tightly related to the concept of encoding human strategies into machines, such as early work involving chess players and intelligent tutors [4]. As work continued, there seems to have been a drift from the Psychology field into architectures of cognition, defined by ACT-R [1] and Soar [17] as examples. Computer Science moved towards agents and policy creation, focusing early on reinforcement learning [29] and now on advanced techniques built on deep learning [18].

2.1 Human Expertise and Strategies
The question of what exactly defines someone as an expert is still open and has a lot to do with the particular domain being studied. Chase and Simon posited that it takes 10,000 hours of study to become an expert in chess [4]. That number has also been suggested as the rough number of hours needed to become an expert musician [9] and underpins a general theory of expertise [10], although largely due to Simon's chess work.

In the case of learning systems, we often define mastery using some form of knowledge tracing. These systems often set "mastery" as a probabilistic value that a learner knows a particular skill. The value of mastery varies across skills and domains, but often a learner with a value of 90% or 95% is assumed to have achieved mastery [7] (a minimal sketch of such a check is shown below).
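The following sketch illustrates one common way such a threshold is applied, using a Bayesian knowledge tracing update in the spirit of [7]. The slip, guess, and learn parameters, the 0.95 cutoff, and the response sequence are illustrative assumptions rather than values taken from the cited work.

# Minimal sketch of a Bayesian knowledge tracing (BKT) mastery check.
# All parameter values here are illustrative assumptions.

def bkt_update(p_know, correct, p_slip=0.1, p_guess=0.2, p_learn=0.15):
    """Update P(skill known) after observing one response."""
    if correct:
        evidence = p_know * (1 - p_slip)
        marginal = evidence + (1 - p_know) * p_guess
    else:
        evidence = p_know * p_slip
        marginal = evidence + (1 - p_know) * (1 - p_guess)
    posterior = evidence / marginal
    # Account for the chance the skill was learned at this opportunity.
    return posterior + (1 - posterior) * p_learn

p_know = 0.3  # prior estimate that the learner already knows the skill
for response in [True, True, False, True, True, True]:
    p_know = bkt_update(p_know, response)
print(f"P(known) = {p_know:.2f}, mastered: {p_know >= 0.95}")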
Beyond measurements of expertise, it is also important to qualitatively understand the strategies associated with expertise. Understanding the strategies that are used to solve problems has been explored in many domains. Tasks to elicit knowledge from experts, such as cognitive task analysis (CTA), have been used by cognitive scientists to better understand the strategies that experts use but may not explicitly recognize [6]. In a mathematics study on word problems where students were using cognitive math tutors, researchers noted three different strategies that students used to solve problems [8]. These strategies included (1) working backwards from the answer, or unwinding, (2) plugging in values in a hill-climbing fashion, and (3) using equations. With the correct structure of the problems, these strategies could be explicitly identified.

2.2 Agent Expertise and Policies
Artificial intelligence has been used for decades to create agents that mimic human behavior. These agents are generally driven by a policy created by some form of machine learning, such as reinforcement learning [29]. The policy tells the AI agent what to do given a certain set of conditions. This is most often defined as a state-action graph that suggests the best possible next action for an agent in a given state [27].

In education, agents driven by policies have long been a foundational part of data-driven intelligent tutors and adaptive learning. Work has been done on modeling learning as a partially observable Markov decision process (POMDP), and using the generated policy to predict what a student knows and what the next best instructional lesson is for a particular student [24]. Other research has used reinforcement learning with a focus on which pedagogical action would be best for a student when multiple actions are available [5]. Most closely associated with the research we are doing is work on the automatic generation of hints and feedback [25, 26]. This work uses state graphs and reinforcement learning to identify the best path for solving problems, and uses the state features of the next best state to generate a just-in-time hint, like the next optimal move in a game [23]. This type of feedback can lead the student down a better path for learning.
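As a concrete illustration of a policy as a state-action mapping, the following is a minimal sketch of a greedy policy over a toy state-action value table. The states, actions, and values are hypothetical and purely for illustration; they are not taken from the systems cited above.

# Minimal sketch: a policy as a greedy lookup over state-action values.
# Given a state, the policy suggests the next best action.

from typing import Dict, Tuple

QTable = Dict[Tuple[str, str], float]  # (state, action) -> estimated value

q_values: QTable = {
    ("one_column_left", "move_left"): 0.2,
    ("one_column_left", "move_right"): 0.4,
    ("one_column_left", "fire"): 0.9,
}

def greedy_policy(state: str, q: QTable) -> str:
    """Return the highest-valued action available in the given state."""
    actions = {a: v for (s, a), v in q.items() if s == state}
    return max(actions, key=actions.get)

# A hint generator in the style of [23, 25] could surface this action,
# or the state it leads to, as a just-in-time suggestion.
print(greedy_policy("one_column_left", q_values))  # -> "fire"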
2.3 Comparing Human & Agent Gameplay
Visualizations of gameplay data are widely popular, often being used by players to compare their performance against others and to make sense of how they played the game [32]. For instance, heat maps have been used by players to refine gameplay strategies, providing insights into popular areas of a game's environment [15]. In a similar vein, saliency maps have been applied to gameplay from agents, acting as heat maps for the activations in their neural nets [12]. From such visualizations, it became clear that the agent was enacting a form of a strategy around aiming, as a human player would do. Another use of saliency maps, combined with t-SNEs, looked to describe the policies agents were using [34]. This was done not only to make the agents less black-box and more understandable, but also to see if they followed any set strategies.

Many games are making use of such game-playing agents and procedurally generated content methods both to develop the game environment and to play-test the games [13]. Much like the agents developed for the ALE, these game-playing agents play a game in order to find any bugs or areas of improvement. With such large amounts of data coming from even the simplest games, many tools have been developed to assist in the visualization and analysis process [33]. Using visualizations is one way to gain insights into any human-like strategies being enacted by such agents. This is important, as an agent might not be of much use if it plays the game, but not in a way that a human user does. In open-ended games with a massive state space, mimicking human play as closely as possible helps to provide the most accurate data and bug testing from the agent.

3. METHOD
3.1 Environment & Games
The Arcade Learning Environment (ALE) provides a framework consisting of over fifty Atari 2600 games that can be used to evaluate competency in deep reinforcement learning (DRL) agents and other types of AI [2]. Despite having a limited amount of input, a fire button and four directional controls, many of the games consist of complex tasks in open-ended worlds, making them a fitting testbed for DRL agents. Using the ALE, we focused on gameplay from two distinct games for the Atari 2600. The first game is Space Invaders, which is one of the simpler games for the system, consisting of just four non-combinational inputs. The second game is Seaquest, which incorporates all input combinations available for the Atari 2600, making it a much more complex and challenging game for both humans and agents.
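For readers unfamiliar with the ALE interface, the following is a minimal sketch of stepping a game and logging the 128-byte console RAM state at each frame, assuming the ale-py Python bindings and locally available ROM files. The ROM path and the random stand-in policy are placeholders rather than our actual setup; querying the minimal action set also makes the input difference between the two games visible.

# Sketch: step an ALE game and log the RAM state per frame (assumes ale-py).
import random
import numpy as np
from ale_py import ALEInterface

def record_episode(rom_path, max_frames=2500):
    ale = ALEInterface()
    ale.loadROM(rom_path)
    actions = ale.getMinimalActionSet()  # few actions for Space Invaders, many for Seaquest
    ram_trace, score = [], 0
    for _ in range(max_frames):
        if ale.game_over():
            break
        score += ale.act(random.choice(actions))  # stand-in for an agent's or human's input
        ram_trace.append(ale.getRAM().copy())     # one 128-byte row per frame
    return np.array(ram_trace), score

# Hypothetical ROM path; one trace per recorded game play.
ram, score = record_episode("roms/space_invaders.bin")
print(ram.shape, score)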
3.1.1 Space Invaders
In the game Space Invaders, depicted in Figure 1, the player or agent controls a ship at the bottom of the screen that can navigate along a single dimension, left or right. The goal of the game is to destroy all the enemy units above the player's ship, gaining points for each enemy destroyed, while also avoiding any projectiles from them. If the player is struck by an enemy projectile they lose one of their three lives. To destroy these enemy units, the player's ship can fire a projectile that travels directly up, damaging or destroying an enemy unit on contact. Additionally, the player can hide behind three objects at the bottom of the screen to avoid the enemy fire. The only valid controls for this game are left and right, to move the player's ship, and the fire button to shoot.

Figure 1. In Space Invaders the player controls the green ship at the bottom of the screen and must shoot the invaders that proceed left, then down, then right.

3.1.2 Seaquest
In the game Seaquest, depicted in Figure 2, the ultimate goal is to retrieve as many scuba divers from under the water as possible. The player or agent controls a submarine that can navigate in all directions around the screen and faces in the direction of movement, either right or left. The submarine has an oxygen tank gauge that slowly diminishes over time, and the player must surface at the top of the screen to refill it. As they navigate around the screen collecting the divers, they also must dodge enemy ships and sharks that move across the map. If their submarine collides with an enemy unit or the oxygen gauge reaches zero, they lose one of their three lives. To combat these enemies, they are able to shoot a projectile from the front of the submarine, which damages or destroys these enemy units. Killing an enemy results in a point increase, but the main increase in points comes from saving the divers. In order to receive points for the collected divers, the submarine must surface by navigating to the very top of the map. All valid button combinations for the Atari 2600 controller work for this game, such as up-left-fire, right-fire, and down.

Figure 2. In Seaquest the player controls the yellow submarine, collects the scuba divers, and shoots or avoids the enemies.

3.2 Agent Dataset
As the ALE provides a framework for testing DRL agents, we selected two higher performing agents implemented in the environment using value-based DRL algorithms. The first agent utilizes a Deep Q-Network (DQN) and has achieved a level comparable to a human professional in almost fifty games, including the two we investigate [21]. Our second is an agent known as Rainbow, which is built upon a DQN variant and has achieved even greater scores across the same Atari 2600 games [14]. We selected the DQN agent as it is often cited as a baseline for this domain. The Rainbow agent was selected for its high scoring performance, while still mimicking human play when observed. For instance, Rainbow will move the player avatar about the screen in Seaquest, rather than stay at the very bottom of the screen to avoid enemies, as some agents do. The data for both of these agents come from the benchmarks used in the Atari Zoo, an open-source set of trained models for six major DRL algorithms at varying benchmarks, collected from the ALE [28]. Other DRL algorithm agents implemented in the Atari Zoo, such as A2C, perform at lower levels than DQN and Rainbow while not mimicking human gameplay [22]. For this reason, we did not select those agents, as we wanted high performing ones for both games.
Table 1 shows the maximum score achieved per Space Invaders game for the Rainbow agent, the DQN agent, and the expert human. Cells with multiple scores indicate the agent or human lost all lives during that session and restarted play within the limited number of frames recorded. Thus, a single score indicates the agent or human did not lose all of their lives during the recorded play. Table 2 shows similar information, but for the game plays from Seaquest. In particular for the DQN agent, as shown in the second game play, the five low scores indicate the agent lost all lives and had to restart play five times in the allotted steps.

Table 1. The highest score(s) achieved for the two agents and the expert human in the collected Space Invaders data over three different game plays.

Space Invaders Game    DQN          Rainbow      Human
1                      2380         1805, 990    1685
2                      1495, 600    3750         1745
3                      1345, 830    3845         1845

Table 2. The highest score(s) achieved for the two agents and the expert human in the collected Seaquest data over three different game plays.

Seaquest Game    DQN                    Rainbow    Human
1                800, 1400              4960       12590
2                60, 60, 60, 60, 100    5020       14220
3                3900, 500              7840       16880

3.3 Expert Human Dataset
Using the ALE, we collected expert human data for both Space Invaders and Seaquest. To collect the human game play data, we modified the ALE code to record the RAM state at each frame of gameplay, so that it could be compared to the agent data from the Atari Zoo. Using Atari 2600 data collected from the Atari Grand Challenge project as a baseline for Space Invaders, our collected expert human data ranks in the top 1% based on scores [16]. We were unable to use the data collected by the Atari Grand Challenge directly, as we needed the RAM states in order to visualize the data in the Atari Zoo.

3.4 Visualizing
A popular technique for dimensionality reduction and visualization of high-dimensional data, used with large reinforcement learning datasets, is t-SNE [22]. It provides a way to plot the data, from both agents and humans, along varying dimensions, clustering related frames together. Our data, for both agents and humans, consisted of the Atari RAM representation, which is the same across agent algorithms and runs, but distinct between the games. Traditionally, t-SNE embeddings are used for a single high-level representation of an agent. However, since our datasets all use the Atari RAM representation, this enables us to make comparisons between different runs of an agent for the same algorithm and runs from different DRL algorithms. As these datasets were quite large for both games, we pre-processed them using Principal Component Analysis (PCA) to a dimensionality of 50, then followed that with 300 t-SNE iterations with a perplexity of 30 [30]. Note that t-SNE positions the points on a plane such that the pairwise distances between them minimize a certain criterion. As a result, the axes cannot be labeled with a specific unit, due to the high dimensional nature of the data.

Utilizing the code provided with the Atari Zoo [28], we are then able to visualize the processed agent and human data in a t-SNE embedding with associated screenshots. Each point in the resulting t-SNE embeddings represents a separate frame from the agent or human. Points are colored corresponding to their source, and transparency is used to indicate score, with a darker color indicating a higher score. The clustering of the points helps to indicate the distribution of states, corresponding to behaviour, that the agent or human visited. Additionally, a point can be clicked on to view a screenshot of the game at that frame. This provides another means of analyzing the agent-collected data, in addition to providing a means of comparison to our collected human data.
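For reference, a roughly equivalent pipeline can be sketched with scikit-learn and matplotlib instead of the Atari Zoo code. The PCA-to-50 reduction and the perplexity-30, 300-iteration t-SNE settings mirror those described above, while the input arrays, source labels, and color mapping are placeholders (and the exact t-SNE iteration parameter name varies across scikit-learn versions).

# Sketch: PCA to 50 dims, then 2-D t-SNE, colored by source with
# transparency scaled by score (darker = higher score).
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import to_rgba
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def embed_and_plot(frames_by_source, colors={"rainbow": "green", "dqn": "blue", "human": "red"}):
    """frames_by_source: {source: (ram_frames [n, 128], per-frame scores [n])}."""
    labels, frames, scores = [], [], []
    for source, (ram, s) in frames_by_source.items():
        labels += [source] * len(ram)
        frames.append(ram)
        scores.append(s)
    X = np.vstack(frames).astype(float)
    labels, scores = np.array(labels), np.concatenate(scores)

    X50 = PCA(n_components=50).fit_transform(X)
    emb = TSNE(n_components=2, perplexity=30, n_iter=300).fit_transform(X50)

    alphas = 0.2 + 0.8 * scores / scores.max()
    for source, color in colors.items():
        mask = labels == source
        rgba = [to_rgba(color, a) for a in alphas[mask]]
        plt.scatter(emb[mask, 0], emb[mask, 1], color=rgba, s=6, label=source)
    plt.legend()
    plt.xticks([]); plt.yticks([])  # the axes carry no meaningful unit
    plt.show()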
4. RESULTS
4.1 Space Invaders
Plotting the DQN, Rainbow, and expert human data from Space Invaders via t-SNE, we can see both similarities and differences in the clustering. Figure 3 depicts a t-SNE embedding of nine Space Invaders games in total, three from each agent and three from the expert human. The agent data, green depicting Rainbow and blue depicting DQN, overlaps more throughout the graph than the human data points, represented by red. A majority of the human data clusterings are on the bottom half of the t-SNE, where there appears to be only a single Rainbow and DQN cluster. There is an equal separation of high scoring points, depicted by darker shades of each color, for all three parties. High scoring human points of dark red are scattered about, while the dark blue DQN data is grouped toward the upper center. Above that is the dark green Rainbow data, grouped between the 100 and 150 marks of the y-axis. Ultimately, while there is similar clustering of the Rainbow and DQN agents across all three games, this does not hold when the game is coming to an end and a higher score has been achieved. Additionally, regardless of the game's score, the human data does not seem to have much overlap with either agent.

Figure 3. Two-dimensional t-SNE embedding of Space Invaders gameplay collected from three games using the Rainbow agent, depicted in green, three games from the DQN agent, depicted in blue, and three games from the expert human, depicted in red, for nine games in total.

To further identify any interesting clustering of the points, we selected a single game play from each of the two agents and the human data, so the t-SNE would show one from each for a total of three, instead of the aforementioned nine. The resulting t-SNE is shown in Figure 4, along with screenshots that are representative of the major clusters. We included screenshots for six clusters, two from each source, that are darker in color, corresponding to a higher score and being further along in the game. Since this depicts a later point in the game, any key moves or strategies are more visible, since they have had time to be enacted. With a single game for each agent or human depicted, the representative clusters stand out even more.

Figure 4. A t-SNE for a single game of Space Invaders from the green Rainbow agent, blue DQN agent, and red expert human, with screenshots depicting the largest and highest scoring clusters.
4.2 Seaquest
Following the same steps as for the Space Invaders data, we plotted the collected Seaquest gameplay data via t-SNE. Figure 5 depicts the t-SNE embedding of nine Seaquest games in total, three from each agent and three collected from the expert human. Similar to the Space Invaders t-SNE embedding, the Rainbow and DQN agents, represented by green and blue respectively, overlap more with one another than with the expert human data, represented by the red points. However, all three types of points are much less clustered into groups and more spread out throughout the given range, indicating a greater variance of game states between the two agents and the expert human. One notable clustering resulting from all nine games is a grouping in the center, where the Rainbow agent, shown in green, almost perfectly overlaps the DQN agent, shown in blue. For this cluster, the points for both agents are also darker, indicating they are higher scoring states that occur later during the game play.

Figure 5. Two-dimensional t-SNE embedding of Seaquest gameplay collected from three games using the Rainbow agent, depicted in green, three games from the DQN agent, depicted in blue, and three games from the expert human, depicted in red, for nine games in total.

As the resulting data appears to be fairly scattered for all nine games of Seaquest, we selected just the third play through for both agents and the human and displayed it via t-SNE, shown in Figure 6. With just a single game from each source displayed, several clusterings became more apparent. The Rainbow agent has three distinct clusters, two of which overlap with the DQN agent. Screenshots from these two clusters depict the player unit, the yellow submarine, towards the center of the screen with no enemies around. The representative screenshots depicting the expert human data, via the red points, show the ship positioned away from the center and with more enemy units about. This suggests a potential difference in gameplay between the agents and the human, which we elaborate on in the following discussion.

Figure 6. Representative screenshots for the various clusters about the t-SNE embedding for the third play of Seaquest from each agent and human dataset. Green points represent the Rainbow agent, blue points the DQN agent, and red points the human.

5. DISCUSSION
Plotting the game data via t-SNE provides a concise visualization of such high-dimensional data. However, the plots are only beneficial to us if their clusterings detail patterns that might be indicative of a strategy or interesting behaviour. One immediate clustering that caught our eye was in the t-SNE depicting a single game of Space Invaders for the two agents and the human. As Figure 7 shows, there is a clear dark red cluster of expert human data toward the center, indicating there are many similar game states here, and ones with a comparatively high game score. Examining the points on this curving cluster, we noticed the screenshots representing the game states at the time had a clear similarity. The states in this cluster were for when a single enemy ship was left on the map, the point right before the player can advance to the next stage. It became clear that the human player had difficulty hitting the last few enemies, as they move fast and require precise aiming when there are not many left. A nearby cluster from the Rainbow agent, represented in green and also highlighted in Figure 7, depicts a similar set of states. However, there are not as many points for the agent in this set of states as there are for the human, suggesting the agent can more accurately hit the fast moving last remaining enemies.

While this is not a particular strategy, it does provide insights into similar difficulties both agent and human have in the game. It also aligns with the maximum scores both agent and human achieved, as the agent spent less time on this phase and could advance through the game more rapidly, achieving a higher score in the allotted time, which one might equate to expertise in this domain. It is a case where the Rainbow agent reflects a difficulty also encountered by the human. If this were in the context of an educational game, we could use the agent's data to gain insights into where a hint or other feedback might be most useful, as it is a clear point of difficulty. Additionally, the DQN agent did not demonstrate such difficulty. If just the DQN agent's data were used for such playtesting, this area of struggle may have been missed altogether. This insight was provided through a brief visual inspection enabled via t-SNE, and may not be as readily apparent from parsing log data.

Figure 7. A long clustering of red points, representing human data, with screenshots from the top and bottom, showing a pattern of the player attempting to destroy the last remaining enemies. The nearby green points, for the Rainbow agent, represent a similar clustering as the agent finishes the final enemy.
Another analysis of the same t-SNE plot for a game of Space Invaders provided insights into a distinct shooting strategy that the two agents and the human each had. When we inspected clusters and points that represented the game at a halfway point, where half the enemy units on the screen were destroyed and the other half alive, we noticed an interesting pattern in the configuration of the remaining enemies. As Figure 8 shows, the Rainbow and DQN agents target enemies either horizontally across the bottom or in a diagonal pattern. However, the expert human destroys the enemy units starting from the left column and working right. While each of these represents a different shooting strategy for the given player, the expert human's strategy is debatably the most optimal. In Space Invaders, the enemy units move across the screen horizontally, and once they reach the edge of the screen they move down a single row and continue moving in the opposite horizontal direction. This means that if there are few enemy columns remaining, it takes the enemies a greater amount of time to traverse the screen horizontally, allowing the player more time to fire at them.

Figure 8. The DQN and Rainbow agent cluster towards the bottom shows that halfway through the game, they keep the enemies clustered. This contrasts with the human data, where instead of shooting across the bottom or diagonally, the player starts from the leftmost column and works right.

There are trade-offs to this strategy though: if the bottom row of the enemy units comes into contact with the ground, the game is over. This may be the reason why the deep reinforcement learning trained agents shoot in a horizontal or diagonal pattern, so that they keep the bottom row higher up and avoid the game over condition, something they must have encountered quite often during their early training phases. However, this does not translate to an optimal strategy, as the expert human data reveals. Teaching a player the game using such agents could lead to the adoption of this firing strategy, which would be suboptimal compared to that of the human. Such a case could also readily apply to more educational game contexts, as an agent or tutor that learned to play or solve the problem may be doing so in a non-optimal way compared to that of an expert human. Even though the "score" for a given game is greater, ultimately learning the better strategy would have a greater payoff in the long run.
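The timing argument behind the human's column-clearing strategy can be sketched with a back-of-the-envelope calculation. The screen width, column spacing, and formation speed below are hypothetical round numbers, not values measured from the actual game.

# Hypothetical numbers, only to illustrate the traversal-time argument.
SCREEN_WIDTH = 160     # width of the play area, in pixels
COLUMN_SPACING = 16    # horizontal spacing between adjacent enemy columns
SPEED = 2              # pixels the formation shifts per step

def steps_before_next_drop(columns_remaining: int) -> int:
    # Steps the formation takes to cross the free horizontal space
    # before it reaches an edge and drops down a row.
    formation_width = (columns_remaining - 1) * COLUMN_SPACING
    return (SCREEN_WIDTH - formation_width) // SPEED

for cols in (6, 3, 1):
    print(cols, "columns ->", steps_before_next_drop(cols), "steps until the formation descends")
# Fewer remaining columns leave more free horizontal travel, so the
# formation descends less often and the player gains time to aim.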
Examining the t-SNE for a game of Seaquest also revealed insights into the differing navigational strategies used by agent and human. For this t-SNE plot, there was less clustering compared to the Space Invaders ones. However, as we investigated the different points and viewed the screenshots for representative states, a pattern with the player-controlled submarine emerged. For both the DQN and Rainbow agents, the submarine remained towards the bottom of the map and stayed centered on the y-axis, unless it was briefly moving to rescue a diver. However, the human points showed the submarine in a variety of positions that were far from the center or bottom, even without the presence of these scuba divers. As Figure 9 depicts, the human made use of more free-moving navigational behavior, traversing the entire map and getting towards the edges to allow themselves more time to position and fire at enemies. The agents, who presumably had better accuracy from their extensive training, could remain towards the center and only leave the bottom when they had to move upwards to fire at an enemy or surface for oxygen.

Figure 9. The red human data shows that the human moves about the screen more compared to the agents, depicted in blue and green, who most often end up in the center of the y-axis, particularly towards the bottom.

Similarly to the Space Invaders strategy of creating different enemy configurations, the agents for Seaquest had their own strategy that differed from the expert human gameplay. In this game's context, there is not necessarily a clear benefit of one navigational strategy over the other. However, the one used by the agents might be better suited for a player who has better aim and does not need to get their avatar close to the enemy units. If a human user learned from the actions of these agents, they might not move around the map as the expert human gameplay did. While not necessarily impacting the score, it could impact their enjoyment of the game, as they will be making fewer movements and have less control over their avatar, compared to treating it as essentially fixed along the y-axis. This is another consideration when using just an agent to playtest or learn from: even when it might not impact performance, other factors like enjoyment might be affected by the enactment of certain agent-performed strategies.

This research represents our initial exploratory work into understanding expertise in complex tasks in open ended domains using a combination of human and artificial intelligence agents. We have begun by plotting expert human and human-level agent data using t-SNEs to provide a way for us to visualize the data. We can see from the plotted t-SNEs that expert human data does have some overlap with data from high performing DRL agents; however, gaps exist where human clusters are far away from the agent data. Nevertheless, strategies from both human and agent data emerge in the visualizations and allow for some interesting comparisons between the two. There are clear implications of using just an agent's gameplay, as the enacted strategies may be optimal, but limiting to a user's play. The agents also might demonstrate a clear strategy, such as the firing configuration in Space Invaders, yet such a strategy could actually be sub-optimal for a human to enact.

While these two games are not traditional educational ones, the implications of the techniques used and insights gained are still applicable to games in such a context. Eliciting strategies, regardless of whether they come from an AI system or a human, is challenging, and such visualizations provide one way to search for and understand them. At present, agents that use similar mechanisms and reinforcement learning methods to solve problems and then instruct students [3] could benefit from the use of t-SNE visualization of the collected data. Their developers want to ensure the strategies and suggested instruction are optimal, while remaining natural, as a human would act. A suggested strategy is not useful if a human cannot enact it, for instance because the agent had different control during the training process, such as access to frame-by-frame data in the game, giving it greater accuracy.

6. CONCLUSION & FUTURE WORK
Our primary goal in this work is to explore expertise, in this case in the context of games. In such games, prior work often uses the score as a measure of how expert a player, either human or agent, is at the game. We believe that, in addition to the score, the strategies used to solve the game impact how expertise in this domain can be quantified. To gain insights into such strategies, we visualized the gameplay data of a high scoring and long time playing human, deemed an expert, and of high scoring agents via t-SNE. Analysis of the resulting t-SNEs yielded insights into both shared and differing strategies the two parties had.
Even between agents, there existed similar and dissimilar strategies, in addition to their score variance. Taking into account these gameplay differences, and how realistic an enacted strategy might be for a human to learn from or mimic, is important for game and tutor developers to keep in mind when using agents as playtesters or instructors. A strategy might seem beneficial, yet compared to a different one it may not be as optimal nor practical for a player or learner to utilize in their own gameplay.

As we continue this work, we want to extend it to more games other than Space Invaders and Seaquest, particularly ones in the educational space that also have accompanying agent-collected data. Further inspection remains to be done to draw more strategies from the accompanying visualizations. Following this, we will look further into how they cluster, indicating the performance of similar strategies based on their policies. One key area we plan to explore is adding a temporal aspect to the t-SNE graphs. Although not represented in our current visualizations, we do have the screenshots numbered temporally, so we expect that we can connect the paths to show the progression of game play. Additionally, visualizing novice human data, in addition to the expert and agent data, could provide useful strategy comparisons. This could help developers of educational games find where their novice learners seem to struggle the most, from a visual standpoint.

7. ACKNOWLEDGMENTS
The research reported here was supported in part by a training grant from the Institute of Education Sciences (R305B150008). Opinions expressed do not represent the views of the U.S. Department of Education.

8. REFERENCES
[1] Anderson, J.R., Matessa, M. and Lebiere, C. 1997. ACT-R: A theory of higher level cognition and its relation to visual attention. Human-Computer Interaction. 12, 4 (1997), 439–462.
[2] Bellemare, M.G., Naddaf, Y., Veness, J. and Bowling, M. 2013. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research. 47, (2013), 253–279.
[3] Chaplot, D.S., MacLellan, C., Salakhutdinov, R. and Koedinger, K. 2018. Learning Cognitive Models Using Neural Networks. International Conference on Artificial Intelligence in Education (2018), 43–56.
[4] Chase, W.G. and Simon, H.A. 1973. Perception in chess. Cognitive Psychology. 4, 1 (1973), 55–81.
[5] Chi, M., VanLehn, K., Litman, D. and Jordan, P. 2011. An evaluation of pedagogical tutorial tactics for a natural language tutoring system: A reinforcement learning approach. International Journal of Artificial Intelligence in Education. 21, 1–2 (2011), 83–113.
[6] Clark, R.E. and Estes, F. 1996. Cognitive task analysis for training. International Journal of Educational Research. 25, 5 (1996), 403–417.
[7] Corbett, A.T. and Anderson, J.R. 1994. Knowledge tracing: Modeling the acquisition of procedural knowledge. User Modeling and User-Adapted Interaction. 4, 4 (1994), 253–278.
[8] Croteau, E.A., Heffernan, N.T. and Koedinger, K.R. 2004. Why are algebra word problems difficult? Using tutorial log files and the power law of learning to select the best fitting cognitive model. International Conference on Intelligent Tutoring Systems (2004), 240–250.
[9] Ericsson, K.A., Prietula, M.J. and Cokely, E.T. 2007. The making of an expert. Harvard Business Review. 85, 7/8 (2007), 114.
[10] Ericsson, K.A. and Smith, J. 1991. Toward a General Theory of Expertise: Prospects and Limits. Cambridge University Press.
[11] Glaser, R., Chi, M.T. and Farr, M.J. 1985. The Nature of Expertise. National Center for Research in Vocational Education, Columbus, OH.
[12] Greydanus, S., Koul, A., Dodge, J. and Fern, A. 2017. Visualizing and Understanding Atari Agents. arXiv:1711.00138 [cs]. (Oct. 2017).
[13] Guckelsberger, C., Salge, C., Gow, J. and Cairns, P. 2017. Predicting Player Experience Without the Player: An Exploratory Study. Proceedings of the Annual Symposium on Computer-Human Interaction in Play (New York, NY, USA, 2017), 305–315.
[14] Hessel, M., Modayil, J., Van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M. and Silver, D. 2018. Rainbow: Combining improvements in deep reinforcement learning. Thirty-Second AAAI Conference on Artificial Intelligence (2018).
[15] Kriglstein, S., Wallner, G. and Pohl, M. 2014. A User Study of Different Gameplay Visualizations. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (New York, NY, USA, 2014), 361–370.
[16] Kurin, V., Nowozin, S., Hofmann, K., Beyer, L. and Leibe, B. 2017. The Atari grand challenge dataset. arXiv preprint arXiv:1705.10998. (2017).
[17] Laird, J.E., Newell, A. and Rosenbloom, P.S. 1987. Soar: An architecture for general intelligence. Artificial Intelligence. 33, 1 (1987), 1–64.
[18] LeCun, Y., Bengio, Y. and Hinton, G. 2015. Deep learning. Nature. 521, 7553 (2015), 436.
[19] Van der Maaten, L.J.P. and Hinton, G.E. 2008. Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research. 9, (2008), 2579–2605.
[20] Lyu, D., Yang, F., Liu, B. and Gustafson, S. 2018. SDRL: Interpretable and Data-efficient Deep Reinforcement Learning Leveraging Symbolic Planning. arXiv:1811.00090 [cs]. (Oct. 2018).
[21] Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S. and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems (2013), 3111–3119.
[22] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K. and Ostrovski, G. 2015. Human-level control through deep reinforcement learning. Nature. 518, 7540 (2015), 529.
[23] Moore, S. and Stamper, J. 2019. Decision Support for an Adversarial Game Environment Using Automatic Hint Generation. International Conference on Intelligent Tutoring Systems (2019), 82–88.
[24] Rafferty, A.N., Brunskill, E., Griffiths, T.L. and Shafto, P. 2011. Faster teaching by POMDP planning. International Conference on Artificial Intelligence in Education (2011), 280–287.
[25] Stamper, J. and Barnes, T. 2009. Unsupervised MDP Value Selection for Automating ITS Capabilities. International Working Group on Educational Data Mining. (2009).
[26] Stamper, J., Barnes, T. and Croy, M. 2011. Enhancing the automatic generation of hints with expert seeding. International Journal of Artificial Intelligence in Education. 21, 1–2 (2011), 153–167.
[27] Stamper, J. and Moore, S. 2019. Exploring Teachable Humans and Teachable Agents: Human Strategies versus Agent Policies and the Basis of Expertise. International Conference on Artificial Intelligence in Education (2019).
[28] Such, F.P., Madhavan, V., Liu, R., Wang, R., Castro, P.S., Li, Y., Schubert, L., Bellemare, M., Clune, J. and Lehman, J. 2018. An Atari model zoo for analyzing, visualizing, and comparing deep reinforcement learning agents. arXiv preprint arXiv:1812.07069. (2018).
[29] Sutton, R.S. and Barto, A.G. 2018. Reinforcement Learning: An Introduction. MIT Press.
[30] Van Der Maaten, L. 2014. Accelerating t-SNE using tree-based algorithms. The Journal of Machine Learning Research. 15, 1 (2014), 3221–3245.
[31] Verma, A., Murali, V., Singh, R., Kohli, P. and Chaudhuri, S. 2018. Programmatically interpretable reinforcement learning. arXiv preprint arXiv:1804.02477. (2018).
[32] Wallner, G. Play-Graph: A Methodology and Visualization Approach for the Analysis of Gameplay Data. 8.
[33] Wallner, G. and Kriglstein, S. 2012. A Spatiotemporal Visualization Approach for the Analysis of Gameplay Data. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (New York, NY, USA, 2012), 1115–1124.
[34] Zahavy, T., Ben-Zrihem, N. and Mannor, S. 2016. Graying the black box: Understanding DQNs. International Conference on Machine Learning (2016), 1899–1908.