Taking the Scenic Route: Automatic Exploration for Videogames

Zeping Zhan, Batu Aytemiz, Adam M. Smith
Design Reasoning Lab
University of California, Santa Cruz
{zzha50, baytemiz, amsmith}@ucsc.edu

Abstract

Machine playtesting tools and game moment search engines require exposure to the diversity of a game's state space if they are to report on or index the most interesting moments of possible play. Meanwhile, mobile app distribution services would like to quickly determine if a freshly-uploaded game is fit to be published. Having access to a semantic map of reachable states in the game would enable efficient inference in these applications. However, human gameplay data is expensive to acquire relative to the coverage of a game that it provides. We show that off-the-shelf automatic exploration strategies can explore with an effectiveness comparable to human gameplay on the same timescale. We contribute generic methods for quantifying exploration quality as a function of time and demonstrate our metrics on several elementary techniques and human players on a collection of commercial games sampled from multiple game platforms (from Atari 2600 to Nintendo 64). Emphasizing the diversity of states reached and the semantic map extracted, this work makes productive contrast with the focus on finding a behavior policy or optimizing game score used in most automatic game playing research.

To know what's inside of a videogame, you need to do more than read source code or browse asset data. Looking over a player's shoulder can give you one view of a game's space of possible interactions, and running an artificial intelligence (AI) algorithm to find score-optimizing behavior can give you another. Neither, however, is trying to play in a way that intentionally covers the most ground.

Access to a large and diverse sample of trajectories through a game's state space could enable many applications relevant to game designers, players, scholars, educators, and distributors. We might automatically map reachable zones of a game's world (Osborn, Summerville, and Mateas 2017b; Bauer and Popovic 2012) and visualize how these change in response to design modifications. We might index the moments demonstrated and return them in response to visual queries (Zhang et al. 2018), enabling users of a search engine to bookmark and share references to significant events (Kaltman et al. 2017). We might even identify when a game freshly uploaded to an app store should be flagged for removal because it behaves maliciously only after several minutes of interaction (Shen, Chien, and Hung 2014).

Meanwhile, in automated gameplay via reinforcement learning (RL), the ability to explore is critical for efficient learning. Aytar et al. (2018) demonstrate how even a single externally-provided trace of high-reward behavior can be generalized into a reusable behavior policy. We believe that algorithms explicitly designed to address the exploration problem (finding sequences of actions that reach interesting moments by any means necessary) could complement advances in reinforcement learning.

We introduce the problem of automatically exploring a game's state space with the goal of producing a semantic map (useful for downstream inference tasks) on timescales comparable to human playtesting efforts. This map should allow us to judge the similarity of moments and allow inferences about how moments relate to one another. Automatic exploration is related to the topic of intrinsically-motivated play in reinforcement learning (Bellemare et al. 2016). However, we are interested in the large collection of moments found via exploration rather than training a behavior policy. In particular, we are most interested in uncovering automatic exploration techniques that are plausibly applicable to analysis of contemporary mobile games (native-compiled software binaries for 64-bit computing platforms which draw user interfaces with low-level code) on short timescales (minutes rather than days of in-game experience).

In this paper, we make the following contributions:
• We define several exploration quality metrics, some of which make use of game-specific knowledge and others that are comparable across games and platforms.
• We offer a methodology for comparing human and machine exploration efficiency as a function of time.
• We identify several elementary automatic exploration strategies and demonstrate them on a range of platforms.
• Our experimental results highlight basic gameplay competency for elementary methods, the ability of our metrics to identify distinct scenes, the ability to bootstrap perceptual models with self-play, and the strength variation for exploration techniques across games and across game platforms spanning from Atari 2600 to Nintendo 64.
Background

This section reviews existing techniques for gameplay knowledge extraction, automated testing of mobile apps, and existing techniques in AI for extrinsically and intrinsically motivated automatic play.

Gameplay Knowledge Extraction

Our project is situated within the larger effort of automated game design learning (AGDL) (Osborn, Summerville, and Mateas 2017a), a research paradigm that encourages understanding games by directly interacting with them (rather than, e.g., source code analysis). In particular, our work is a direct response to a previous paper in the Knowledge Extraction from Games workshop on representing and retrieving game states with moment vectors (Zhan and Smith 2018). The automatic exploration strategies we offer in this paper are a source of gameplay trajectories (screenshots, actions, and memory snapshots) that complements the use of human gameplay data in that work.

Moment vectors for interactive media are a sub-symbolic knowledge representation similar to that of word vectors in natural language processing (Mikolov, Yih, and Zweig 2013). Although word vectors can be of direct use in text retrieval applications, they can easily support more complex tasks that rely on the ability to make inferences from a text's semantic content (Joulin et al. 2016).

Zhang et al. (2018) describe a visual search engine for game moments that is based on moment vectors derived from various gameplay sources. These authors also note a curious property of moment vectors that we think would make them useful for inference in tasks far beyond retrieval: they support reasoning by analogy. Similar to linear algebra analogies in word vector representations, moment vectors learned with the Pix2Mem strategy appear to represent game-specific knowledge such as which power-ups the player has collected or what their location is within the larger game world.

AGDL systems can only learn from the parts of a game they actually experience. Missing from previous work in this domain is any way of quantifying coverage of a game's content. Although it is difficult to define what it means to see all or even enough of a game's content, we can still make progress by judging when one exploration method has seen more than another. For the first time, the exploration quality metrics contributed in this paper will quantify this level of exposure.

App Testing for Mobile Markets

Mobile app developers continually submit new software for distribution on services like Apple's App Store or Google Play, yielding many thousands of new apps per day. To balance the need for quality control with the financial incentive to bring new apps to market in a timely manner, distributors are pressured to judge whether an app should be published very quickly. In response, scalable techniques have been devised to detect malware or trivial wrappings or repackagings of existing apps using just a few seconds of simulated interaction (Chen et al. 2015). While this is useful for basic quality control, we believe more interaction is required to gather enough data to support content-based retrieval (Zhang et al. 2018) for apps and the moments they contain.

Either by revenue or by popularity, games stand out as a significant category of mobile apps. Some automated app testing techniques can guess at meaningful actions to try based on parsing graphical user interfaces assembled from platform-provided building blocks (Amalfitano et al. 2012). However, these techniques break down for many games that present custom interfaces drawn with low-level libraries such as OpenGL ES. Similarly, techniques based on static analysis of bytecode (Feng et al. 2014) break down when native machine code is used (e.g. in optimized rendering or game physics code). By focusing our exploration of videogames at the level of their display pixels and low-level input events, we adopt a perspective that works across many more game platforms.
Extrinsically Motivated Automatic Play

In most classic applications of AI in games (famously in Chess and Go), the goal is to win the match or attain the optimal score. Monte Carlo Tree Search (MCTS) is one surprisingly simple and effective strategy for finding a sequence of moves in a modeled environment that approximately optimizes this score (Browne et al. 2012), and it has already been applied to strategy analysis in games (Zook, Harrison, and Riedl 2015). These approaches all require a forward model (or simulator) that can list available actions, identify terminal states (with their scores), and (un-)apply actions in support of search. While constructing a fairly accurate forward model for the Atari 2600 using an existing emulator implementation like Stella (https://stella-emu.github.io/) is straightforward, the larger memory sizes of recent platforms (e.g. Android) complicate snapshot-and-restore.

Model-free reinforcement learning techniques are capable of learning to play videogames without access to this kind of simulator. Playing without the ability to take back an action, Deep Q-Networks were recently shown (Mnih et al. 2015) to learn super-human play styles for some Atari games (after more than 38 days of simulated gameplay experience). These results are impressive, but we seek broad state coverage for much more recent games in much less time. Dropping the requirement of learning a search-free action policy allows us to operate under much shorter timescales (minutes rather than days). For our applications, search is an entirely acceptable strategy for reaching new states.

Intrinsically Motivated Automatic Play

Montezuma's Revenge is an Atari game that stands out as particularly difficult for many deep reinforcement learning algorithms because the rewards that motivate play (derived from game score increases) occur so sparsely. In response, researchers are starting to add extra rewards representing intrinsic motivations (such as curiosity about infrequently visited states) to existing approaches (Bellemare et al. 2016). Although this might seem to distract from optimizing score, it turns out to help agents discover new rewarding paths. Further work has shown that exclusively using curiosity also leads to favorable results (Burda et al. 2018). In each of these projects, however, the primary way to argue for the effectiveness of a method is still to report the score associated with the existing reward metric.

Intrinsic motivation for exploration is not a new topic, even within applications to games (Merrick and Maher 2009). Other work in AI is specifically oriented towards mapping reachable spaces rather than learning ideal behavior policies. Rapidly Exploring Random Trees (RRT) (LaValle 1998) and Probabilistic Roadmaps (PRM) (Kavraki et al. 1996) are two such techniques originally invented for assisting in motion planning for robotics. Bauer et al. (2012) introduced RRT to the technical games research community in an application to level design feedback.

Iterative widening (IW) is another such exploration technique that has been applied to games. Previous work has shown that, when operating over a set of predefined Boolean pixel features, IW can achieve scores comparable to humans in almost real time when it comes to playing Atari 2600 games (Bandres, Bonet, and Geffner 2018). Our work focuses on RRT rather than IW because RRT makes use of a continuous feature space, exactly the type learned by Pix2Mem.

In evolutionary computation, novelty search is the paradigm that eschews optimizing a given fitness function in favor of finding solutions that are different from those seen before. Techniques such as MAP-Elites (Mouret and Clune 2015) are explicitly designed to yield a large archive of solutions with Quality Diversity (QD) (Pugh, Soros, and Stanley 2016).

Rather than attempt to use intrinsically motivated reinforcement learning techniques directly as exploration strategies, our work focuses on using algorithms like RRT to reach new moments in the game and understand how they relate to one another. As in QD work, the emphasis is on the resulting archive. For applications where training a policy is the desired goal, archives of interesting moments found by unrelated exploration strategies might provide a critical efficiency boost for learning (Aytar et al. 2018).
Game Interface

Rather than directly taking on the full complexity of a mobile game platform like Android, we consider a sequence of simpler game platforms. We start with the Atari 2600 (a second-generation game console), the platform commonly used in recent deep reinforcement learning experiments. We end at the Nintendo 64 (fifth-generation), a platform that modestly undershoots the technical specifications of low-end Android devices. The feature most distinguishing the platforms we consider from Android (aside from touch-screen controls) is the lack of networking capabilities. For better or for worse, many Android games are actually parts of larger distributed systems. This means that we can save and restore game states for the earlier game platforms (as required by MCTS-style forward models) if we are willing to pay the time and space cost of doing so. For Android, however, snapshotting the state of an active network connection would be unrealistic.

To interface with this broad range of game platforms, we selected the BizHawk emulator (http://tasvideos.org/Bizhawk.html). Maintained by the tool-assisted speedrunning (TAS) community, BizHawk offers an extensive scripting interface which we can access via Python. Depending on the complexity of the emulated platform and the exploration strategy, one second of simulated gameplay may take more or less than one second of wall-clock time; exploration need not occur in real time.

To define the space of possible actions for each platform, we recorded ourselves playing a small number of games for each. Most of our automated exploration strategies select actions (controller input states) only from those represented in this dataset.

Exploration Strategies

Each of our exploration strategies interacts with games in BizHawk over time. They may configure the controller state (defining which buttons are pressed) and advance the logic of the game. Strategies can decide whether any frame they experience is included in their output collection of moments (represented as screenshot and memory state snapshot pairs). Strategies may save and load any number of snapshots they like.
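To make the sketches in the following subsections concrete, here is a minimal illustration of the operations our strategies assume. The class and method names (EmulatorSession, set_buttons, advance, and so on) are hypothetical stand-ins for whatever scripting bridge is actually used; they are not BizHawk's own API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# A moment pairs a screenshot with a memory snapshot, as described above.
Moment = Tuple[bytes, bytes]

@dataclass
class ExplorationResult:
    moments: List[Moment] = field(default_factory=list)

class EmulatorSession:
    """Hypothetical emulator wrapper; exploration strategies rely on exactly
    these operations: set the controller, advance game logic, read the display
    and memory, and save or load whole-machine snapshots."""

    def set_buttons(self, buttons: Dict[str, bool]) -> None: ...
    def advance(self, frames: int) -> None: ...
    def screenshot(self) -> bytes: ...
    def memory(self) -> bytes: ...
    def save_snapshot(self) -> int: ...   # returns an opaque snapshot id
    def load_snapshot(self, snapshot_id: int) -> None: ...
```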
Attract Mode

The first and simplest exploration strategy we consider is associated with the concept of an attract mode. Attract modes are a feature of many games derived from arcade classics which played animations when idle to entice players to walk up and insert coins. Attract mode animations often show snippets of actual gameplay, profiles of characters, or cut-scenes revealing pieces of the main plot. In the case where gameplay is shown, the game is typically executing almost all the same code as in interactive play and a stored recording of representative play styles is seen. For knowledge extraction purposes, these demonstrations are as valid as those uncovered by interactive play.

Our attract mode strategy sets the no-buttons-pressed controller state and simply begins to advance game time by half a second each step. Although this strategy yields very interesting demonstrations of expert play and late-game content for some games, for many others it leaves the game sitting at an uninteresting screen that a human player would have quickly dismissed.
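Using the hypothetical session interface sketched earlier, the attract mode strategy reduces to a few lines; the 30-frames-per-step value is our assumption of what "half a second" means on a 60 Hz platform.

```python
def attract_mode(emu: EmulatorSession, steps: int,
                 frames_per_step: int = 30) -> ExplorationResult:
    """Idle exploration: press nothing and sample one moment per half-second step."""
    result = ExplorationResult()
    for _ in range(steps):
        emu.set_buttons({})            # the no-buttons-pressed controller state
        emu.advance(frames_per_step)   # advance game time by roughly half a second
        result.moments.append((emu.screenshot(), emu.memory()))
    return result
```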
Human Volunteers

Our next strategy reliably yields samples of core gameplay behavior. We ask human volunteers to play the game while recording their input stream as a BizHawk movie. These BK2 files are a common medium for sharing interaction traces in the TAS community. In our experiments below, human data comes from players who have usually never played the specific game before. In some cases, our volunteers were not familiar with the language of the text on the screen. For certain games in our collection, there are numerous movies of expert play available for download from the TAS community (http://tasvideos.org/Movies.html). For others, our team's first look at a game came from browsing the results of attract mode exploration.

To harvest a collection of moments from these pre-recorded movies, we play them back through BizHawk, keeping frames spaced half a second apart as in the attract mode strategy. Our volunteers did not make use of BizHawk's save/load features to explore more thoroughly, and they were not given any specific instructions about what play style to use.

Chaos Monkey

Our volunteers noted that often random button mashing was good enough to stumble through menus they could not read. Our chaos monkey strategy is inspired by the "UI/Application Exerciser Monkey" distributed with the Android developer tools (https://developer.android.com/studio/test/monkey). This tool injects a pseudo-random stream of input events (e.g. touch-screen actions) and system-level events (e.g. incoming phone calls) into an app under test in an effort to cause it to crash.

Our chaos monkey exploration strategy samples controller states according to a platform-wide distribution of controller states fit to our human gameplay collections. Actions are selected independent of the game, the contents of the display, and of previous actions. Similar to the attract mode strategy, we extract moments every half-second, holding the controller state between steps.

Many of our games require the player to hit the controller's START button at least once during traversal of early game menus while not requiring this button later in play. As a result, our baseline chaos monkey strategy sometimes even triggers attract mode animations for failure to select this very rarely used input configuration in time. Certainly, there is much room for improvement.
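A sketch of this loop, reusing the session interface above; fit_button_distribution is a hypothetical helper that turns the recorded human controller states into an empirical sampling distribution.

```python
import random
from collections import Counter
from typing import Dict, List

def fit_button_distribution(human_inputs: List[Dict[str, bool]]):
    """Empirical, platform-wide distribution over controller states seen in
    the human gameplay recordings (each state reduced to its pressed buttons)."""
    counts = Counter(frozenset(b for b, on in state.items() if on)
                     for state in human_inputs)
    states, weights = zip(*counts.items())
    return list(states), list(weights)

def chaos_monkey(emu: EmulatorSession, human_inputs: List[Dict[str, bool]],
                 steps: int, frames_per_step: int = 30) -> ExplorationResult:
    """Sample a controller state independent of the game, hold it for half a
    second, and record the resulting moment."""
    result = ExplorationResult()
    states, weights = fit_button_distribution(human_inputs)
    for _ in range(steps):
        pressed = random.choices(states, weights=weights, k=1)[0]
        emu.set_buttons({button: True for button in pressed})
        emu.advance(frames_per_step)
        result.moments.append((emu.screenshot(), emu.memory()))
    return result
```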
Rapidly-Exploring Random Trees

Drawing on state-space exploration algorithms developed for robotics, this strategy uses the Rapidly-Exploring Random Trees algorithm (LaValle 1998). RRT is designed for exploring continuous state spaces of moderately-low dimensionality. In an application to robot arm movement, the state of the arm might be captured by joint angles. The ground-truth state of our videogames, however, spans a very high-dimensional discrete space: the possible configuration of each byte of main memory and other subsystems. (Even though algorithms like IW can naturally operate on discrete representations, these algorithms would not make use of the knowledge implicitly represented in moment vectors to judge the significance of observed differences in display pixels or memory bytes.) Even the space of display pixels is another high-dimensional discrete space. In response, we project screenshots into 256-dimensional moment vectors using the Pix2Mem deep convolutional network strategy (Zhan and Smith 2018). Pix2Mem is a representation learning strategy in which the contents of memory are predicted from screenshot pixels. The bottleneck layer of this network serves as a compact, semantic representation of what is happening in the screenshot.

RRT works by growing a tree that captures how to reach points of the game's state space from some initial state (for us, the moment the platform has booted with a game cartridge installed). Nodes in the tree are previously seen states, and edges (annotated with action sequences) describe how to get from one state to another. The algorithm continually picks a random goal location in the continuous moment vector space, finds which existing tree node is closest to it, and performs an action from that state in an attempt to get closer to the goal. The resulting action and state form a new tree edge.

This strategy makes extensive use of the ability to save and load game snapshots. To amortize the cost, the actions available to our RRT strategy operate over many frames. If snapshot loading were not available, states could be recreated by replaying actions along their path from the root node. In most applications of RRT, the selection of the action to take next is parameterized by both the current state and the goal location. However, as in the chaos monkey strategy, our baseline RRT strategy selects a random human-demonstrated action and holds it for a half-second.

In our work, RRT functions as both a source of data for knowledge extraction (producing screenshots and memory states on which to train) and an application for extracted knowledge (distances between moments in the game are judged by the similarity of their moment vectors). An experiment described later in this paper looks at the mutual benefit between Pix2Mem representation learning and RRT exploration.
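The following sketch shows the shape of this baseline loop, building on the earlier helpers. Here embed() stands in for the Pix2Mem moment-vector function, sample_human_action() for a draw from the recorded human controller states, and sample_goal() for a random point in the moment-vector space; none of these names come from the paper's own code. Passing seeds instead of a single root grows the rapidly-exploring random forest discussed under Meta Strategies below.

```python
import numpy as np

def rrt_explore(emu: EmulatorSession, embed, sample_human_action, sample_goal,
                steps: int, frames_per_step: int = 30,
                seeds=None) -> ExplorationResult:
    """Baseline RRT over moment-vector space.

    Each step: sample a random goal vector, find the nearest existing node,
    rewind the emulator to that node's snapshot, hold one random
    human-demonstrated controller state for half a second, and record the
    resulting moment as a new node (i.e. a new tree edge)."""
    result = ExplorationResult()
    if seeds is None:
        nodes = [(emu.save_snapshot(), embed(emu.screenshot()))]
    else:
        nodes = list(seeds)   # (snapshot_id, moment_vector) pairs from elsewhere

    for _ in range(steps):
        goal = sample_goal()
        nearest = min(nodes, key=lambda node: np.linalg.norm(node[1] - goal))
        emu.load_snapshot(nearest[0])

        emu.set_buttons(sample_human_action())
        emu.advance(frames_per_step)

        shot, mem = emu.screenshot(), emu.memory()
        result.moments.append((shot, mem))
        nodes.append((emu.save_snapshot(), embed(shot)))
    return result
```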
Meta Strategies

The strategies introduced above offer many opportunities for local improvements. Without improving any individual strategy, we want to highlight how elementary strategies might be synergistically combined. When multiple strategies are applied to the same game, we ask how one strategy can make use of another's exploration results.

Rather than always starting from the same boot state, algorithmic strategies can be used to create branches off of human volunteer data. Similarly, RRT can be modestly adapted to produce rapidly-exploring random forests that branch off of state-space bookmarks placed by other exploration strategies. Rather than choosing actions independent of the current game/state/goal, data from other strategies might be used to train a predictive model that selects a more relevant action by looking at the moment vectors of the current and goal states. One hybrid branching strategy is experimentally examined later in this paper.

Human volunteers, with access to a visualization of moments extracted by other strategies, might be able to find more interesting moments from which to start their own gameplay. This visualization might take the form of a t-SNE (Maaten and Hinton 2008) plot of previously-seen moments (making use of the rich moment vector representation). Humans and algorithms together might be able to reach corners of the space that neither would have explored on their own in the same amount of total gameplay time.

Exploration Metrics

How should we judge whether an exploration strategy is continuing to make progress or if one is making progress more efficiently than another? We would like to be able to directly observe how much territory is covered by a strategy by consulting a map. In our work, the space of rich, semantic moment vectors functions as this map.

By contrast to reinforcement learning research, we should emphasize that our goal is to assess the diversity of moments an exploration strategy has extracted, not to judge its gameplay behavior. Competent use of state saving/loading actions is highly relevant to exploration quality while not actually representing any in-game behavior at all.

Measuring the Spread of Points

Our preliminary exploration metrics are built on intuitions for how we might explore a low-dimensional, continuous space. Each moment extracted by an exploration strategy represents a point somewhere in this space. Intuitively, exploration strategies should produce a cloud of points that is well spread out in this space. How should we measure this?

One strategy is to draw an axis-aligned box (or hypercube) around all of the points and then to report the area (or hypervolume) of the box. As new points are explored and dropped into this space, the box can only grow when points fall outside the previous bounds, representing increasingly wider coverage. To account for degenerate boxes (which are well spread in some dimensions but not at all in others), we report the sum of the lengths of the sides of the box rather than their product. This is our bounding box sum metric.

Another strategy pays less attention to extreme points and more to the distribution of points within the cloud. We propose to fit a (likely highly anisotropic) multi-dimensional Gaussian distribution to the points. The directions of maximal variation (derivable from the covariance matrix of the point data) play a similar role to the sides of the bounding box above, with variance being analogous to hypervolume. Again, to account for potentially degenerate distributions (for which the data is spread out only in a lower-dimensional subspace), we simply sum the eigenvalues of the covariance matrix (finding its nuclear norm) rather than forming their product (the determinant). This is our nuclear norm metric.

By contrast with the bounding box sum metric, the nuclear norm metric may decrease as new points are explored if those points are relatively more concentrated than the initial set of points. This can happen for exploration strategies that spend a long time in one game mode or screen after initially traversing menus with rich visual variation.
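Both spread metrics have direct NumPy implementations over a matrix with one embedded moment per row (a sketch, not the authors' code). Note that because a covariance matrix is symmetric positive semi-definite, the sum of its eigenvalues equals its trace, so the nuclear norm is simply the total per-dimension variance.

```python
import numpy as np

def bounding_box_sum(points: np.ndarray) -> float:
    """Sum of the side lengths of the axis-aligned box enclosing all points.

    points has shape (n_moments, n_dims): one embedded moment per row."""
    return float(np.sum(points.max(axis=0) - points.min(axis=0)))

def nuclear_norm(points: np.ndarray) -> float:
    """Sum of the eigenvalues of the covariance matrix of the point cloud
    (equivalently, its trace: the summed per-dimension variances)."""
    centered = points - points.mean(axis=0)
    covariance = centered.T @ centered / max(len(points) - 1, 1)
    return float(np.sum(np.linalg.eigvalsh(covariance)))
```

Evaluating either function on growing prefixes of a strategy's moment collection yields progress-over-time curves like those plotted in the figures below.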
Embedding Game Moments into a Space

Given that we can evaluate the spread of points in some space, how do we embed explored game moments into that space in the first place? A useful embedding function should place similar moments nearby one another and dissimilar moments far apart if our intuitions about the spread of points are to be effective.

Consider a game-specific embedding function for a game like Super Mario World for the Super Nintendo Entertainment System. In one three-dimensional embedding, we might use one dimension to represent the game scene (an ordered index of game menus and then sequential level identifiers) and two more dimensions to represent the position of the player character within that scene. As an exploration strategy completes one scene and moves to the next, points on a new plane are discovered and discontinuous movement is represented in the positional dimensions. Although the truth of which scene we are in and where our character is in the scene could be computed from memory bytes, we consider a more general family of game-specific embeddings. In particular, we reuse the Pix2Mem embedding strategy (also used in the RRT strategy above). Zhan and Smith (2018) showed that this representation was useful for identifying moments by scene as well as position within scene. This is our game-specific embedding strategy.

To compare exploration across games and across game platforms, we cannot assume we have pre-established gameplay datasets available to train instances of Pix2Mem. Consider this game-independent three-dimensional embedding: the average RGB color of display pixels. When characters or other display elements move across colorful backgrounds by small amounts, the average color will only change a bit (if at all). Scene transitions, which might be marked by fades and flashes followed by new screens with different dominant colors, might be detected in this space. Three dimensions is clearly too small to be useful (particularly considering what flashes of all-white and all-black would do under the bounding box sum metric), but we can still recover the idea of a game-independent perceptual space. Recycling a specific pre-trained neural network (the Inception-V3 network trained to convergence on ImageNet, as distributed at https://keras.io/applications/#inceptionv3), we can embed screenshots into a 1000-dimensional space. Even though the output dimensions from this network are unrelated to the objects appearing in our videogame screenshots, it still seems to work in our experiments below. The all-white-all-black catastrophe for the bounding box sum metric, in particular, cannot occur with this embedding because the elements of the embedded vector are constrained to always add up to one. This is our game-independent embedding strategy.
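As an illustration, here is a sketch of both game-independent embeddings; the exact preprocessing is our assumption, using the Keras applications module mentioned above (loaded here via tensorflow.keras).

```python
import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from tensorflow.keras.preprocessing import image

def average_rgb_embed(frame: np.ndarray) -> np.ndarray:
    """Three-dimensional game-independent embedding: mean R, G, B of the frame."""
    return frame.reshape(-1, 3).mean(axis=0)

# include_top=True keeps the final 1000-way softmax, whose outputs sum to one
# (the property that rules out the all-white/all-black catastrophe above).
_inception = InceptionV3(weights="imagenet", include_top=True)

def inception_embed(screenshot_path: str) -> np.ndarray:
    """Embed one screenshot into the 1000-dimensional ImageNet prediction space."""
    img = image.load_img(screenshot_path, target_size=(299, 299))
    batch = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return _inception.predict(batch)[0]
```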
Fig. 1 compares our two spread metrics with two embedding functions for a run of the human volunteer exploration strategy. Using the game-specific embedding (Pix2Mem trained on data from a longer human play session), each key moment (identified by visual inspection of the gameplay video) is associated with kinks in the graph for both spread metrics. For the game-independent (Inception) embedding, broad trends are somewhat preserved. The nuclear norm metric temporarily decreases for the game-specific embedding but not the game-independent embedding because the game-specific embedding knows our human player is still focused on one specific level for a long time while the game-independent metric is presumably more sensitive to background-art details that are scrolling by as the level progresses.

Figure 1: Comparison of exploration quality metrics for a 5-minute run of human volunteer data in Super Mario World. The marked moments represent approximately when (A) the main menu fades in, (B) the overworld map fades in, (C) first core (platformer) gameplay begins, and (D) first core gameplay completes.

Experiments

This section describes experiments applying our exploration quality metrics to our elementary exploration strategies.

Core Gameplay Coverage

Can our automated exploration strategies make progress comparable to our human volunteers?

Fig. 2 visualizes the game-specific nuclear norm metric as a function of time for four exploration strategies. The attract mode strategy never starts core gameplay; however, it does experience a short recording of example gameplay in the game's attract mode looping behind the main menu. Chaos monkey manages to start core gameplay almost as fast as the human volunteer; however, it cannot sustain the diversity of human-demonstrated moments. RRT starts slow (often deciding to explore many moments selected from the animated transitions between scenes during which the player has no meaningful control) but eventually surpasses the spread from the other methods. Trends indicate that RRT would benefit from being able to explore for longer while the other strategies (including our human volunteer, one of the authors with imperfect gameplay skill) would not.

Figure 2: Exploration progress (by Pix2Mem nuclear norm metric) on Super Mario World by various strategies.

In a qualitative analysis of the screenshots extracted during exploration, we found that RRT could reach core gameplay in Super Mario World within two minutes when started from the game's boot state. When started from the first moment of core gameplay, RRT could finish the level within five minutes (less than 600 tree-expansion steps). Results like this encourage us to believe automatic exploration could be very useful for testing modern mobile games on a timescale that would be highly impractical for current deep reinforcement learning approaches.

This experiment anchors our exploration metrics in a qualitative analysis of the modes and levels of one well-known game. In the remaining experiments, we do not perform a qualitative analysis, as even the authors are not sufficiently familiar with the games in question to make a clear statement about what level of exploration would be enough for any given application.

Bootstrapping Perception

The experiment above executed RRT using a fixed embedding based on Pix2Mem trained on previous human gameplay for that specific game. Without falling back to a game-independent embedding (which would be reasonable anyway), can automatic exploration data itself be used as a source of training data for RRT's visual perception model?

In this experiment, we consider running RRT in the same game for 15 minutes at a time using different screenshot embedding functions. Our network starts untrained (weights randomly initialized). Although this embedding function has had no experience of the game under test (or any other game), it still supports automatic exploration, albeit at reduced efficiency. After the time is up, we incrementally train our network on the results of the exploration run and discard the tree structure. We repeat this process of exploring and retraining several more times. Even though the final run of the algorithm still experiences only the same amount of total gameplay time and starts from the same initial state, it makes better progress through the game as a result of a better (game-specific) visual perception model.

Fig. 3 shows the Inception nuclear norm metric (we used a game-independent metric for this experiment to avoid any confusion with metrics that might themselves be based on exploring this specific game) for four runs of RRT. Notice how more perceptual experience allows the fixed algorithm to explore both faster and deeper. The fact that the bulk of the gains are made after just the first 15 minutes of gameplay experience is promising for the use of game-specific embeddings even for games no human reviewers or testers have yet played (as in the app testing scenario).

Figure 3: RRT exploration progress (by Inception nuclear norm metric) for Super Mario World using self-play to train the perceptual embedding network.
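The bootstrapping loop itself is simple; below is a sketch in which the session factory, the untrained Pix2Mem constructor, the RRT runner, and the retraining step are all passed in as functions, since the paper does not specify those implementation details.

```python
def bootstrap_perception(make_session, make_untrained_embedder, run_rrt, retrain,
                         rounds: int = 4, minutes_per_round: int = 15):
    """Alternate exploring and retraining the perceptual model.

    Every round starts from the same boot state and explores with the current
    embedder for a fixed budget; afterwards the embedder is retrained on the
    collected (screenshot, memory) pairs and the tree itself is discarded."""
    embedder = make_untrained_embedder()        # randomly initialized Pix2Mem
    for _ in range(rounds):
        session = make_session()                # fresh boot state each round
        result = run_rrt(session, embedder, minutes=minutes_per_round)
        embedder = retrain(embedder, result.moments)
    return embedder
```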
Cooperation between Strategies

On a fixed budget, how should resources be allocated between different exploration strategies? Fig. 4 compares equal-time applications of chaos monkey and RRT with a hybrid strategy that switches from chaos monkey to RRT at the midpoint. The hybrid uses 100 randomly selected chaos monkey results to seed a forest of trees. As above, we show the Inception nuclear norm metric. Surprisingly, the hybrid strategy directly benefits from the quick start of chaos monkey while retaining the growth trend from RRT. The hybrid strategy covers more ground in equal time. In this case, only the set of seed states for exploration has been recycled, not a learned knowledge representation trained from chaos monkey's experience. In the previous experiment, we preserved the perceptual knowledge and not the concrete seed states.

Figure 4: Exploration progress (by Inception nuclear norm metric) for 30 total minutes of Super Mario World gameplay using chaos monkey, RRT, and a combination where RRT explores from points first found by chaos monkey.
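A sketch of this hybrid schedule, composed from the chaos monkey and RRT sketches above (again under the assumed session interface, with sample_goal as before a stand-in for drawing a random point in moment-vector space):

```python
import random

def hybrid_explore(emu: EmulatorSession, human_inputs, embed, sample_goal,
                   total_steps: int, frames_per_step: int = 30) -> ExplorationResult:
    """Chaos monkey for the first half of the budget, then an RRT forest grown
    from 100 randomly chosen chaos-monkey states."""
    result = ExplorationResult()
    states, weights = fit_button_distribution(human_inputs)

    # Phase 1: chaos monkey, bookmarking a snapshot at every extracted moment.
    seeds = []
    for _ in range(total_steps // 2):
        pressed = random.choices(states, weights=weights, k=1)[0]
        emu.set_buttons({button: True for button in pressed})
        emu.advance(frames_per_step)
        shot, mem = emu.screenshot(), emu.memory()
        result.moments.append((shot, mem))
        seeds.append((emu.save_snapshot(), embed(shot)))

    # Phase 2: reuse the RRT loop above, seeded with 100 of those bookmarks.
    sample_action = lambda: {b: True for b in
                             random.choices(states, weights=weights, k=1)[0]}
    forest = rrt_explore(emu, embed, sample_action, sample_goal,
                         steps=total_steps - total_steps // 2,
                         frames_per_step=frames_per_step,
                         seeds=random.sample(seeds, k=min(100, len(seeds))))
    result.moments.extend(forest.moments)
    return result
```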
Performance across Games and Platforms

Our final experiment compares the performance of our elementary exploration strategies across a number of commercial games selected from different BizHawk-supported game platforms. Because we use the same metric for each game (Inception nuclear norm after 10 minutes of simulated gameplay), scores may be compared across games and platforms. However, the range of scores depends both on game design issues (such as the degree to which the Inception network distinguishes visual variety in the game's display and how much variety can actually be experienced in just 10 minutes) and the utility of the exploration strategy. For randomized exploration strategies (chaos monkey and RRT), we average the score over 10 exploration runs.

Fig. 5 samples games from the Atari 2600 (128 bytes of memory), Game Boy (8 KiB), Super Nintendo (128 KiB), and Nintendo 64 (4 MiB). Notably, it is possible to surpass human exploration quality in some games, and no single strategy consistently over- or under-performs the others. From our bootstrapped perception experiment above, we know RRT has a slow start, particularly on the 10-minute scale and with a poor perceptual embedding (it uses the 1000-dimensional Inception embedding space in this experiment). We find these results encouraging for the possibility that modest improvements to our baseline strategies could yield human-comparable exploration on similar timescales.

Figure 5: Exploration progress after 10 minutes of simulated gameplay (by Inception nuclear norm metric) across games.

Conclusion

We introduce the problem of automatically exploring a game's state space with the goal of extracting a useful semantic map on timescales comparable to human playtesting efforts. We introduce elementary exploration strategies and demonstrate their application on an array of games and game platforms. To quantify exploration progress, we define four quality metrics that reveal different aspects of exploration patterns while exploiting pre-existing reference data when available. To compare exploration efficiency with human playtesters, we propose to compare quality metrics as a function of the total gameplay time used in exploration. Experimentally, we demonstrate that with a very modest amount of gameplay time, automatic exploration strategies can reach interesting gameplay moments (such as the completion of the first level in Super Mario World), bootstrap their own game-specific perception models, and leverage data produced by other exploration methods. Finally, we offer the first comparison of exploration progress across multiple games and platforms using a single platform-independent metric.
References

Amalfitano, D.; Fasolino, A. R.; Tramontana, P.; De Carmine, S.; and Memon, A. M. 2012. Using GUI ripping for automated testing of Android applications. In Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering, 258–261. ACM.

Aytar, Y.; Pfaff, T.; Budden, D.; Paine, T. L.; Wang, Z.; and de Freitas, N. 2018. Playing hard exploration games by watching YouTube. CoRR abs/1805.11592.

Bandres, W.; Bonet, B.; and Geffner, H. 2018. Planning with pixels in (almost) real time. CoRR abs/1801.03354.

Bauer, A. W., and Popovic, Z. 2012. RRT-based game level analysis, visualization, and visual refinement. In Proceedings of the AAAI Conference on Artificial Intelligence in Interactive Entertainment.

Bellemare, M.; Srinivasan, S.; Ostrovski, G.; Schaul, T.; Saxton, D.; and Munos, R. 2016. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, 1471–1479.

Browne, C. B.; Powley, E.; Whitehouse, D.; Lucas, S. M.; Cowling, P. I.; Rohlfshagen, P.; Tavener, S.; Perez, D.; Samothrakis, S.; and Colton, S. 2012. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games 4(1):1–43.

Burda, Y.; Edwards, H.; Pathak, D.; Storkey, A.; Darrell, T.; and Efros, A. A. 2018. Large-scale study of curiosity-driven learning. arXiv preprint arXiv:1808.04355.

Chen, K.; Wang, P.; Lee, Y.; Wang, X.; Zhang, N.; Huang, H.; Zou, W.; and Liu, P. 2015. Finding unknown malice in 10 seconds: Mass vetting for new threats at the Google-Play scale. In 24th USENIX Security Symposium (USENIX Security 15), 659–674. Washington, D.C.: USENIX Association.

Feng, Y.; Anand, S.; Dillig, I.; and Aiken, A. 2014. Apposcopy: Semantics-based detection of Android malware through static analysis. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2014, 576–587. New York, NY, USA: ACM.

Joulin, A.; Grave, E.; Bojanowski, P.; and Mikolov, T. 2016. Bag of tricks for efficient text classification. CoRR abs/1607.01759.

Kaltman, E.; Osborn, J.; Wardrip-Fruin, N.; and Mateas, M. 2017. Getting the GISST: A toolkit for the creation, analysis and reference of game studies resources. In Proceedings of the 12th International Conference on the Foundations of Digital Games, FDG '17, 16:1–16:10. New York, NY, USA: ACM.

Kavraki, L. E.; Svestka, P.; Latombe, J.-C.; and Overmars, M. H. 1996. Probabilistic roadmaps for path planning in high-dimensional configuration spaces. IEEE Transactions on Robotics and Automation 12(4):566–580.

LaValle, S. M. 1998. Rapidly-exploring random trees: A new tool for path planning. Technical Report TR 98-11, Computer Science Department, Iowa State University.

Maaten, L. v. d., and Hinton, G. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9(Nov):2579–2605.

Merrick, K. E., and Maher, M. L. 2009. Motivated reinforcement learning: curious characters for multiuser games. Springer Science & Business Media.

Mikolov, T.; Yih, W.-t.; and Zweig, G. 2013. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 746–751.

Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature 518(7540):529.

Mouret, J.-B., and Clune, J. 2015. Illuminating search spaces by mapping elites. arXiv preprint arXiv:1504.04909.

Osborn, J. C.; Summerville, A.; and Mateas, M. 2017a. Automated game design learning. In 2017 IEEE Conference on Computational Intelligence and Games (CIG), 240–247.

Osborn, J.; Summerville, A.; and Mateas, M. 2017b. Automatic mapping of NES games with Mappy. In FDG '17 Proceedings of the 12th International Conference on the Foundations of Digital Games. ACM.

Pugh, J. K.; Soros, L. B.; and Stanley, K. O. 2016. Quality diversity: A new frontier for evolutionary computation. Frontiers in Robotics and AI 3:40.

Shen, Y.-C.; Chien, R.; and Hung, S.-H. 2014. Toward efficient dynamic analysis and testing for Android malware. IT CoNvergence PRActice (INPRA) 2(3):14–23.

Zhan, Z., and Smith, A. M. 2018. Retrieving game states with moment vectors. In Proceedings of the AAAI 2018 Workshop on Knowledge Extraction from Games.

Zhang, X.; Zhan, Z.; Holtz, M.; and Smith, A. M. 2018. Crawling, indexing, and retrieving moments in videogames. In FDG '18 Proceedings of the International Conference on the Foundations of Digital Games. ACM.

Zook, A.; Harrison, B.; and Riedl, M. 2015. Monte-Carlo tree search for simulation-based play strategy analysis. In Proceedings of the 10th International Conference on the Foundations of Digital Games, FDG 2015, Pacific Grove, CA, USA, June 22-25, 2015.