Taking the Scenic Route: Automatic Exploration for Videogames

Zeping Zhan, Batu Aytemiz, Adam M. Smith
Design Reasoning Lab
University of California, Santa Cruz
{zzha50, baytemiz, amsmith}@ucsc.edu

Abstract

Machine playtesting tools and game moment search engines require exposure to the diversity of a game's state space if they are to report on or index the most interesting moments of possible play. Meanwhile, mobile app distribution services would like to quickly determine if a freshly-uploaded game is fit to be published. Having access to a semantic map of reachable states in the game would enable efficient inference in these applications. However, human gameplay data is expensive to acquire relative to the coverage of a game that it provides. We show that off-the-shelf automatic exploration strategies can explore with an effectiveness comparable to human gameplay on the same timescale. We contribute generic methods for quantifying exploration quality as a function of time and demonstrate our metrics on several elementary techniques and human players on a collection of commercial games sampled from multiple game platforms (from Atari 2600 to Nintendo 64). Emphasizing the diversity of states reached and the semantic map extracted, this work makes productive contrast with the focus on finding a behavior policy or optimizing game score used in most automatic game playing research.

To know what's inside of a videogame, you need to do more than read source code or browse asset data. Looking over a player's shoulder can give you one view of a game's space of possible interactions, and running an artificial intelligence (AI) algorithm to find score-optimizing behavior can give you another. Neither, however, is trying to play in a way that intentionally covers the most ground.

Access to a large and diverse sample of trajectories through a game's state space could enable many applications relevant to game designers, players, scholars, educators, and distributors. We might automatically map reachable zones of a game's world (Osborn, Summerville, and Mateas 2017b; Bauer and Popovic 2012) and visualize how these change in response to design modifications. We might index the moments demonstrated and return them in response to visual queries (Zhang et al. 2018), enabling users of a search engine to bookmark and share references to significant events (Kaltman et al. 2017). We might even identify when a game freshly uploaded to an app store should be flagged for removal because it behaves maliciously only after several minutes of interaction (Shen, Chien, and Hung 2014).

Meanwhile, in automated gameplay via reinforcement learning (RL), the ability to explore is critical for efficient learning. Aytar et al. (2018) demonstrate how even a single externally-provided trace of high-reward behavior can be generalized into a reusable behavior policy. We believe that algorithms explicitly designed to address the exploration problem (finding sequences of actions that reach interesting moments by any means necessary) could complement advances in reinforcement learning.

We introduce the problem of automatically exploring a game's state space with the goal of producing a semantic map (useful for downstream inference tasks) on timescales comparable to human playtesting efforts. This map should allow us to judge the similarity of moments and allow inferences about how moments relate to one another. Automatic exploration is related to the topic of intrinsically-motivated play in reinforcement learning (Bellemare et al. 2016). However, we are interested in the large collection of moments found via exploration rather than training a behavior policy. In particular, we are most interested in uncovering automatic exploration techniques that are plausibly applicable to analysis of contemporary mobile games (native-compiled software binaries for 64-bit computing platforms which draw user interfaces with low-level code) on short timescales (minutes rather than days of in-game experience).

In this paper, we make the following contributions:
• We define several exploration quality metrics, some of which make use of game-specific knowledge and others that are comparable across games and platforms.
• We offer a methodology for comparing human and machine exploration efficiency as a function of time.
• We identify several elementary automatic exploration strategies and demonstrate them on a range of platforms.
• Our experimental results highlight basic gameplay competency for elementary methods, the ability of our metrics to identify distinct scenes, the ability to bootstrap perceptual models with self-play, and the strength variation for exploration techniques across games and across game platforms spanning from Atari 2600 to Nintendo 64.
Background

This section reviews existing techniques for gameplay knowledge extraction, automated testing of mobile apps, and existing techniques in AI for extrinsically and intrinsically motivated automatic play.

Gameplay Knowledge Extraction

Our project is situated within the larger effort of automated game design learning (AGDL) (Osborn, Summerville, and Mateas 2017a), a research paradigm that encourages understanding games by directly interacting with them (rather than, e.g., source code analysis). In particular, our work is a direct response to a previous paper in the Knowledge Extraction from Games workshop on representing and retrieving game states with moment vectors (Zhan and Smith 2018). The automatic exploration strategies we offer in this paper are a source of gameplay trajectories (screenshots, actions, and memory snapshots) that complements the use of human gameplay data in that work.

Moment vectors for interactive media are a sub-symbolic knowledge representation similar to that of word vectors in natural language processing (Mikolov, Yih, and Zweig 2013). Although word vectors can be of direct use in text retrieval applications, they can easily support more complex tasks that rely on the ability to make inferences from a text's semantic content (Joulin et al. 2016).

Zhang et al. (2018) describe a visual search engine for game moments that is based on moment vectors derived from various gameplay sources. These authors also note a curious property of moment vectors that we think would make them useful for inference in tasks far beyond retrieval: they support reasoning by analogy. Similar to linear algebra analogies in word vector representations, moment vectors learned with the Pix2Mem strategy appear to represent game-specific knowledge such as which power-ups the player has collected or what their location is within the larger game world.

AGDL systems can only learn from the parts of a game they actually experience. Missing from previous work in this domain is any way of quantifying coverage of a game's content. Although it is difficult to define what it means to see all or even enough of a game's content, we can still make progress by judging when one exploration method has seen more than another. For the first time, the exploration quality metrics contributed in this paper will quantify this level of exposure.

App Testing for Mobile Markets

Mobile app developers continually submit new software for distribution on services like Apple's App Store or Google Play, yielding many thousands of new apps per day. To balance the need for quality control with the financial incentive to bring new apps to market in a timely manner, distributors are pressured to judge whether an app should be published very quickly. In response, scalable techniques have been devised to detect malware or trivial wrappings or repackagings of existing apps using just a few seconds of simulated interaction (Chen et al. 2015). While this is useful for basic quality control, we believe more interaction is required to gather enough data to support content-based retrieval (Zhang et al. 2018) for apps and the moments they contain.

Either by revenue or by popularity, games stand out as a significant category of mobile apps. Some automated app testing techniques can guess at meaningful actions to try based on parsing graphical user interfaces assembled from platform-provided building blocks (Amalfitano et al. 2012). However, these techniques break down for many games that present custom interfaces drawn with low-level libraries such as OpenGL ES. Similarly, techniques based on static analysis of bytecode (Feng et al. 2014) break down when native machine code is used (e.g. in optimized rendering or game physics code). By focusing our exploration of videogames at the level of their display pixels and low-level input events, we adopt a perspective that works across many more game platforms.
Extrinsically Motivated Automatic Play

In most classic applications of AI in games (famously in Chess and Go), the goal is to win the match or attain the optimal score. Monte Carlo Tree Search (MCTS) is one surprisingly simple and effective strategy for finding a sequence of moves in a modeled environment that approximately optimizes this score (Browne et al. 2012), and it has already been applied to strategy analysis in games (Zook, Harrison, and Riedl 2015). These approaches all require a forward model (or simulator) that can list available actions, identify terminal states (with their scores), and (un-)apply actions in support of search. While constructing a fairly accurate forward model for the Atari 2600 using an existing emulator implementation like Stella (https://stella-emu.github.io/) is straightforward, the larger memory sizes of recent platforms (e.g. Android) complicate snapshot-and-restore.

Model-free reinforcement learning techniques are capable of learning to play videogames without access to this kind of simulator. Playing without the ability to take back an action, Deep Q-Networks were recently shown (Mnih et al. 2015) to learn super-human play styles for some Atari games (after more than 38 days of simulated gameplay experience). These results are impressive, but we seek broad state coverage for much more recent games in much less time. Dropping the requirement of learning a search-free action policy allows us to operate under much shorter timescales (minutes rather than days). For our applications, search is an entirely acceptable strategy for reaching new states.

Intrinsically Motivated Automatic Play

Montezuma's Revenge is an Atari game that stands out as particularly difficult for many deep reinforcement learning algorithms because the rewards that motivate play (derived from game score increases) occur so sparsely. In response, researchers are starting to add extra rewards representing intrinsic motivations (such as curiosity about infrequently visited states) to existing approaches (Bellemare et al. 2016). Although this might seem to distract from optimizing score, it turns out to help agents discover new rewarding paths. Further work has shown that exclusively using curiosity also leads to favorable results (Burda et al. 2018). In each of these projects, however, the primary way to argue for the effectiveness of a method is still to report the score associated with the existing reward metric.

Intrinsic motivation for exploration is not a new topic, even within applications to games (Merrick and Maher 2009). Other work in AI is specifically oriented towards mapping reachable spaces rather than learning ideal behavior policies. Rapidly Exploring Random Trees (RRT) (LaValle 1998) and Probabilistic Roadmaps (PRM) (Kavraki et al. 1996) are two such techniques originally invented for assisting in motion planning for robotics. Bauer et al. (2012) introduced RRT to the technical games research community in an application to level design feedback.

Iterative widening (IW) is another such exploration technique that has been applied to games. Previous work has shown that, when operating over a set of predefined Boolean pixel features, IW can achieve scores comparable to humans in almost real time when it comes to playing Atari 2600 games (Bandres, Bonet, and Geffner 2018). Our work focuses on RRT rather than IW because RRT makes use of a continuous feature space, exactly the type learned by Pix2Mem.

In evolutionary computation, novelty search is the paradigm that eschews optimizing a given fitness function in favor of finding solutions that are different from those seen before. Techniques such as MAP-Elites (Mouret and Clune 2015) are explicitly designed to yield a large archive of solutions with Quality Diversity (QD) (Pugh, Soros, and Stanley 2016).

Rather than attempt to use intrinsically motivated reinforcement learning techniques directly as exploration strategies, our work focuses on using algorithms like RRT to reach new moments in the game and understand how they relate to one another. As in QD work, the emphasis is on the resulting archive. For applications where training a policy is the desired goal, archives of interesting moments found by unrelated exploration strategies might provide a critical efficiency boost for learning (Aytar et al. 2018).
Game Interface

Rather than directly taking on the full complexity of a mobile game platform like Android, we consider a sequence of simpler game platforms. We start with the Atari 2600 (a second-generation game console), the platform commonly used in recent deep reinforcement learning experiments. We end at the Nintendo 64 (fifth-generation), a platform that modestly undershoots the technical specifications of low-end Android devices. The feature most distinguishing the platforms we consider from Android (aside from touch-screen controls) is the lack of networking capabilities. For better or for worse, many Android games are actually parts of larger distributed systems. This means that we can save and restore game states for the earlier game platforms (as required by MCTS-style forward models) if we are willing to pay the time and space cost of doing so. For Android, however, snapshotting the state of an active network connection would be unrealistic.

To interface with this broad range of game platforms, we selected the BizHawk emulator (http://tasvideos.org/Bizhawk.html). Maintained by the tool-assisted speedrunning (TAS) community, BizHawk offers an extensive scripting interface which we can access via Python. Depending on the complexity of the emulated platform and the exploration strategy, one second of simulated gameplay may take more or less than one second of wall-clock time; exploration need not occur in real time.

To define the space of possible actions for each platform, we recorded ourselves playing a small number of games for each. Most of our automated exploration strategies select actions (controller input states) only from those represented in this dataset.

Exploration Strategies

Each of our exploration strategies interacts with games in BizHawk over time. They may configure the controller state (defining which buttons are pressed) and advance the logic of the game. Strategies can decide whether any frame they experience is included in their output collection of moments (represented as screenshot and memory state snapshot pairs). Strategies may save and load any number of snapshots they like.
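To make the sketches in the following subsections concrete, here is a minimal illustration of the operations our strategies assume. The class and method names (EmulatorSession, set_buttons, advance, and so on) are hypothetical stand-ins for whatever scripting bridge is actually used; they are not BizHawk's own API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# A moment pairs a screenshot with a memory snapshot, as described above.
Moment = Tuple[bytes, bytes]

@dataclass
class ExplorationResult:
    moments: List[Moment] = field(default_factory=list)

class EmulatorSession:
    """Hypothetical emulator wrapper; exploration strategies rely on exactly
    these operations: set the controller, advance game logic, read the display
    and memory, and save or load whole-machine snapshots."""

    def set_buttons(self, buttons: Dict[str, bool]) -> None: ...
    def advance(self, frames: int) -> None: ...
    def screenshot(self) -> bytes: ...
    def memory(self) -> bytes: ...
    def save_snapshot(self) -> int: ...   # returns an opaque snapshot id
    def load_snapshot(self, snapshot_id: int) -> None: ...
```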
Attract Mode

The first and simplest exploration strategy we consider is associated with the concept of an attract mode. Attract modes are a feature of many games derived from arcade classics which played animations when idle to entice players to walk up and insert coins. Attract mode animations often show snippets of actual gameplay, profiles of characters, or cut-scenes revealing pieces of the main plot. In the case where gameplay is shown, the game is typically executing almost all the same code as in interactive play and a stored recording of representative play styles is seen. For knowledge extraction purposes, these demonstrations are as valid as those uncovered by interactive play.

Our attract mode strategy sets the no-buttons-pressed controller state and simply begins to advance game time by half a second each step. Although this strategy yields very interesting demonstrations of expert play and late-game content for some games, for many others it leaves the game sitting at an uninteresting screen that a human player would have quickly dismissed.
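Using the hypothetical session interface sketched earlier, the attract mode strategy reduces to a few lines; the 30-frames-per-step value is our assumption of what "half a second" means on a 60 Hz platform.

```python
def attract_mode(emu: EmulatorSession, steps: int,
                 frames_per_step: int = 30) -> ExplorationResult:
    """Idle exploration: press nothing and sample one moment per half-second step."""
    result = ExplorationResult()
    for _ in range(steps):
        emu.set_buttons({})            # the no-buttons-pressed controller state
        emu.advance(frames_per_step)   # advance game time by roughly half a second
        result.moments.append((emu.screenshot(), emu.memory()))
    return result
```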
Human Volunteers

Our next strategy reliably yields samples of core gameplay behavior. We ask human volunteers to play the game while recording their input stream as a BizHawk movie. These BK2 files are a common medium for sharing interaction traces in the TAS community. In our experiments below, human data comes from players who have usually never played the specific game before. In some cases, our volunteers were not familiar with the language of the text on the screen. For certain games in our collection, there are numerous movies of expert play available for download from the TAS community (http://tasvideos.org/Movies.html). For others, our team's first look at a game came from browsing the results of attract mode exploration.

To harvest a collection of moments from these pre-recorded movies, we play them back through BizHawk, keeping frames spaced half a second apart as in the attract mode strategy. Our volunteers did not make use of BizHawk's save/load features to explore more thoroughly, and they were not given any specific instructions about what play style to use.

Chaos Monkey

Our volunteers noted that often random button mashing was good enough to stumble through menus they could not read. Our chaos monkey strategy is inspired by the "UI/Application Exerciser Monkey" distributed with the Android developer tools (https://developer.android.com/studio/test/monkey). This tool injects a pseudo-random stream of input events (e.g. touch-screen actions) and system-level events (e.g. incoming phone calls) into an app under test in an effort to cause it to crash.

Our chaos monkey exploration strategy samples controller states according to a platform-wide distribution of controller states fit to our human gameplay collections. Actions are selected independent of the game, the contents of the display, and of previous actions. Similar to the attract mode strategy, we extract moments every half-second, holding the controller state between steps.

Many of our games require the player to hit the controller's START button at least once during traversal of early game menus while not requiring this button later in play. As a result, our baseline chaos monkey strategy sometimes even triggers attract mode animations for failure to select this very rarely used input configuration in time. Certainly, there is much room for improvement.
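A sketch of this loop, reusing the session interface above; fit_button_distribution is a hypothetical helper that turns the recorded human controller states into an empirical sampling distribution.

```python
import random
from collections import Counter
from typing import Dict, List

def fit_button_distribution(human_inputs: List[Dict[str, bool]]):
    """Empirical, platform-wide distribution over controller states seen in
    the human gameplay recordings (each state reduced to its pressed buttons)."""
    counts = Counter(frozenset(b for b, on in state.items() if on)
                     for state in human_inputs)
    states, weights = zip(*counts.items())
    return list(states), list(weights)

def chaos_monkey(emu: EmulatorSession, human_inputs: List[Dict[str, bool]],
                 steps: int, frames_per_step: int = 30) -> ExplorationResult:
    """Sample a controller state independent of the game, hold it for half a
    second, and record the resulting moment."""
    result = ExplorationResult()
    states, weights = fit_button_distribution(human_inputs)
    for _ in range(steps):
        pressed = random.choices(states, weights=weights, k=1)[0]
        emu.set_buttons({button: True for button in pressed})
        emu.advance(frames_per_step)
        result.moments.append((emu.screenshot(), emu.memory()))
    return result
```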
Rapidly-Exploring Random Trees

Drawing on state-space exploration algorithms developed for robotics, this strategy uses the Rapidly-Exploring Random Trees algorithm (LaValle 1998). RRT is designed for exploring continuous state spaces of moderately-low dimensionality. In an application to robot arm movement, the state of the arm might be captured by joint angles. The ground-truth state of our videogames, however, spans a very high-dimensional discrete space: the possible configuration of each byte of main memory and other subsystems. (Even though algorithms like IW can naturally operate on discrete representations, these algorithms would not make use of the knowledge implicitly represented in moment vectors to judge the significance of observed differences in display pixels or memory bytes.) Even the space of display pixels is another high-dimensional discrete space. In response, we project screenshots into 256-dimensional moment vectors using the Pix2Mem deep convolutional network strategy (Zhan and Smith 2018). Pix2Mem is a representation learning strategy in which the contents of memory are predicted from screenshot pixels. The bottleneck layer of this network serves as a compact, semantic representation of what is happening in the screenshot.

RRT works by growing a tree that captures how to reach points of the game's state space from some initial state (for us, the moment the platform has booted with a game cartridge installed). Nodes in the tree are previously seen states, and edges (annotated with action sequences) describe how to get from one state to another. The algorithm continually picks a random goal location in the continuous moment vector space, finds which existing tree node is closest to it, and performs an action from that state in an attempt to get closer to the goal. The resulting action and state form a new tree edge.

This strategy makes extensive use of the ability to save and load game snapshots. To amortize the cost, the actions available to our RRT strategy operate over many frames. If snapshot loading were not available, states could be recreated by replaying actions along their path from the root node. In most applications of RRT, the selection of the action to take next is parameterized by both the current state and the goal location. However, as in the chaos monkey strategy, our baseline RRT strategy selects a random human-demonstrated action and holds it for a half-second.

In our work, RRT functions as both a source of data for knowledge extraction (producing screenshots and memory states on which to train) and an application for extracted knowledge (distances between moments in the game are judged by the similarity of their moment vectors). An experiment described later in this paper looks at the mutual benefit between Pix2Mem representation learning and RRT exploration.
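The following sketch shows the shape of this baseline loop, building on the earlier helpers. Here embed() stands in for the Pix2Mem moment-vector function, sample_human_action() for a draw from the recorded human controller states, and sample_goal() for a random point in the moment-vector space; none of these names come from the paper's own code. Passing seeds instead of a single root grows the rapidly-exploring random forest discussed under Meta Strategies below.

```python
import numpy as np

def rrt_explore(emu: EmulatorSession, embed, sample_human_action, sample_goal,
                steps: int, frames_per_step: int = 30,
                seeds=None) -> ExplorationResult:
    """Baseline RRT over moment-vector space.

    Each step: sample a random goal vector, find the nearest existing node,
    rewind the emulator to that node's snapshot, hold one random
    human-demonstrated controller state for half a second, and record the
    resulting moment as a new node (i.e. a new tree edge)."""
    result = ExplorationResult()
    if seeds is None:
        nodes = [(emu.save_snapshot(), embed(emu.screenshot()))]
    else:
        nodes = list(seeds)   # (snapshot_id, moment_vector) pairs from elsewhere

    for _ in range(steps):
        goal = sample_goal()
        nearest = min(nodes, key=lambda node: np.linalg.norm(node[1] - goal))
        emu.load_snapshot(nearest[0])

        emu.set_buttons(sample_human_action())
        emu.advance(frames_per_step)

        shot, mem = emu.screenshot(), emu.memory()
        result.moments.append((shot, mem))
        nodes.append((emu.save_snapshot(), embed(shot)))
    return result
```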
Meta Strategies

The strategies introduced above offer many opportunities for local improvements. Without improving any individual strategy, we want to highlight how elementary strategies might be synergistically combined. When multiple strategies are applied to the same game, we ask how one strategy can make use of another's exploration results.

Rather than always starting from the same boot state, algorithmic strategies can be used to create branches off of human volunteer data. Similarly, RRT can be modestly adapted to produce rapidly-exploring random forests that branch off of state-space bookmarks placed by other exploration strategies. Rather than choosing actions independent of the current game/state/goal, data from other strategies might be used to train a predictive model that selects a more relevant action by looking at the moment vectors of the current and goal states. One hybrid branching strategy is experimentally examined later in this paper.

Human volunteers, with access to a visualization of moments extracted by other strategies, might be able to find more interesting moments from which to start their own gameplay. This visualization might take the form of a t-SNE (Maaten and Hinton 2008) plot of previously-seen moments (making use of the rich moment vector representation). Humans and algorithms together might be able to reach corners of the space that neither would have explored on their own in the same amount of total gameplay time.

Exploration Metrics

How should we judge whether an exploration strategy is continuing to make progress or if one is making progress more efficiently than another? We would like to be able to directly observe how much territory is covered by a strategy by consulting a map. In our work, the space of rich, semantic moment vectors functions as this map.

By contrast to reinforcement learning research, we should emphasize that our goal is to assess the diversity of moments an exploration strategy has extracted, not to judge its gameplay behavior. Competent use of state saving/loading actions is highly relevant to exploration quality while not actually representing any in-game behavior at all.

Measuring the Spread of Points

Our preliminary exploration metrics are built on intuitions for how we might explore a low-dimensional, continuous space. Each moment extracted by an exploration strategy represents a point somewhere in this space. Intuitively, exploration strategies should produce a cloud of points that is well spread out in this space. How should we measure this?

One strategy is to draw an axis-aligned box (or hypercube) around all of the points and then to report the area (or hypervolume) of the box. As new points are explored and dropped into this space, the box can only grow when points fall outside the previous bounds, representing increasingly wider coverage. To account for degenerate boxes (which are well spread in some dimensions but not at all in others), we report the sum of the lengths of the sides of the box rather than their product. This is our bounding box sum metric.

Another strategy pays less attention to extreme points and more to the distribution of points within the cloud. We propose to fit a (likely highly anisotropic) multi-dimensional Gaussian distribution to the points. The directions of maximal variation (derivable from the covariance matrix of the point data) play a similar role to the sides of the bounding box above, with variance being analogous to hypervolume. Again, to account for potentially degenerate distributions (for which the data is spread out only in a lower-dimensional subspace), we simply sum the eigenvalues of the covariance matrix (finding its nuclear norm) rather than forming their product (the determinant). This is our nuclear norm metric.

By contrast with the bounding box sum metric, the nuclear norm metric may decrease as new points are explored if those points are relatively more concentrated than the initial set of points. This can happen for exploration strategies that spend a long time in one game mode or screen after initially traversing menus with rich visual variation.
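Both spread metrics have direct NumPy implementations over a matrix with one embedded moment per row (a sketch, not the authors' code). Note that because a covariance matrix is symmetric positive semi-definite, the sum of its eigenvalues equals its trace, so the nuclear norm is simply the total per-dimension variance.

```python
import numpy as np

def bounding_box_sum(points: np.ndarray) -> float:
    """Sum of the side lengths of the axis-aligned box enclosing all points.

    points has shape (n_moments, n_dims): one embedded moment per row."""
    return float(np.sum(points.max(axis=0) - points.min(axis=0)))

def nuclear_norm(points: np.ndarray) -> float:
    """Sum of the eigenvalues of the covariance matrix of the point cloud
    (equivalently, its trace: the summed per-dimension variances)."""
    centered = points - points.mean(axis=0)
    covariance = centered.T @ centered / max(len(points) - 1, 1)
    return float(np.sum(np.linalg.eigvalsh(covariance)))
```

Evaluating either function on growing prefixes of a strategy's moment collection yields progress-over-time curves like those plotted in the figures below.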
Embedding Game Moments into a Space

Given that we can evaluate the spread of points in some space, how do we embed explored game moments into that space in the first place? A useful embedding function should place similar moments nearby one another and dissimilar moments far apart if our intuitions about the spread of points are to be effective.

Consider a game-specific embedding function for a game like Super Mario World for the Super Nintendo Entertainment System. In one three-dimensional embedding, we might use one dimension to represent the game scene (an ordered index of game menus and then sequential level identifiers) and two more dimensions to represent the position of the player character within that scene. As an exploration strategy completes one scene and moves to the next, points on a new plane are discovered and discontinuous movement is represented in the positional dimensions. Although the truth of which scene we are in and where our character is in the scene could be computed from memory bytes, we consider a more general family of game-specific embeddings. In particular, we reuse the Pix2Mem embedding strategy (also used in the RRT strategy above). Zhan and Smith (2018) showed that this representation was useful for identifying moments by scene as well as position within scene. This is our game-specific embedding strategy.

To compare exploration across games and across game platforms, we cannot assume we have pre-established gameplay datasets available to train instances of Pix2Mem. Consider this game-independent three-dimensional embedding: the average RGB color of display pixels. When characters or other display elements move across colorful backgrounds by small amounts, the average color will only change a bit (if at all). Scene transitions, which might be marked by fades and flashes followed by new screens with different dominant colors, might be detected in this space. Three dimensions is clearly too small to be useful (particularly considering what flashes of all-white and all-black would do under the bounding box sum metric), but we can still recover the idea of a game-independent perceptual space. Recycling a specific pre-trained neural network (the Inception-V3 network trained to convergence on ImageNet, as distributed at https://keras.io/applications/#inceptionv3), we can embed screenshots into a 1000-dimensional space. Even though the output dimensions from this network are unrelated to the objects appearing in our videogame screenshots, it still seems to work in our experiments below. The all-white-all-black catastrophe for the bounding box sum metric, in particular, cannot occur with this embedding because the elements of the embedded vector are constrained to always add up to one. This is our game-independent embedding strategy.
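As an illustration, here is a sketch of both game-independent embeddings; the exact preprocessing is our assumption, using the Keras applications module mentioned above (loaded here via tensorflow.keras).

```python
import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from tensorflow.keras.preprocessing import image

def average_rgb_embed(frame: np.ndarray) -> np.ndarray:
    """Three-dimensional game-independent embedding: mean R, G, B of the frame."""
    return frame.reshape(-1, 3).mean(axis=0)

# include_top=True keeps the final 1000-way softmax, whose outputs sum to one
# (the property that rules out the all-white/all-black catastrophe above).
_inception = InceptionV3(weights="imagenet", include_top=True)

def inception_embed(screenshot_path: str) -> np.ndarray:
    """Embed one screenshot into the 1000-dimensional ImageNet prediction space."""
    img = image.load_img(screenshot_path, target_size=(299, 299))
    batch = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return _inception.predict(batch)[0]
```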
Fig. 1 compares our two spread metrics with two embedding functions for a run of the human volunteer exploration strategy. Using the game-specific embedding (Pix2Mem trained on data from a longer human play session), each key moment (identified by visual inspection of the gameplay video) is associated with kinks in the graph for both spread metrics. For the game-independent (Inception) embedding, broad trends are somewhat preserved. The nuclear norm metric temporarily decreases for the game-specific embedding but not the game-independent embedding because the game-specific embedding knows our human player is still focused on one specific level for a long time while the game-independent metric is presumably more sensitive to background-art details that are scrolling by as the level progresses.

Figure 1: Comparison of exploration quality metrics for a 5-minute run of human volunteer data in Super Mario World. The marked moments represent approximately when (A) the main menu fades in, (B) the overworld map fades in, (C) first core (platformer) gameplay begins, and (D) first core gameplay completes.

Experiments

This section describes experiments applying our exploration quality metrics to our elementary exploration strategies.

Core Gameplay Coverage

Can our automated exploration strategies make progress comparable to our human volunteers?

Fig. 2 visualizes the game-specific nuclear norm metric as a function of time for four exploration strategies. The attract mode strategy never starts core gameplay; however, it does experience a short recording of example gameplay in the game's attract mode looping behind the main menu. Chaos monkey manages to start core gameplay almost as fast as the human volunteer; however, it cannot sustain the diversity of human-demonstrated moments. RRT starts slow (often deciding to explore many moments selected from the animated transitions between scenes during which the player has no meaningful control) but eventually surpasses the spread from the other methods. Trends indicate that RRT would benefit from being able to explore for longer while the other strategies (including our human volunteer, one of the authors with imperfect gameplay skill) would not.

Figure 2: Exploration progress (by Pix2Mem nuclear norm metric) on Super Mario World by various strategies.

In a qualitative analysis of the screenshots extracted during exploration, we found that RRT could reach core gameplay in Super Mario World within two minutes when started from the game's boot state. When started from the first moment of core gameplay, RRT could finish the level within five minutes (less than 600 tree-expansion steps). Results like this encourage us to believe automatic exploration could be very useful for testing modern mobile games on a timescale that would be highly impractical for current deep reinforcement learning approaches.

This experiment anchors our exploration metrics in a qualitative analysis of the modes and levels of one well-known game. In the remaining experiments, we do not perform a qualitative analysis, as even the authors are not sufficiently familiar with the games in question to make a clear statement about what level of exploration would be enough for any given application.

Bootstrapping Perception

The experiment above executed RRT using a fixed embedding based on Pix2Mem trained on previous human gameplay for that specific game. Without falling back to a game-independent embedding (which would be reasonable anyway), can automatic exploration data itself be used as a source of training data for RRT's visual perception model?

In this experiment, we consider running RRT in the same game for 15 minutes at a time using different screenshot embedding functions. Our network starts untrained (weights randomly initialized). Although this embedding function has had no experience of the game under test (or any other game), it still supports automatic exploration, albeit at reduced efficiency. After the time is up, we incrementally train our network on the results of the exploration run and discard the tree structure. We repeat this process of exploring and retraining several more times. Even though the final run of the algorithm still experiences only the same amount of total gameplay time and starts from the same initial state, it makes better progress through the game as a result of a better (game-specific) visual perception model.

Fig. 3 shows the Inception nuclear norm metric (we used a game-independent metric for this experiment to avoid any confusion with metrics that might themselves be based on exploring this specific game) for four runs of RRT. Notice how more perceptual experience allows the fixed algorithm to explore both faster and deeper. The fact that the bulk of the gains are made after just the first 15 minutes of gameplay experience is promising for the use of game-specific embeddings even for games no human reviewers or testers have yet played (as in the app testing scenario).

Figure 3: RRT exploration progress (by Inception nuclear norm metric) for Super Mario World using self-play to train the perceptual embedding network.
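The bootstrapping loop itself is simple; below is a sketch in which the session factory, the untrained Pix2Mem constructor, the RRT runner, and the retraining step are all passed in as functions, since the paper does not specify those implementation details.

```python
def bootstrap_perception(make_session, make_untrained_embedder, run_rrt, retrain,
                         rounds: int = 4, minutes_per_round: int = 15):
    """Alternate exploring and retraining the perceptual model.

    Every round starts from the same boot state and explores with the current
    embedder for a fixed budget; afterwards the embedder is retrained on the
    collected (screenshot, memory) pairs and the tree itself is discarded."""
    embedder = make_untrained_embedder()        # randomly initialized Pix2Mem
    for _ in range(rounds):
        session = make_session()                # fresh boot state each round
        result = run_rrt(session, embedder, minutes=minutes_per_round)
        embedder = retrain(embedder, result.moments)
    return embedder
```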
Cooperation between Strategies

On a fixed budget, how should resources be allocated between different exploration strategies? Fig. 4 compares equal-time applications of chaos monkey and RRT with a hybrid strategy that switches from chaos monkey to RRT at the midpoint. The hybrid uses 100 randomly selected chaos monkey results to seed a forest of trees. As above, we show the Inception nuclear norm metric. Surprisingly, the hybrid strategy directly benefits from the quick start of chaos monkey while retaining the growth trend from RRT. The hybrid strategy covers more ground in equal time. In this case, only the set of seed states for exploration has been recycled, not a learned knowledge representation trained from chaos monkey's experience. In the previous experiment, we preserved the perceptual knowledge and not the concrete seed states.

Figure 4: Exploration progress (by Inception nuclear norm metric) for 30 total minutes of Super Mario World gameplay using chaos monkey, RRT, and a combination where RRT explores from points first found by chaos monkey.
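A sketch of this hybrid schedule, composed from the chaos monkey and RRT sketches above (again under the assumed session interface, with sample_goal as before a stand-in for drawing a random point in moment-vector space):

```python
import random

def hybrid_explore(emu: EmulatorSession, human_inputs, embed, sample_goal,
                   total_steps: int, frames_per_step: int = 30) -> ExplorationResult:
    """Chaos monkey for the first half of the budget, then an RRT forest grown
    from 100 randomly chosen chaos-monkey states."""
    result = ExplorationResult()
    states, weights = fit_button_distribution(human_inputs)

    # Phase 1: chaos monkey, bookmarking a snapshot at every extracted moment.
    seeds = []
    for _ in range(total_steps // 2):
        pressed = random.choices(states, weights=weights, k=1)[0]
        emu.set_buttons({button: True for button in pressed})
        emu.advance(frames_per_step)
        shot, mem = emu.screenshot(), emu.memory()
        result.moments.append((shot, mem))
        seeds.append((emu.save_snapshot(), embed(shot)))

    # Phase 2: reuse the RRT loop above, seeded with 100 of those bookmarks.
    sample_action = lambda: {b: True for b in
                             random.choices(states, weights=weights, k=1)[0]}
    forest = rrt_explore(emu, embed, sample_action, sample_goal,
                         steps=total_steps - total_steps // 2,
                         frames_per_step=frames_per_step,
                         seeds=random.sample(seeds, k=min(100, len(seeds))))
    result.moments.extend(forest.moments)
    return result
```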
Performance across Games and Platforms

Our final experiment compares the performance of our elementary exploration strategies across a number of commercial games selected from different BizHawk-supported game platforms. Because we use the same metric for each game (Inception nuclear norm after 10 minutes of simulated gameplay), scores may be compared across games and platforms. However, the range of scores depends both on game design issues (such as the degree to which the Inception network distinguishes visual variety in the game's display and how much variety can actually be experienced in just 10 minutes) and the utility of the exploration strategy. For randomized exploration strategies (chaos monkey and RRT), we average the score over 10 exploration runs.

Fig. 5 samples games from the Atari 2600 (128 bytes of memory), Game Boy (8 KiB), Super Nintendo (128 KiB), and Nintendo 64 (4 MiB). Notably, it is possible to surpass human exploration quality in some games, and no single strategy consistently over- or under-performs the others. From our bootstrapped perception experiment above, we know RRT has a slow start, particularly on the 10-minute scale and with a poor perceptual embedding (it uses the 1000-dimensional Inception embedding space in this experiment). We find these results encouraging for the possibility that modest improvements to our baseline strategies could yield human-comparable exploration on similar timescales.

Figure 5: Exploration progress after 10 minutes of simulated gameplay (by Inception nuclear norm metric) across games.

Conclusion

We introduce the problem of automatically exploring a game's state space with the goal of extracting a useful semantic map on timescales comparable to human playtesting efforts. We introduce elementary exploration strategies and demonstrate their application on an array of games and game platforms. To quantify exploration progress, we define four quality metrics that reveal different aspects of exploration patterns while exploiting pre-existing reference data when available. To compare exploration efficiency with human playtesters, we propose to compare quality metrics as a function of the total gameplay time used in exploration. Experimentally, we demonstrate that with a very modest amount of gameplay time, automatic exploration strategies can reach interesting gameplay moments (such as the completion of the first level in Super Mario World), bootstrap their own game-specific perception models, and leverage data produced by other exploration methods. Finally, we offer the first comparison of exploration progress across multiple games and platforms using a single platform-independent metric.
References

Amalfitano, D.; Fasolino, A. R.; Tramontana, P.; De Carmine, S.; and Memon, A. M. 2012. Using GUI ripping for automated testing of Android applications. In Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering, 258–261. ACM.

Aytar, Y.; Pfaff, T.; Budden, D.; Paine, T. L.; Wang, Z.; and de Freitas, N. 2018. Playing hard exploration games by watching YouTube. CoRR abs/1805.11592.

Bandres, W.; Bonet, B.; and Geffner, H. 2018. Planning with pixels in (almost) real time. CoRR abs/1801.03354.

Bauer, A. W., and Popovic, Z. 2012. RRT-based game level analysis, visualization, and visual refinement. In Proceedings of the AAAI Conference on Artificial Intelligence in Interactive Entertainment.

Bellemare, M.; Srinivasan, S.; Ostrovski, G.; Schaul, T.; Saxton, D.; and Munos, R. 2016. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, 1471–1479.

Browne, C. B.; Powley, E.; Whitehouse, D.; Lucas, S. M.; Cowling, P. I.; Rohlfshagen, P.; Tavener, S.; Perez, D.; Samothrakis, S.; and Colton, S. 2012. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games 4(1):1–43.

Burda, Y.; Edwards, H.; Pathak, D.; Storkey, A.; Darrell, T.; and Efros, A. A. 2018. Large-scale study of curiosity-driven learning. arXiv preprint arXiv:1808.04355.

Chen, K.; Wang, P.; Lee, Y.; Wang, X.; Zhang, N.; Huang, H.; Zou, W.; and Liu, P. 2015. Finding unknown malice in 10 seconds: Mass vetting for new threats at the Google-Play scale. In 24th USENIX Security Symposium (USENIX Security 15), 659–674. Washington, D.C.: USENIX Association.

Feng, Y.; Anand, S.; Dillig, I.; and Aiken, A. 2014. Apposcopy: Semantics-based detection of Android malware through static analysis. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2014, 576–587. New York, NY, USA: ACM.

Joulin, A.; Grave, E.; Bojanowski, P.; and Mikolov, T. 2016. Bag of tricks for efficient text classification. CoRR abs/1607.01759.

Kaltman, E.; Osborn, J.; Wardrip-Fruin, N.; and Mateas, M. 2017. Getting the GISST: A toolkit for the creation, analysis and reference of game studies resources. In Proceedings of the 12th International Conference on the Foundations of Digital Games, FDG '17, 16:1–16:10. New York, NY, USA: ACM.

Kavraki, L. E.; Svestka, P.; Latombe, J.-C.; and Overmars, M. H. 1996. Probabilistic roadmaps for path planning in high-dimensional configuration spaces. IEEE Transactions on Robotics and Automation 12(4):566–580.

LaValle, S. M. 1998. Rapidly-exploring random trees: A new tool for path planning. Technical Report TR 98-11, Computer Science Department, Iowa State University.

Maaten, L. v. d., and Hinton, G. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9(Nov):2579–2605.

Merrick, K. E., and Maher, M. L. 2009. Motivated reinforcement learning: curious characters for multiuser games. Springer Science & Business Media.

Mikolov, T.; Yih, W.-t.; and Zweig, G. 2013. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 746–751.

Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature 518(7540):529.

Mouret, J.-B., and Clune, J. 2015. Illuminating search spaces by mapping elites. arXiv preprint arXiv:1504.04909.

Osborn, J. C.; Summerville, A.; and Mateas, M. 2017a. Automated game design learning. In 2017 IEEE Conference on Computational Intelligence and Games (CIG), 240–247.

Osborn, J.; Summerville, A.; and Mateas, M. 2017b. Automatic mapping of NES games with Mappy. In FDG '17 Proceedings of the 12th International Conference on the Foundations of Digital Games. ACM.

Pugh, J. K.; Soros, L. B.; and Stanley, K. O. 2016. Quality diversity: A new frontier for evolutionary computation. Frontiers in Robotics and AI 3:40.

Shen, Y.-C.; Chien, R.; and Hung, S.-H. 2014. Toward efficient dynamic analysis and testing for Android malware. IT CoNvergence PRActice (INPRA) 2(3):14–23.

Zhan, Z., and Smith, A. M. 2018. Retrieving game states with moment vectors. In Proceedings of the AAAI 2018 Workshop on Knowledge Extraction from Games.

Zhang, X.; Zhan, Z.; Holtz, M.; and Smith, A. M. 2018. Crawling, indexing, and retrieving moments in videogames. In FDG '18 Proceedings of the International Conference on the Foundations of Digital Games. ACM.

Zook, A.; Harrison, B.; and Riedl, M. 2015. Monte-Carlo tree search for simulation-based play strategy analysis. In Proceedings of the 10th International Conference on the Foundations of Digital Games, FDG 2015, Pacific Grove, CA, USA, June 22-25, 2015.