Co-Creative Level Design via Machine Learning

Matthew Guzdial, Nicholas Liao, and Mark Riedl
College of Computing
Georgia Institute of Technology
Atlanta, GA 30332
mguzdial3@gatech.edu, nliao7@gatech.edu, riedl@cc.gatech.edu

Abstract

Procedural Level Generation via Machine Learning (PLGML), the study of generating game levels with machine learning, has received a large amount of recent academic attention. For certain measures these approaches have shown success at replicating the quality of existing game levels. However, it is unclear to what extent they might benefit human designers. In this paper we present a framework for co-creative level design with a PLGML agent. In support of this framework we present results from a user study and results from a comparative study of PLGML approaches.

Introduction

Procedural content generation via Machine Learning (PCGML) has drawn increasing academic interest in recent years (Summerville et al. 2017). In PCGML a machine learning model trains on some existing corpus of game content to learn a distribution over possible game content. New content can then be sampled from this distribution. This approach has shown some success at replicating existing game content, particularly game levels, according to user studies (Guzdial and Riedl 2016) and quantitative metrics (Snodgrass and Ontanón 2017; Summerville 2018). The practical application of PCGML approaches has not yet been investigated. One might naively suggest that PCGML could serve as a cost-cutting measure given its ability to generate new content that matches existing content. However, this requires a large corpus of existing game content. If designers for a new game produced such a corpus, they might as well use that corpus for the final game. Beyond this issue, a learned distribution is not guaranteed to contain a designer's desired output.

A co-creative framework could act as an alternative to asking designers to find desired output from a learned distribution. In a co-creative framework, also called mixed initiative, a human and an AI partner work together to produce the final content. In this way, it does not matter if an AI partner is incapable of creating some desired output alone.

In this paper we propose an approach to co-creative PCGML for level design, or Procedural Level Generation via Machine Learning (PLGML). In particular, we intend to demonstrate the following points: (1) existing methods are insufficient for co-creative level design, and (2) co-creative PLGML requires training on examples of co-creative PLGML or an approximation. In support of this argument we present results from a user study in which users interacted with existing PLGML approaches adapted to co-creation, and quantitative experiments comparing these existing approaches to approaches designed for co-creation.

Related Work

The concept of co-creative PCGML has been previously discussed in the literature (Summerville et al. 2017; Zhu et al. 2018), but no prior approaches or systems exist. Comparatively, there exist many prior approaches to co-creative or mixed-initiative level design agents without machine learning (Smith, Whitehead, and Mateas 2010; Yannakakis, Liapis, and Alexopoulos 2014; Deterding et al. 2017). Instead, these systems rely upon search- or grammar-based approaches (Liapis, Yannakakis, and Togelius 2013; Shaker, Shaker, and Togelius 2013; Baldwin et al. 2017). Thus these approaches require significant developer effort to adapt to a novel game.
User Study

As an initial exploration into co-creative level design via machine learning we conducted a user study. We began by taking existing procedural level generation via machine learning (PLGML) approaches and adapting them to co-creation. We call these adapted approaches AI level design partners. Our intention with these partners is to determine the strengths and weaknesses of these existing approaches when applied to co-creation and the extent to which these existing approaches are sufficient for this task.

We make use of Super Mario Bros. as the domain for this study and later experiments, given that all three of the existing PLGML approaches had previously been applied to this domain. Further, we anticipated its popularity would lead to more familiarity from our study participants.

Level Design Editor

To run our user study we needed a level design editor to serve as an interface between participants and the AI level design partners. For this purpose we made use of the editor from (Guzdial et al. 2017), which is publicly available online.1 We reproduce a screenshot of the interface from the paper in Figure 1. The major parts of the interface are as follows:

• The current level map in the center of the interface, which allows for scrolling side-to-side.
• A minimap on the bottom left of the interface; users can click on this to jump to a particular place in the level.
• A palette of level components or sprites in the middle of the bottom row.
• An "End Turn" button on the bottom right. By pressing this End Turn button the current AI level design partner is queried for an addition. A pop-up appears while the partner processes, and then its additions are added sprite-by-sprite to the main screen. The camera scrolls to follow each addition, so that the user is aware of any changes to the level. The user then regains control and level building continues in this turn-wise fashion.

Figure 1: Screenshot of the Level Editor, reproduced from (Guzdial et al. 2017).

At any time during the interaction users can hit the top left "Run" button to play through the current version of the level. A backend logging system tracks all events, including additions and deletions and which entity (human or AI) was responsible for them.

1 https://github.com/mguzdial3/Morai-Maker-Engine
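To make this turn-based protocol concrete, the following is a small illustrative sketch of how such an editor might log events and hand control to an AI partner when "End Turn" is pressed. All class and method names here are hypothetical and are not taken from the Morai-Maker-Engine code.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class LogEvent:
    source: str                  # "human" or "ai"
    kind: str                    # "addition" or "deletion"
    sprite: str                  # sprite name from the editor palette
    position: Tuple[int, int]    # (x, y) tile coordinate

@dataclass
class Session:
    events: List[LogEvent] = field(default_factory=list)

    def record(self, source, kind, sprite, position):
        self.events.append(LogEvent(source, kind, sprite, position))

    def end_turn(self, level_state, partner):
        """Query the AI partner once and log each addition it proposes.

        `partner.propose_additions` is an assumed interface standing in for
        whichever PLGML agent is currently acting as the design partner.
        """
        for sprite, position in partner.propose_additions(level_state):
            self.record("ai", "addition", sprite, position)
```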
AI Level Design Partners

For this user study we created three AI agents to serve as level design partners. Each is based on a previously published PLGML approach, adapted to work in an iterative manner to fit the requirements of our level editor interface. We lack the space to fully describe each system, but we cover a high-level summary of the approaches and our alterations below.

• Markov Chain: This approach is a Markov chain based on Snodgrass and Ontanón (2014), built from Java code supplied by the authors. It trains on existing game levels by deriving all 2-by-2 squares of tiles and deriving probabilities of a final tile from the remaining three tiles in the square (a minimal sketch of this sampling scheme appears after this list). We made use of the same representation as that paper, which represented elements like enemies and solid tiles as equivalent. To convert this representation to the editor representation we applied rules to determine the appropriate sprite from the solid tile class based on its position, and chose randomly from available enemies for the enemy class (with the stipulation that flying enemies could only appear in the air). Otherwise, our only variation from this baseline was to limit the number of newly generated tiles to a maximum of thirty per turn.

• Bayes Net: This approach is a probabilistic graphical model or hierarchical Bayesian network based on Guzdial and Riedl (2016). It derives shapes of sprite types and samples from a probability of relative positions to determine the next sprite shape to add and where. This approach was originally trained on gameplay video, thus we split each level into a set of frame-sized chunks and generated an additional shape for each chunk. This approach was already iterative and so naturally fit into the turn-based level design format. We do not limit the number of additions, but the agent only made additions when there was a sufficient probability, and thus almost always produced fewer additions than the other agents.

• LSTM: This approach is a Long Short-Term Memory Recurrent Neural Network (LSTM RNN or just LSTM) based on Summerville and Mateas (2016), recreated in Tensorflow from the information given in the paper and training data supplied by the authors. It takes as input a game level represented as a sequence and outputs the next tile type. We modified this approach to a bidirectional LSTM given that it was collaborating and not just building a level from start to end. We further modified the approach to only make additions to a 65-tile wide chunk of the level, centered on the user's current camera placement in the editor. As with the Markov Chain we limited the additions to 30 at most, and converted from the agent's abstract representation to the editor representation according to the same process.

We chose these three approaches as they represent the most successful prior PLGML approaches in terms of depth and breadth of evaluations. Further, each approach is distinct from the other two. For example, each approach differs in terms of local vs. global reasoning, ranging from the hyper-local Markov Chain (only generating based on a 2x2 square) to the much more global LSTM, which reads in almost the entirety of the current level. Notably, because all three approaches were previously used for autonomous generation, the agents could only make additions to the level, never any deletions. We chose not to include deletions in order to minimize the damage an agent could cause to a user's intended design of a level.
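As a concrete illustration of the 2-by-2 sampling described for the Markov Chain partner, the sketch below counts how often each tile completes a 2-by-2 square given the other three tiles, and then samples a completion. The tile-grid layout and the fallback behavior for unseen contexts are our assumptions; this is a minimal sketch, not the authors' Java implementation.

```python
import random
from collections import defaultdict

def train_markov(levels):
    """Count, for every 2x2 square, how often each tile completes the square
    given the other three tiles (left, top-left, top)."""
    counts = defaultdict(lambda: defaultdict(int))
    for level in levels:                       # level: 2D list of tile symbols
        for y in range(1, len(level)):
            for x in range(1, len(level[0])):
                context = (level[y][x - 1], level[y - 1][x - 1], level[y - 1][x])
                counts[context][level[y][x]] += 1
    return counts

def sample_tile(counts, context, default="empty"):
    """Sample the final tile of a 2x2 square from the learned distribution."""
    options = counts.get(context)
    if not options:
        return default                         # unseen context: assumed fallback
    tiles, weights = zip(*options.items())
    return random.choices(tiles, weights=weights)[0]
```

In our adaptation, additions sampled in this way were capped at thirty new tiles per turn, as noted above.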
Study Method

Each study participant went through the same process. First, they were given a short tutorial on the level editor and its function. They then interacted with two distinct AI partners back-to-back. The partners were assigned at random from the three possible options. During each interaction, the user was assigned one of two possible tasks, either to create an above ground or a below ground level. We supplied two optional examples, the first two levels of each type taken from the original Super Mario Bros. This leads to a total of twelve possible conditions in terms of the pair of partners, the order of the pair, and the order of the level design assignments.

Participants were given a maximum of fifteen minutes for each task, though most participants finished well before then. Participants were asked to press the "End Turn" button to interact with their AI partner at least once. Those who did not do so had their results thrown out.

After both rounds of interaction participants took a brief survey in which they ranked the two partners they interacted with in terms of fun, frustration, challenge to work with, the partner that most aided the design, the partner that led to the most surprising or valuable ideas, and which of the two partners the participant would most like to use again. We also gave participants the option to leave a comment reflecting on each agent. The survey ended by collecting demographic data including experience with level design, Super Mario Bros., games in general, the participant's gender (we collected gender in a free response field), and age.

Results

In this subsection we discuss an initial analysis of the results of our user study. Overall 91 participants took part in this study. However, seven of these participants did not interact with one or both of their partners, and we removed them from our final data. The remaining 84 participants were split evenly between the twelve possible conditions, meaning a total of seven participants for each condition.

62% of our respondents had previously designed Mario levels at least once before. This is likely due to prior experience playing Mario Maker, a level design game/tool released by Nintendo on the Wii U. Our subjects were nearly evenly split between those who had never designed a level before (26%), had designed a level once before (36%), or had designed multiple levels in the past (38%). All but 7 of the subjects had previously played Super Mario Bros., and all the subjects played games in general regularly.

Figure 2: Examples of six final levels from our study, each pair of levels from a specific co-creative agent: Markov Chain (top), Bayes Net (middle), and LSTM (bottom). These levels were selected at random from the set of final levels, split by co-creative agent.

Our first goal in analyzing our results was to determine whether the level design task (above or underground) mattered and whether the ordering of the pair of partners mattered. We ran a one-way repeated measures ANOVA and found that neither variable led to any significance. Thus, we can safely treat our data as having only three conditions, dependent on the pair of partners each subject interacted with.

We give the ratio of first place to second place rankings for each partner in Table 1.

Table 1: The ratio by which each system was ranked first or second for each survey question.

              Most Fun  Most Frustrating  Most Challenging  Most Aided  Most Creative  Reuse
Markov Chain  33:23     26:30             29:27             30:26       33:23          32:24
Bayes Net     27:29     26:30             20:36             31:25       29:27          28:28
LSTM          24:32     32:24             35:21             23:33       22:34          24:32

One can read these results as the Markov Chain agent being generally preferred, though more challenging to use. Comparatively, the Bayes net agent was considered less challenging to use, but also less fun, with subjects less likely to want to reuse the agent. The LSTM, on the other hand, had the worst reaction overall.

The ratio of ranking results would seem to indicate a clear ordering of the agents. However, this is misleading. We applied the Kruskal-Wallis test to the results of each question and found it unable to reject the null hypothesis that all of the results from all separate agents arose from the same distribution. This indicates that in fact the agents are too close in performance to state a significant ordering.
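For reference, a test of this kind can be run with SciPy roughly as follows; the ranking values shown are placeholders rather than our survey data.

```python
from scipy import stats

# Placeholder first/second-place rankings (1 = ranked first, 2 = ranked second)
# for a single survey question, grouped by agent; real values come from the survey.
markov_ranks = [1, 2, 1, 1, 2, 1]
bayes_ranks = [2, 1, 1, 2, 2, 1]
lstm_ranks = [2, 2, 1, 2, 1, 2]

statistic, p_value = stats.kruskal(markov_ranks, bayes_ranks, lstm_ranks)
print(statistic, p_value)  # p >= 0.05 -> cannot reject a shared distribution
```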
In fact, many subjects greatly preferred the LSTM agent over the other two, stating that it was "Pretty smart overall, added elements that collaborate well with my ideas" and "This agent seemed to build towards an 'idea' so to speak, by adding blocks in the interesting ways".

User Study Results Discussion

These initial results of our user study do not indicate a clearly superior agent. Instead, they suggest that individual participants varied in terms of their preferences. This matches our own experience with the agents. When attempting to build a very standard Super Mario Bros. level, the LSTM agent performed well. However, as is common with deep learning methods it was brittle, defaulting to the most common behavior (e.g. adding ground or blocks) when confronted with unfamiliar input. In comparison the Bayes net agent was more flexible, and the Markov Chain agent more flexible still, given its hyper-local reasoning.

We include two randomly selected levels for each agent in Figure 2. They clearly demonstrate some departures from typical Super Mario Bros. levels, meaning none of these levels could have been generated by any of these agents alone. Given this, and the results of the prior section, we have presented some evidence towards the first part of our argument, that existing methods are insufficient to handle the task of co-creative level design. By which we mean, no existing agents are able to handle the variety of human level design or human preferences when it comes to AI agent partners. We will present further evidence towards this and the second point in the following sections.

Proposed Co-Creative Approach

The results of the prior section indicate a need for an approach designed for co-creative PLGML instead of one adapted from autonomous PLGML. In particular, given that none of our existing agents were able to sufficiently handle the variety of participants, we expect a need for an ideal partner to either more effectively generalize across all potential human designers or to adapt to a human designer actively during the design task. We present a proposed architecture based on the results of the user study, and present both pre-trained and active learning variations to investigate these possibilities.

Dataset

For the remainder of this paper we make use of the results of the user study as a dataset. In particular, as stated in the Level Design Editor subsection, we logged all actions by both human and AI agent partners. These logs can be considered representations of the actions taken during each partner's turns. We also have final scores in terms of the user rankings. These final scores could serve as reward or feedback to a supervised learning system; however, we would ideally like some way to assign partial credit to all of the actions the AI agent took to receive those final scores. Towards this purpose we decided to model this problem as a general semi-Markov Decision Process (SMDP) with concurrent actions, as in (Rohanimanesh and Mahadevan 2003).

Our SMDP with concurrent actions is from the AI partner's perspective, given that we wish to use it to train a new AI partner. It has the following components:

• State: We represent the level at the end of each human user turn as the state.
• Action: Each single addition by the agent per turn then becomes a primary action, with the total turn representing the concurrent action.
• Reward: For the reward we make use of the Reuse ranking, as it represents our desire that the agent be helpful and usable first and foremost. In addition, we include a small negative reward (-0.1) if the user deletes an addition made by the AI partner. We make use of a γ value of 0.1 in order to determine partial credit across the sequences of AI partner actions (a sketch of one possible credit assignment appears after this list).
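A minimal sketch of one way this partial-credit assignment could look is given below, assuming the Reuse ranking maps to a final reward of +1 (ranked first) or -1 (ranked second) and that credit is discounted backwards from the end of the interaction. It is illustrative rather than the exact scheme used for our experiments.

```python
GAMMA = 0.1            # discount used to spread credit across the AI partner's turns
DELETE_PENALTY = -0.1  # small negative reward when the user deletes an AI addition

def turn_rewards(turns, reuse_reward):
    """Assign a reward to each AI partner turn in an interaction.

    turns: list of dicts, one per AI turn, each with a 'deleted_additions' count
           (an assumed summary of the logs).
    reuse_reward: +1 if the partner was ranked first for reuse, -1 if second.
    Credit for the final ranking is discounted backwards from the last turn.
    """
    rewards = []
    for steps_from_end, turn in enumerate(reversed(turns)):
        reward = reuse_reward * (GAMMA ** steps_from_end)
        reward += DELETE_PENALTY * turn["deleted_additions"]
        rewards.append(reward)
    return list(reversed(rewards))

# Example: a partner ranked first whose final turn had one deleted addition.
print(turn_rewards([{"deleted_additions": 0}, {"deleted_additions": 1}], reuse_reward=1))
```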
Due to some network drops, some of the logs for our study were corrupted. Thus we ended up with 122 final sequences from our logs. We split this dataset into an 80-20 train-test split by participant, ensuring that our test split only included participants for whom the logs from both interactions were uncorrupted. Thus we had the logs of 11 participants held out for testing purposes.

We further divided each state-action-reward triplet such that we represent each state as a 40x15x32 matrix and each action as a 40x15x32 matrix. The state represents a screen's worth of the current level (40x15), and the action represents the additions made over that chunk of level. The 32 in this case is a one-hot encoding of sprites, based on the 32 possible sprites in the editor's sprite palette. We did this in order to further increase the amount of training data. This led to a total of 1501 training samples and 242 test samples.
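A sketch of how a level chunk could be encoded into such a matrix is given below; the placement-list format and the sprite-to-channel mapping are illustrative assumptions rather than our exact data pipeline.

```python
import numpy as np

WIDTH, HEIGHT, NUM_SPRITES = 40, 15, 32   # screen-sized chunk, one-hot sprite depth

def encode_chunk(placements, sprite_to_index):
    """Encode a list of (x, y, sprite_name) placements as a 40x15x32 one-hot matrix.

    sprite_to_index maps the 32 palette sprites to channel indices; both the
    placement format and the mapping are assumptions for illustration.
    """
    matrix = np.zeros((WIDTH, HEIGHT, NUM_SPRITES), dtype=np.float32)
    for x, y, sprite in placements:
        if 0 <= x < WIDTH and 0 <= y < HEIGHT:
            matrix[x, y, sprite_to_index[sprite]] = 1.0
    return matrix

# state  = encode_chunk(level_contents_at_end_of_user_turn, sprite_to_index)
# action = encode_chunk(ai_additions_over_that_chunk, sprite_to_index)
```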
The final layer is the participant removed or that correspond with a final -1 a fully connected layer followed by a reshape to place the reward. Further, it is possible to end up with a summed re- output in the form of the action matrix (40x15x32). Each ward of 0 if the agent takes actions that we cannot assign layer made use of leaky relu activation, meaning that each any reward. For example, if we know that a human partici- index of the final matrix could vary from -1 to 1. We made pant doesn’t want an enemy, but the agent adds a pipe. We use of mean square loss and adam as our optimizer, with the cannot estimate reward in this case. Finally, it is possible network built in Tensorflow (Abadi et al. 2016). We trained to end with a summed reward much larger than 1.0 given this model to the point of convergence in terms of training a large number of actions that encompassed a large amount set error. of the level (thus many 40x15x32 testing chunks). The final row indicates the average percentile performance our of the Pretrained Evaluation maximum possible reward for each participant, since once normalized we can average these results to present them in For our first evaluation we compared the total reward ac- aggregate. crued on the test set across our 242 withheld test samples. In The numbers in Table 2 cannot be compared between comparison we make use of four baselines, the three existing rows given how different the possible rewards and actions agents and one variation on our approach. of each participant was. However, we can compare between For the variation on our approach, we instead trained on columns. For the final row, our approach and the SMB vari- a dataset created from the existing levels of Super Mario ation are the only two approaches on average to receive pos- Bros. (SMB), represented in our SMDP format. To accom- itive reward. We note that the Markov Chain partner does plish this, we derived all 40x15x32 chunks of SMB levels. well for some individuals, but overall has a worse perfor- We then removed all sprites of each single type from that mance than the LSTM agent. The Bayes net agent may ap- chunk, which became our state, with the action being the pear to do better, but this is largely because it either predicted addition of those sprites. We made the assumption that each nothing for each action or something for which the dataset action should receive a reward of 1, given that it would lead did not have a reward. We note that participant 2 in the Table to a complete Super Mario Bros. level. received a summed reward of 0.0 for all the approaches, but This evaluation can be understood as running these five this is because that participant only interacted with their two agents (our approach, the SMB variation, and the three al- agents once and did not make any deletions. ready introduced agents) through a simulated interaction with the held out test set of eleven participants. This is not a perfect simulation, given that we cannot estimate reward Active Evaluation without user feedback. However, given the nature of our re- The prior evaluation demonstrates that by training on a ward function, actions that we cannot assign reward to will dataset or approximated dataset of co-creative interactions receive 0.0. This makes the final amount of reward each one can outperform machine learning approaches trained agent receives a reasonable estimate of how each person to autonomously produce levels. This suggests these ap- might respond to the agent. 
Pretrained Evaluation

For our first evaluation we compared the total reward accrued on the test set across our 242 withheld test samples. In comparison we make use of four baselines: the three existing agents and one variation on our approach.

For the variation on our approach, we instead trained on a dataset created from the existing levels of Super Mario Bros. (SMB), represented in our SMDP format. To accomplish this, we derived all 40x15x32 chunks of SMB levels. We then removed all sprites of each single type from a chunk, which became our state, with the action being the addition of those sprites. We made the assumption that each action should receive a reward of 1, given that it would lead to a complete Super Mario Bros. level.

This evaluation can be understood as running these five agents (our approach, the SMB variation, and the three already introduced agents) through a simulated interaction with the held-out test set of eleven participants. This is not a perfect simulation, given that we cannot estimate reward without user feedback. However, given the nature of our reward function, actions that we cannot assign reward to will receive 0.0. This makes the final amount of reward each agent receives a reasonable estimate of how each person might respond to the agent.

The second claim we made was that co-creative PLGML requires training on examples of co-creative PLGML or an approximation. Thus our proposed approach can be considered the former of these two, and the variation of our approach trained on the Super Mario Bros. dataset the latter. If these two approaches outperform the three baselines we will have evidence for this, and for our first claim that existing PLGML methods are insufficient for co-creation.

Pretrained Evaluation Results

We summarize the results of this evaluation in Table 2. The columns represent, in order, the results of our approach, the SMB-trained variation of our approach, the Markov Chain baseline, the Bayes net baseline, and the LSTM baseline. The rows represent the results separated by each participant in our test set. We separate the results in this way given the variance each participant displayed, and since the total possible reward depends upon the number of interactions, which differed between participants. Further, each participant must have given both a positive and a negative final reward (ranking agents first and second in terms of reuse). For this reason we present the results in terms of summed reward per participant. Thus, higher is better. It is possible for an agent to achieve a negative reward if it places items that the participant removed or that correspond with a final -1 reward. Further, it is possible to end up with a summed reward of 0 if the agent takes actions to which we cannot assign any reward; for example, if we know that a human participant does not want an enemy, but the agent adds a pipe, we cannot estimate reward in this case. Finally, it is possible to end with a summed reward much larger than 1.0 given a large number of actions that encompassed a large amount of the level (thus many 40x15x32 testing chunks). The final row indicates the average percentage of the maximum possible reward achieved for each participant, since once normalized we can average these results to present them in aggregate.

Table 2: The summed reward each agent receives on the test data.

Participant  Ours   SMB    Markov Chain  Bayes Net  LSTM
0            1.45   7.34   10.0          0.00       10.0
1            1.32   -4.63  -4.00         -1.00      -6.00
2            0.00   0.00   0.00          0.00       0.00
3            -0.53  -1.57  0.00          0.00       -3.00
4            0.01   0.31   0.00          0.00       0.00
5            5.50   1.36   0.00          0.00       1.00
6            0.29   -0.07  0.00          0.00       0.00
7            0.10   1.00   2.00          0.00       1.00
8            -0.14  -10.1  -60.1         0.00       -40.2
9            3.85   14.0   0.00          0.00       -1.10
10           -3.01  -5.89  0.00          0.00       0.00
Avg %        53.9   0.8    -0.6          -0.0       -0.5

The numbers in Table 2 cannot be compared between rows given how different the possible rewards and actions of each participant were. However, we can compare between columns. For the final row, our approach and the SMB variation are the only two approaches on average to receive positive reward. We note that the Markov Chain partner does well for some individuals, but overall has a worse performance than the LSTM agent. The Bayes net agent may appear to do better, but this is largely because it either predicted nothing for each action or something for which the dataset did not have a reward. We note that participant 2 in the table received a summed reward of 0.0 for all the approaches, but this is because that participant only interacted with their two agents once and did not make any deletions.
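The per-participant summed rewards above can be thought of as being computed by a loop of roughly the following shape; the data structures (a reward lookup per test sample and a thresholded action matrix) are simplifications for illustration, not our exact evaluation code.

```python
import numpy as np

def summed_reward(predict_action, test_samples):
    """Sum the reward an agent would accrue over one participant's test samples.

    predict_action maps a 40x15x32 state matrix to a 40x15x32 action matrix.
    Each test sample pairs a state with a reward lookup from (x, y, sprite)
    additions to logged rewards; additions with no logged reward count as 0.0.
    Both structures are assumptions made for this sketch.
    """
    total = 0.0
    for state, reward_lookup in test_samples:
        action = predict_action(state)
        # Treat sufficiently activated cells as predicted additions (threshold assumed).
        for x, y, sprite in zip(*np.nonzero(action > 0.5)):
            total += reward_lookup.get((int(x), int(y), int(sprite)), 0.0)
    return total
```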
Active Evaluation

The prior evaluation demonstrates that by training on a dataset or an approximated dataset of co-creative interactions one can outperform machine learning approaches trained to autonomously produce levels. This suggests these approaches do a reasonable job of generalizing across the variety of interactions in our training dataset. However, if designers vary extremely from one another, generalizing too much between designers will actively harm a co-creative agent's potential performance. This second comparative evaluation tests whether this is the case.

For this evaluation we create two active learning variations of our approach. For both, after making a prediction and receiving reward for each test sample we then train on that sample for one epoch. In the first, we reset the weights of our network to the final weights after training on our training set after every participant (we call this variation "Episodic"). In the second, we never reset the weights, allowing the agent to learn and generalize more from each participant it interacts with (we call this variation "Continuous"). If it is the case that user designs vary too extremely for an approach to generalize between them, then we would anticipate "Continuous" to do worse, especially as it gets to the end of the sequence of participants.
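A sketch of the two update schedules is given below, reusing the hypothetical tf.keras model from the Architecture sketch; the `score_fn` argument standing in for the reward estimate is an assumed helper, not part of our system.

```python
def run_active_evaluation(model, participants, score_fn, continuous=False):
    """Simulate the Episodic and Continuous active learning variations.

    participants: list of per-participant sample lists; each sample is a
    (state, target_action) pair of 40x15x32 numpy arrays (assumed format).
    In the Episodic variation (continuous=False) the weights are reset to the
    pretrained weights after every participant.
    """
    pretrained_weights = model.get_weights()
    per_participant_rewards = []
    for samples in participants:
        if not continuous:
            model.set_weights(pretrained_weights)   # Episodic: start fresh each time
        total = 0.0
        for state, target in samples:
            prediction = model.predict(state[None], verbose=0)[0]
            total += score_fn(prediction, target)
            model.fit(state[None], target[None], epochs=1, verbose=0)  # one-epoch update
        per_participant_rewards.append(total)
    return per_participant_rewards
```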
Table 3: A comparison of two variations on an active learning version of our agent against the non-active version.

Participant  Ours   Episodic  Continuous
0            1.45   1.47      1.47
1            1.32   -11.7     -10.1
2            0.00   0.00      0.00
3            -0.53  0.94      1.08
4            0.01   -0.05     -0.25
5            5.50   5.50      -7.55
6            0.29   0.29      0.04
7            0.10   0.10      -0.04
8            -0.14  5.22      0.42
9            3.85   42.7      41.0
10           -3.01  -3.76     -4.62
Avg %        53.9   56.6      53.1

Active Evaluation Results

We summarize the results of this evaluation in Table 3. We replicate the results of the non-active learning version of our approach from Table 2. Overall, these results support our hypothesis. The average percentage of the maximum possible reward increased by roughly three percentage points from the non-active version to the episodic active learner, and decreased by roughly a percentage point for the continuous active learner. The continuous active learner did worse than either the episodic active learner or our non-active learner for six of the eleven participants. This indicates that participants do tend to vary too much to generalize between, at least for our current representation.

Overall, it appears that some participants were easier to learn from than others. For example, participants 1, 4, and 10 all did worse with agents attempting to adapt to them during the simulated interaction. However, participants 8 and 9 both seemed well-suited to adaptation, given that their scores increased more than tenfold from the non-active learner. This follows from the fact that these two participants had the second most and the most interactions, respectively, across the test participants. This suggests that these agents can adapt to a human designer given sufficient interaction.

Discussion and Limitations

In this paper we presented results towards an argument for co-creative level design via machine learning. We presented evidence from a user study and two comparative experiments that (1) current approaches to procedural level generation via machine learning are insufficient for co-creative level design and (2) co-creative level design requires training on a dataset or an approximated dataset of co-creative level design. In support, we demonstrate that no current approach significantly outperforms the remaining approaches, and in fact that users are too varied for any one model to meet an arbitrary user's needs. Instead, we anticipate the need to apply active learning to adapt a general model to particular individuals.

We present a variety of evidence towards our stated claims. However, we note that we only present evidence in the domain of Super Mario Bros. Further, while our comparative evaluations had strong results, these can only be considered simulations of user interaction. In particular, our simulated test interactions essentially assume users will create the same final level no matter what the AI partner does. To fully validate these results we will need to run a new user study. We anticipate running a follow-up study in order to verify these results.

Beyond a follow-up user study, we also hope to investigate ways of speeding up the process of creating co-creative level design partners. Under the process described in this paper, one would have to run a 60+ participant user study with three different naive AI partners every time one wanted a co-creative level design partner for a new game. We plan to investigate transfer learning and other ways to approximate co-creative datasets from existing corpora. Further, we anticipate a need for explainable AI in co-creative level design to help the human partner give appropriate feedback to the AI partner.

Conclusions

We introduce the problem of co-creative level design via machine learning. This represents a new domain of research for Procedural Level Generation via Machine Learning (PLGML). In a user study and two comparative evaluations we demonstrate evidence towards the claims that existing PLGML methods are insufficient to address co-creation, and that co-creative AI level designers must train on datasets or approximated datasets of co-creative level design.

Acknowledgements

This material is based upon work supported by the National Science Foundation under Grant No. IIS-1525967. This work was also supported in part by a 2018 Unity Graduate Fellowship.

References

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; et al. 2016. Tensorflow: A system for large-scale machine learning. In OSDI, volume 16, 265–283.

Baldwin, A.; Dahlskog, S.; Font, J. M.; and Holmberg, J. 2017. Mixed-initiative procedural generation of dungeons using game design patterns. In Computational Intelligence and Games (CIG), 2017 IEEE Conference on, 25–32. IEEE.

Deterding, C. S.; Hook, J. D.; Fiebrink, R.; Gow, J.; Akten, M.; Smith, G.; Liapis, A.; and Compton, K. 2017. Mixed-initiative creative interfaces. In CHI EA'17: Proceedings of the 2016 CHI Conference Extended Abstracts on Human Factors in Computing Systems. ACM.

Guzdial, M., and Riedl, M. 2016. Game level generation from gameplay videos. In Twelfth Artificial Intelligence and Interactive Digital Entertainment Conference.

Guzdial, M.; Chen, J.; Chen, S.-Y.; and Riedl, M. O. 2017. A general level design editor for co-creative level design. In Fourth Experimental AI in Games Workshop.

Liapis, A.; Yannakakis, G. N.; and Togelius, J. 2013. Sentient sketchbook: Computer-aided game level authoring. In Proceedings of the ACM Conference on Foundations of Digital Games, 213–220. FDG.

Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; and Riedmiller, M. 2013. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.

Rohanimanesh, K., and Mahadevan, S. 2003. Learning to take concurrent actions. In Advances in Neural Information Processing Systems, 1651–1658.

Shaker, N.; Shaker, M.; and Togelius, J. 2013. Ropossum: An authoring tool for designing, optimizing and solving cut the rope levels. In Proceedings of the Ninth AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment.

Smith, G.; Whitehead, J.; and Mateas, M. 2010. Tanagra: A mixed-initiative level design tool. In Proceedings of the Fifth International Conference on the Foundations of Digital Games, 209–216. ACM.

Snodgrass, S., and Ontañón, S. 2014. Experiments in map generation using markov chains. In FDG.

Snodgrass, S., and Ontañón, S. 2017. Learning to generate video game maps using markov models. IEEE Transactions on Computational Intelligence and AI in Games 9(4):410–422.

Summerville, A., and Mateas, M. 2016. Super mario as a string: Platformer level generation via lstms. In The 1st International Conference of DiGRA and FDG.

Summerville, A.; Snodgrass, S.; Guzdial, M.; Holmgård, C.; Hoover, A. K.; Isaksen, A.; Nealen, A.; and Togelius, J. 2017. Procedural content generation via machine learning (pcgml). arXiv preprint arXiv:1702.00539.

Summerville, A. 2018. Learning from Games for Generative Purposes. Ph.D. Dissertation, UC Santa Cruz.

Yannakakis, G. N.; Liapis, A.; and Alexopoulos, C. 2014. Mixed-initiative co-creativity. In Proceedings of the 9th Conference on the Foundations of Digital Games. FDG.

Zhu, J.; Liapis, A.; Risi, S.; Bidarra, R.; and Youngblood, G. M. 2018. Explainable ai for designers: A human-centered perspective on mixed-initiative co-creation. Computational Intelligence in Games.