                   GenerationMania: Learning to Semantically Choreograph

                                          Zhiyu Lin, Kyle Xiao and Mark Riedl
                                                  Georgia Institute of Technology
                                                  Atlanta, Georgia, United States
                                      {zhiyulin,kylepxiao}@gatech.edu, riedl@cc.gatech.edu




                              Abstract

Beatmania is a rhythm action game where players play the role of a DJ who performs music by pressing specific controller buttons to mix "keysounds" (audio samples) at the correct time, unlike other rhythm action games such as Dance Dance Revolution. It has an active amateur chart (game stage) creation community, though chart authoring is considered a difficult and time-consuming task. We present a deep neural network based process for automatically generating Beatmania charts for arbitrary pieces of music. Given a raw audio track of a song, we identify notes according to instrument, and use a neural network to classify each note as playable or non-playable. The final chart is produced by mapping playable notes to controls. We achieve an F1-score on the core task of Sample Selection that significantly beats LSTM baselines.

                            Introduction

Rhythm action games such as Dance Dance Revolution, Guitar Hero, and Beatmania challenge players to press keys or make dance moves in response to audio playback. The set of actions timed to the music is a chart and is presented to the player as the music plays. Charts are typically hand-crafted, which limits the songs available to those that have accompanying charts. Learning to choreograph (Donahue, Lipton, and McAuley 2017) is the problem of automatically generating a chart to accompany an a priori unknown piece of music.

Beatmania IIDX (BMIIDX) is a rhythm action game, similar to Dance Dance Revolution, with an active community of homebrew chart choreographers (Chan 2004). Unlike Dance Dance Revolution, players play the role of a DJ and must recreate a song by mixing audio samples, pressing controller buttons as directed by on-screen charts. In BMIIDX, some notes from some instruments are played automatically to create a complete audio experience. That is, there are "playable" and "non-playable" notes in each song. A playable object is one that appears visually on the chart and is available for players to perform, with a one-to-one correspondence to an audio sample; a non-playable object, on the other hand, is one that is automatically played as part of the background music. In order to get a high score as well as reconstruct the original music, a player needs to press the correct button at the correct time for playable objects, and to not press any button when not instructed. The controller used in this game series is also unique: it features both 7 buttons and a "turntable" control which the player scratches instead of presses.

A fundamental difference between BMIIDX and many other rhythm action games like DDR is that BMIIDX is a game with keysounds, which means every object in the chart has an audio sample counterpart that plays if and only if the corresponding action is executed. This even includes non-playable objects; their actions are automatically executed. For comparison, Guitar Hero (Miller 2009) is another keysound based rhythm action game where each note in the game represents a guitar maneuver. In BMIIDX, however, each note can represent an audio sample from a different instrument.

These differences in the underlying mechanics yield a unique paradigm for creating BMIIDX charts. Requiring a clear binding between objects and instrument placement based on an underlying score means BMIIDX charts cannot be overmapped. Overmapping, which happens frequently in DDR, describes the situation where patterns of actions are unrelated to the instruments being played, or occur when no note is being played by any instrument at that moment. That is, the creation of BMIIDX charts is strictly constrained by the semantic information provided by the underlying music. We refer to this challenge as "Learning to Semantically Choreograph" (LtSC).

Due to the strict relationship between chart and music, as well as other differences such as charts with several simultaneous actions, the prior approach used to choreograph charts for Dance Dance Revolution (Donahue, Lipton, and McAuley 2017) cannot be used to generate BMIIDX charts.

We approach the challenge of learning to semantically choreograph charts for BMIIDX as a four-part process. (1) We train a neural network to identify the instruments used in audio sample files and the timing of each note played by each instrument. (2) We automatically label the difficulty of charts in our training set, which we find improves chart generation accuracy. (3) We train a supervised neural network that translates a musical context into actions for each time slice in the chart. Unlike Dance Dance Convolution (Donahue, Lipton, and McAuley 2017), which used an LSTM, we find a feed forward model works well for learning to semantically choreograph when provided a context containing the instrument class of each sample, the intended difficulty label of each sample, the beat alignment, and a summary of prior instrument-to-action mappings.¹ (4) Notes predicted to be playable by the network are mapped to controls and the final chart is constructed. In addition, we introduce the BOF2011 dataset for Beatmania IIDX chart generation.

   ¹ A fixed window feed-forward network can often outperform a recurrent network when short-term dependencies have more impact than long-term ones (Miller and Hardt 2018).

Figure 1: A visualization of a Beatmania IIDX homebrew chart, Poppin' Shower. Notes in this screenshot are labeled with their author-created filenames. Note that only objects in the columns starting with A are playable objects that are actually visible to players; the others are non-playable objects used as background.


                   Background and Related Work

Procedural Content Generation (PCG) is defined as "the creation of game content through algorithmic means". Machine learning approaches treat content generation as (a) learning a generative model and (b) sampling from the model at creation time (Summerville et al. 2017; Guzdial and Riedl 2016; Summerville and Mateas 2015; Hoover, Togelius, and Yannakakis 2015; Summerville and Mateas 2016).

Beatmania IIDX

The homebrew community of BMIIDX is arguably one of the oldest and most mature of its kind (Chan 2004), with multiple emulators, an open format (Be-Music Source², BMS) and peer-reviewed charts published in semi-yearly proceedings. Despite the community striving to provide the highest quality charts, the process of creating such a chart is considered a heavy workload, and due to the strict semantic bindings, usually the author of the music or a veteran chart author has to participate in the creation of the chart. Many aspiring amateur content creators start by building charts for rhythm action games without keysounds (i.e., Dance Dance Revolution charting is considered by the community to be easier). Furthermore, there is a strong demand for customized charts: players have different skill levels and different expectations of the music-to-chart translation, and such charts are not always available to them.

   ² https://en.wikipedia.org/wiki/Be-Music_Source

Figure 1 shows an example of a BMIIDX homebrew chart. The objects in the "A" columns are playable objects with keysounds.

Rhythm Action Game Chart Choreography

There is a handful of research efforts in chart choreography for rhythm action games, including rule-based generation (OKeeffe 2003; Smith et al. 2009) and genetic algorithms using hand-crafted fitness functions (Nogaj 2005). Dance Dance Convolution is the first deep neural network based approach to generating DDR charts (Donahue, Lipton, and McAuley 2017). Donahue et al. refer to the problem of learning chart elements from data as Learning to Choreograph. Alemi et al. (2017) suggest that this kind of approach can reach real-time performance if properly tuned.

Dance Dance Convolution uses a two-stage approach. Onset detection is a signal analysis process that determines the salient points in an audio sample (drum beats, melody notes, etc.) where steps should be inserted into a chart. Step selection uses a long short-term memory (LSTM) neural network (Hochreiter and Schmidhuber 1997) to learn to map onsets to specific chart elements. BMIIDX chart generation differs from DDR in that the primary challenge is determining whether each note for each instrument should be playable as a stage object or non-playable (i.e., automatically played for the player).

                               Data

We compiled a dataset of songs and charts from the "BMS Of Fighters 2011" community-driven chart creation initiative. During this initiative, authors created original music and charts from scratch. The dataset thus contains a wide variety of music and charts composed by various groups of people. Although authors were not required to create a defined spread of different charts for a single piece of music for the event, they frequently built 3 to 4 charts for each song. The dataset, which we refer to as "BOF2011", consists of 1,454 charts for 366 songs. Out of 4.3M total objects, 28.7%, or 1.24M, are playable. Table 1 summarizes the dataset.
            Table 1: BOF2011 dataset summary.

            # Songs                    366
            # Charts                   1,454
            # Charts per song          3.97
            # Unique audio samples     171,808
            # Playable objects         1,242,394
            # Total objects            4,320,683
            Playable object %          28.7
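For concreteness in the sketches that accompany the following sections, we assume a minimal in-memory representation of a parsed chart: a time-ordered list of keysound objects with the fields below. The class and field names are our own illustration and are not part of the BMS format or of any released GenerationMania code.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class KeysoundObject:
        """One object in a chart: an audio sample scheduled at a point in time."""
        time: float          # position in seconds from the start of the chart
        beat: float          # the same position expressed in beats
        sample_id: str       # which audio sample file this object triggers
        instrument: int      # instrument class index (0..26) from sample classification
        playable: bool       # True if the player must perform it, False if auto-played
        control: int = -1    # assigned control (0..7) for playable objects, -1 otherwise

    # A chart is simply a time-ordered list of such objects.
    Chart = List[KeysoundObject]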
                                                                 mation. That is, a chart is a set of sample-time pair contains
                                                                 a pointer to the file system where an audio file for the sam-
   We find that modeling the difficulty of charts plays an       ple resides. Unfortunately, in the BMS file format there is
important role in learning to semantically choreograph, an       no standard for how audio samples are organized or labeled.
observation also made by (Donahue, Lipton, and McAuley           However, many authors do name their audio sample files ac-
2017). Many of the charts in our dataset are easier ones, in     cording to common instrument names (e.g., “drums.ogg”).
which non-playable objects dominate. Furthermore, a vast         The goal of sample classification is to label each sample ac-
majority of samples are repeatedly used, such as drum sam-       cording to the instrument based on its waveform. The pre-
ples placed at nearly every full beat throughout a chart, re-    dicted labels will be used to create one-hot encodings for
sulting in only 171k unique audio samples in our dataset.        each sample for the sample selection stage on the pipeline.
The ratio of playable objects to the total objects is not the       We construct a training set by gathering audio samples to-
only factor that determines the difficulty of the chart. Per-    gether with similar instrument names according to a dictio-
ceived difficulty of charts can also be influenced by:           nary and use the most general instrument name as the super-
                                                                 vision label. We use the 27 most common categories for la-
• Added group of notes representing more rhythmic ele-           beling. To ensure that we don’t overfit our classifier we train
   ments;                                                        on an alternate dataset, “BMS of Fighters Ultimate” (BOFU)
• Special placement of a group of notes that requires spe-       that does not share any music or charts with BOF2011, with
   cific techniques to play accurately;                          a partially labeled dataset having a total of 60,714 labeled
                                                                 samples. Not every audio sample have a classifiable name,
• Strain caused by a long stream of high density patterns;
                                                                 which we count as unlabeled samples.
• ”Visual effects”, perspective changes causing suddenly            We decompose the audio samples to their “audio finger-
   accelerating/stopping notes;                                  prints,” which consists of a vectorized spectrogram repre-
• A combination of the above in a small time window.             sentation of the audio using the normalized wave amplitudes
                                                                 over time. We also fix the bit rate of the sound to 16k, so that
   Chart authors label their charts according to difficulty      the representation has a consistent temporal resolution.
level. However, such labels are based entirely on the author’s      For our model, we followed the method described in
perception of the chart difficulty. For example, it is common    (Sainath and Parada 2015). We feed the fingerprints through
for some expert authors’ charts to be labeled as “normal”        two 2D convolutional layers, each with a bias. Each of the
levels despite being more difficult than others’ “difficult”     layers is followed by a Rectified Linear Unit (ReLU) acti-
levels. Although the original Beatmania IIDX labels used         vation function with the next abstracted layer generated via
the monikers “normal”, “hyper”, and “another”, authors can       max-pooling. Finally, we feed the results into another fully
assign any label to describe the difficulty of the chart.        connected layer, which then outputs a one-hot encoding of
                                                                 the predicted category. We use a gradient descent optimizer
                         Methods                                 with 50% dropout rate and a step size of 0.01.
Our chart generation system for BeatMania IIDX, which we            After training on the BOFU dataset, we achieved an 84%
call GenerationMania, uses a pipeline consisting of the fol-     accuracy based on a 10% testing set.
lowing tasks:
1. Sample Classification — Identifying the instrument used       Challenge Modeling
   in audio samples;                                             We compute the difficulty level of each object in each chart
2. Challenge Modeling — Establish structure of each part in      in the training set. We use a rule-based technique for assess-
   the chart;                                                    ing the difficulty of each object in a chart. This technique
                                                                 is adapted from the Osu! rhythm action game.3 The diffi-
3. Sample Selection — Classifying audio samples into             culty for each object of a given chart is weighted sum of
   playable and non-playable;                                    the individual strain and the overall strain. Individual strain
4. Note Placement — Assigning controls to each playable          calculates as the interval between keysounds mapped to the
   keysound.                                                     same control on an exponetiated scale such that short inter-
                                                                 vals have exponetially higher strain values than long inter-
We realize that the task of generating a music score from
raw audio (which is essentially Audio Segmentation) can             3
                                                                    https://github.com/ppy/osu/blob/
be well decoupled from generating BMIIDX charts from a           master/osu.Game.Rulesets.Mania/Difficulty/
music score. Based on the assumption that the music score        ManiaDifficultyCalculator.cs
                                            Figure 2: The GenerationMania pipeline.
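The paper does not specify an implementation framework; below is a minimal PyTorch sketch of the classifier as described (two biased 2D convolutions, each followed by ReLU and max-pooling, then a fully connected output over the 27 instrument categories, trained with gradient descent at step size 0.01 and 50% dropout). The kernel sizes, channel counts and loss function are our own assumptions.

    import torch
    import torch.nn as nn

    class SampleClassifier(nn.Module):
        """Instrument classifier over audio fingerprints (spectrograms),
        following the small-footprint CNN of Sainath and Parada (2015)."""
        def __init__(self, n_classes=27):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, bias=True),   # assumed channel count and kernel size
                nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(16, 32, kernel_size=3, bias=True),
                nn.ReLU(),
                nn.MaxPool2d(2),
            )
            self.dropout = nn.Dropout(p=0.5)                   # 50% dropout, as in the paper
            self.classify = nn.LazyLinear(n_classes)           # fully connected output layer

        def forward(self, spectrogram):                        # (batch, 1, freq, time)
            x = self.features(spectrogram)
            x = self.dropout(torch.flatten(x, start_dim=1))
            return self.classify(x)                            # per-class scores

    model = SampleClassifier()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # gradient descent, step size 0.01
    loss_fn = nn.CrossEntropyLoss()                            # assumed; the paper only states a one-hot output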


Challenge Modeling

We compute the difficulty level of each object in each chart in the training set, using a rule-based technique adapted from the Osu! rhythm action game.³ The difficulty for each object of a given chart is a weighted sum of its individual strain and the overall strain. Individual strain is calculated from the interval between keysounds mapped to the same control on an exponentiated scale, such that short intervals have exponentially higher strain values than long intervals. Overall strain is calculated as the number of controls that must be activated simultaneously. In addition, challenging patterns have prolonged effects on both strain values for objects directly after them. For each 0.4 second window, the maximum strain value becomes the difficulty of that window, and every object in the window receives that difficulty. An overall difficulty for the chart is generated as a weighted sum of the highest local maximum strain values throughout the chart.

   ³ https://github.com/ppy/osu/blob/master/osu.Game.Rulesets.Mania/Difficulty/ManiaDifficultyCalculator.cs
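A rough sketch of the per-object difficulty computation described above, over the KeysoundObject representation introduced earlier. The decay constant, the weights, and the exponential form are illustrative placeholders, and the carry-over of strain onto subsequent objects is omitted; the authors adapt the exact formula from the osu!mania difficulty calculator linked in the footnote.

    import math
    from collections import defaultdict

    def object_difficulties(chart, decay=0.5, w_ind=1.0, w_all=0.5, window=0.4):
        """Weighted sum of individual and overall strain per object; the maximum
        strain over each 0.4 second window becomes that window's difficulty."""
        simultaneous = defaultdict(int)            # how many objects share each timestamp
        for obj in chart:
            simultaneous[obj.time] += 1

        last_hit = {}                              # last time each control was used
        strains = []
        for obj in sorted(chart, key=lambda o: o.time):
            gap = obj.time - last_hit.get(obj.control, obj.time - 10.0)
            individual = math.exp(-decay * gap)    # shorter gaps -> exponentially higher strain
            overall = simultaneous[obj.time]       # controls that must be activated together
            last_hit[obj.control] = obj.time
            strains.append((obj, w_ind * individual + w_all * overall))

        window_max = defaultdict(float)            # maximum strain per 0.4 second window
        for obj, s in strains:
            key = int(obj.time // window)
            window_max[key] = max(window_max[key], s)
        return [(obj, window_max[int(obj.time // window)]) for obj, _ in strains]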
Sample Selection

Sample selection is the task of determining which objects in the music should be playable and which should be non-playable. The input features for each object are as follows:
• Difficulty: a 1 × 1 value of difficulty taken from the difficulty curve;
• Audio Features: a 1 × 27 one-hot representation of the instrument class that the audio sample belongs to;
• Beat Alignment: a 1 × 1 value ranging from 0 to 15 representing which 16th section of a beat this note resides in, with 0 representing a note exactly on a beat. In most cases this corresponds to a per-64th-note granularity in a chart with a 4/4 time signature, which is around 25 milliseconds in a chart at 150 BPM (beats per minute);
• Summary: a 1 × 270 vector summarizing the playability of different samples prior to the current object (see the sketch after this list). For each instrument class, a 1 × 2 vector gives the probability that that instrument was playable or non-playable in a given window of time, computed as the number of times it was playable/non-playable divided by the number of appearances. This gives a 1 × 54 vector per time window. Five different time windows are provided, covering 2, 4, 8, 16, and 32 beats.
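A sketch of the beat alignment and summary features for a single object, following the description in the list above. The windowing is over the objects that precede the current one; the function names and the handling of empty windows are our own assumptions.

    import numpy as np

    N_CLASSES = 27
    WINDOWS = (2, 4, 8, 16, 32)     # look-back windows, in beats

    def beat_alignment(obj):
        """Which 16th of a beat the object falls on (0..15); 0 means exactly on the beat."""
        return int(round((obj.beat % 1.0) * 16)) % 16

    def summary_vector(chart, index):
        """1 x 270 summary: for each instrument class and each look-back window,
        the fraction of prior appearances that were playable and non-playable."""
        current = chart[index]
        parts = []
        for w in WINDOWS:
            playable = np.zeros(N_CLASSES)
            total = np.zeros(N_CLASSES)
            for obj in chart[:index]:
                if current.beat - w <= obj.beat < current.beat:
                    total[obj.instrument] += 1
                    playable[obj.instrument] += float(obj.playable)
            p = np.divide(playable, total, out=np.zeros(N_CLASSES), where=total > 0)
            q = np.divide(total - playable, total, out=np.zeros(N_CLASSES), where=total > 0)
            parts.append(np.stack([p, q], axis=1).ravel())     # 1 x 54 for this window
        return np.concatenate(parts)                            # 1 x 270 overall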
Summarization is a technique popularized by WaveNet (Oord et al. 2016) to factor prior information from a fixed window at different time scales into the current prediction. At both training time and inference time, the summary information is derived from the training data, except for the self-summary baseline, for which the summary information is based on previous generation results.

Our sample prediction model is a feed-forward network consisting of 4 fully connected ReLU layers of dimensions 64, 32, 16, and 2. To perform sample selection, we pick the output node with the highest activation, corresponding to playable or non-playable. Due to the class imbalance (most objects are non-playable), we found that a weighted mean squared error loss function helps improve training performance.

At training time the difficulty curve is derived from the dataset, so that the sample selection network can be trained to reconstruct the input data. At generation time, the difficulty curve can be provided by the user.
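A minimal PyTorch sketch of the sample prediction model as described above (a 299-dimensional input of 1 difficulty value, 27 instrument features, 1 beat alignment value and the 270-dimensional summary, followed by fully connected layers of sizes 64, 32, 16 and 2), together with the weighted mean squared error detailed in the Experiments section. The target encoding and the per-example weighting scheme are assumptions for illustration.

    import torch
    import torch.nn as nn

    class SampleSelector(nn.Module):
        """Feed-forward playable / non-playable classifier over a 299-dimensional
        object context (1 difficulty + 27 instrument one-hot + 1 beat alignment + 270 summary)."""
        def __init__(self, in_dim=299):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(in_dim, 64), nn.ReLU(),
                nn.Linear(64, 32), nn.ReLU(),
                nn.Linear(32, 16), nn.ReLU(),
                nn.Linear(16, 2),                   # activations for (playable, non-playable)
            )

        def forward(self, x):
            return self.net(x)

    def weighted_mse(pred, target, is_playable, w_play=1.0, w_nonplay=0.2):
        """Mean squared error where playable objects weigh 1.0 and non-playable 0.2,
        countering the class imbalance (most objects are non-playable)."""
        per_example = ((pred - target) ** 2).mean(dim=1)
        weights = torch.where(is_playable,
                              torch.full_like(per_example, w_play),
                              torch.full_like(per_example, w_nonplay))
        return (weights * per_example).mean()

At generation time, the playable/non-playable decision is simply the arg-max over the two output activations, e.g. model(features).argmax(dim=1).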
Note Placement

For each object at each timestep that has been classified as playable, we map it to one of 8 controls. Any process that does not map two objects to the same control at the same time is sufficient to make a chart playable, thus note placement is not a significant contribution of this paper. We created a simple module that uses the same framework as Sample Selection but is trained to predict note placements as labels instead of playability. A post-processing step checks and rearranges the chart so that we never map two objects that occur within too short an interval to the same control.

                           Experiments

In this section, we evaluate variations of our sample selection model against a number of baselines. We use a supervised evaluation metric: by embedding the challenge model extracted from the ground truth, we measure how similar the generated chart is to the original. We establish two guidelines for a good generation model: it should not only predict playables when they should be presented to players (high recall), but also non-playables when they should remain in the background (high precision).

We applied an 80%/10%/10% split into training, validation and testing data. Since charts for the same piece of music share similar traits, we ensured that such charts do not appear in both the training split and the testing split. We trained all models on the training split and report all results on the testing split.

We use the following baselines:
• Random: classifies a given object as playable with a probability of 0.3, chosen to give the best result;
• All Playable: classifies all objects as playable;
• LSTM baseline: a sequence to sequence model (Sutskever, Vinyals, and Le 2014) with forget gates and a hidden and output layer size of 2. The highest activated output is selected as the prediction.

The LSTM baseline was chosen because Dance Dance Convolution (Donahue, Lipton, and McAuley 2017) used an LSTM network. However, it is impossible to directly compare the approach used in DDC to our feed forward model because the task and the inputs for Dance Dance Revolution (a non-keysound based game) and Beatmania IIDX (a keysound based game) are different enough that substantial changes to the DDC algorithm would be required.

Our feed-forward sample selection model and the LSTM baseline are configured with different combinations of input features drawn from: audio features (instrument labels), difficulty curve, beat alignment, and summary vectors. We refer to the models without summary inputs as "free generation" models. There is one special case of free generation model in which we allow the model to self-summarize; that is, we use summary data based on what has been generated earlier in the chart.

For all neural network modules, we learn parameters by minimizing a weighted mean squared error (MSE) with a weight of 1 on playables and 0.2 on non-playables. We used a mini-batch size of 128 for the feed forward model; due to the need to process very long sequences, the LSTM model is trained one sequence at a time and is run in CPU mode. We stop training when the loss converges. The feed forward model satisfies this criterion in around 6 hours in GPU mode, while the LSTM model takes far longer at around 100 hours, on a single machine with an Intel i7-5820K CPU and an NVIDIA GeForce 1080 GPU.
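A small sketch of the song-level split described above, which keeps all charts of the same song in the same partition. The song_id field is a hypothetical attribute of a parsed chart record, not part of the BMS format.

    import random

    def split_by_song(charts, ratios=(0.8, 0.1, 0.1), seed=0):
        """80/10/10 split at the song level so that charts of one song never
        end up in both the training and the testing partitions."""
        songs = sorted({c.song_id for c in charts})
        random.Random(seed).shuffle(songs)
        n_train = int(ratios[0] * len(songs))
        n_val = int(ratios[1] * len(songs))
        train_songs = set(songs[:n_train])
        val_songs = set(songs[n_train:n_train + n_val])
        train = [c for c in charts if c.song_id in train_songs]
        val = [c for c in charts if c.song_id in val_songs]
        test = [c for c in charts if c.song_id not in train_songs and c.song_id not in val_songs]
        return train, val, test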
                                                                chart difficulty on our best performing model, the feed for-
                             Results

Following the two guidelines we set out for evaluating the models, we report the performance of the Sample Selection models using precision, recall, and F1-score (the harmonic mean of precision and recall), since both precision and recall are important. We calculate each metric per chart, then report the average of these values.

The results are shown in Table 2. Without summaries, the LSTM baseline performs the best, with a precision approaching that of models provided with summary data. LSTMs make use of history, while free generation feed-forward models can only make decisions based on the current time step. With the summary embedded, all models receive a significant boost in their performance. Notably, the feed-forward model (ours) with summary data has a higher recall, which means it produces much fewer false negatives. The LSTM baseline also improved with summary data but is hampered by low recall. Although LSTMs are usually used for generating long sequences of data, we believe the lack of data and the high variance within each sequence cause it to perform worse on our task.

The information contained in the summary affects performance. We tried several different ways to create summaries, which gave us different levels of performance boost. The summary representation presented here provided the best boost to both the feed forward model and the LSTM. Additionally, we considered an auto-encoder structure for the LSTM model, which tries to auto-summarize the chart. We also considered multi-layer LSTM structures as in (Donahue, Lipton, and McAuley 2017). However, these models either overfit quite quickly or have unrealistic computational requirements.

In our feed-forward model with summary, per-object difficulty information accounts for a 7.7% improvement in F1-score. As with (Donahue, Lipton, and McAuley 2017), we also observe that all generators vary in performance on charts of different difficulty. We analyzed the effect of chart difficulty on our best performing model, the feed forward model with summary. We sorted all charts in the testing set by their difficulty, then examined single-chart performance. The result is summarized in Figure 3. We observed a larger variance of performance on easier charts, and stable performance on harder charts.

Figure 3: The performance of the feed forward model (with summary) with respect to the difficulty of the ground truth chart.
         Table 2: Results for playable classification experiments, presented as mean and standard deviation.

         Model                                                          F1-score           Precision          Recall
         Reference Baselines
         Random                                                         0.291 ± 0.089      0.335 ± 0.200      0.299 ± 0.020
         All Playable                                                   0.472 ± 0.207      0.335 ± 0.199      1.000 ± 0.000
         Free Generation Models
         FF Audio Features + Difficulty Curve + Beats                   0.253 ± 0.143      0.523 ± 0.266      0.179 ± 0.113
         FF Audio Features + Difficulty Curve + Beats + Self Summary    0.368 ± 0.198      0.422 ± 0.213      0.392 ± 0.258
         LSTM + Audio Features + Difficulty Curve + Beats               0.424 ± 0.154      0.767 ± 0.176      0.353 ± 0.248
         Generation with Summary
         FF Audio Features + Beats + Summary                            0.621 ± 0.206      0.760 ± 0.110      0.568 ± 0.254
         FF Audio Features + Difficulty Curve + Beats + Summary         0.698 ± 0.162      0.778 ± 0.112      0.649 ± 0.197
         LSTM + Audio Features + Difficulty Curve + Beats + Summary     0.499 ± 0.225      0.805 ± 0.121      0.405 ± 0.237
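The numbers in Table 2 are per-chart metrics averaged over the testing split; a sketch of that computation, treating "playable" as the positive class:

    def chart_metrics(predicted, actual):
        """Precision, recall and F1 for one chart; inputs are per-object booleans
        (True = playable), with playable treated as the positive class."""
        tp = sum(p and a for p, a in zip(predicted, actual))
        fp = sum(p and not a for p, a in zip(predicted, actual))
        fn = sum(a and not p for p, a in zip(predicted, actual))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1

    def averaged_metrics(per_chart):
        """Mean of per-chart (precision, recall, F1) tuples, as reported in Table 2."""
        n = len(per_chart)
        return tuple(sum(m[i] for m in per_chart) / n for i in range(3))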


                  Discussion and Future Work

A side effect of how beat-phase information is organized in our specific task is that we were unable to include ∆-beat information in our models. ∆-beat is a feature that measures the number of beats since the previous step and until the next step; it was used in DDC (Donahue, Lipton, and McAuley 2017). However, a naive approach of "finding the next note" will not work for our task. This is mainly because (1) several semantically unrelated notes can be placed at the exact same time, and (2) notes can be placed at very short intervals (such as when representing a glissando). These issues prevent effective ∆-beat detection at the granularity of a single note. Perhaps grouping notes based on their musical semantic relations could be a solution.

Our Challenge Model technique is relatively simplistic and there is room for expansion. The key assumption of this model is that, for a given arrangement of objects, every player perceives the same level of challenge. However, it is possible that players have different skill levels and individual differences. This causes problems when evaluating the challenge level of asymmetric and/or hand-biased patterns, since every control is treated exactly the same. A corollary of this assumption is that "easy" charts should be treated the same as harder charts, which proves to be particularly problematic and may be a cause of poor generation performance on "easy" charts. We observed that, unlike harder charts, many "easy" charts are designed for newcomers to the game and therefore have reduced challenging artifacts and focused notes representing only the melody of the music. This results in a drastically different charting style, which may explain why our Sample Selection classifier performs more poorly on them. Because the Challenge Model was hand-authored using a particular dataset (Osu! stages), its performance on a different dataset may deteriorate. The Challenge Model is also sensitive to parameter tuning. A model-free approach or a player experience based system may help in this scenario.

Aside from that, the Challenge Model and summary can be extracted from charts provided by players to allow for a degree of controllability of the system. Our feed forward model even allows generation on the fly. This makes it possible for our pipeline to be used in tasks such as dynamic challenge adaptation, where the challenge level of the stage changes based on the player's performance and preference (Zook and Riedl 2015), and style transfer, where two charts blend with each other (Johnson, Alahi, and Fei-Fei 2016). Furthermore, a Challenge Model that is human-understandable allows players to easily manipulate it to their will, which in turn may facilitate human participation in this process, allowing Computational Co-creativity applications that would be especially helpful to content creators. We do not yet know whether our system meets players' expectations; we leave all of these as future work.

Figure 4: Around 6 seconds of playable classification results compared to the ground truth human chart for Poppin' Shower. An activation level higher than 0.5 is considered a classification of playable. Blue dots identify correct predictions, yellow dots identify incorrect ones. On this song, we achieved a 0.824 F1-score with our feed forward model.

                           Conclusions

Choreographing rhythm action game stages is a challenging task. BMIIDX adds further challenge on top of it by posing extra semantic constraints, requiring a one-to-one audio-sample-to-playable-object relation. We have established a pipeline for Learning to Semantically Choreograph, provided a dataset for reproducible evaluations, and showed that a feed forward neural network model with challenge modeling and summary information performs well at satisfying these new constraints. We further discussed how users can inject a degree of control over the algorithm by inputting a customized or manually edited difficulty curve and by biasing the summary information.

Learning to semantically choreograph is essential to generating keysound based game charts. However, incorporating semantics may potentially also improve generation for non-keysound based games such as Dance Dance Revolution, where it is possible to overmap actions and still achieve high accuracy according to automated metrics. Aside from solving a challenging creative task, intelligent systems such as GenerationMania can benefit homebrew chart choreography communities by overcoming skill limitations. The ability to control the generative process is an essential part of the adoption of such systems.
                           References

Alemi, O.; Françoise, J.; and Pasquier, P. 2017. GrooveNet: Real-Time Music-Driven Dance Movement Generation using Artificial Neural Networks. networks 8(17):26.
Chan, A. 2004. CPR for the Arcade Culture.
Donahue, C.; Lipton, Z. C.; and McAuley, J. 2017. Dance Dance Convolution. In Proceedings of the 34th International Conference on Machine Learning.
Guzdial, M., and Riedl, M. 2016. Game level generation
from gameplay videos. In Twelfth Artificial Intelligence and
Interactive Digital Entertainment Conference.
Hochreiter, S., and Schmidhuber, J. 1997. Long short-term
memory. Neural computation 9(8):1735–1780.
Hoover, A. K.; Togelius, J.; and Yannakakis, G. N. 2015. Com-
posing video game levels with music metaphors through
functional scaffolding. In First Computational Creativity
and Games Workshop. ACC.
Johnson, J.; Alahi, A.; and Fei-Fei, L. 2016. Percep-
tual losses for real-time style transfer and super-resolution.
In European Conference on Computer Vision, 694–711.
Springer.
Miller, J., and Hardt, M.          2018.     When Recurrent
Models Don’t Need To Be Recurrent. arXiv preprint
arXiv:1805.10369.
Miller, K. 2009. Schizophonic Performance: Guitar Hero,
Rock Band, and Virtual Virtuosity. Journal of the Society
for American Music 3(4):395–429.
Nogaj, A. F. 2005. A genetic algorithm for determining
optimal step patterns in Dance Dance Revolution.
Oord, A. v. d.; Dieleman, S.; Zen, H.; Simonyan, K.;
Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.; and
Kavukcuoglu, K. 2016. WaveNet: A Generative Model for
Raw Audio. In SSW, 125.
OKeeffe, K. 2003. Dancing monkeys. Masters project 1–66.
Sainath, T. N., and Parada, C. 2015. Convolutional neural
networks for small-footprint keyword spotting. In Sixteenth
Annual Conference of the International Speech Communi-
cation Association.
Smith, G.; Treanor, M.; Whitehead, J.; and Mateas, M. 2009.
Rhythm-based Level Generation for 2D Platformers. In Pro-
ceedings of the 4th International Conference on Foundations
of Digital Games, FDG ’09, 175–182. New York, NY, USA:
ACM.
Summerville, A., and Mateas, M. 2015. Sampling Hyrule:
Sampling Probabilistic Machine Learning for Level Genera-
tion. In Conference on Artificial Intelligence and Interactive
Digital Entertainment.
Summerville, A., and Mateas, M. 2016. Super mario as a
string: Platformer level generation via lstms. arXiv preprint
arXiv:1603.00930.
Summerville, A.; Snodgrass, S.; Guzdial, M.; Holmgård, C.;
Hoover, A. K.; Isaksen, A.; Nealen, A.; and Togelius, J.
2017. Procedural Content Generation via Machine Learn-
ing (PCGML). CoRR abs/1702.0.
Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to Sequence Learning with Neural Networks. In Ghahramani, Z.; Welling, M.; Cortes, C.; Lawrence, N. D.; and Weinberger, K. Q., eds., Advances in Neural Information Processing Systems 27. Curran Associates, Inc. 3104–3112.
Zook, A., and Riedl, M. O. 2015. Temporal game challenge tailoring. IEEE Transactions on Computational Intelligence and AI in Games 7(4):336–346.