             Synthesizing Retro Game Screenshot Datasets for Sprite Detection

                                      Chanha Kim, Jaden Kim, Joseph C. Osborn
                                           Formal Analysis of Interactive Media Lab
                                                        Pomona College
                                                       185 East 6th Street
                                                 Claremont, California 91711
                                      {chanha.kim, jaden.kim, joseph.osborn}@pomona.edu



                              Abstract

  Scenes in 2D videogames generally consist of a static terrain and a set of dynamic
  sprites which move around freely. AI systems that aim to understand game rules (for
  design support or automated gameplay) must be able to distinguish moving elements
  from the background. To this end, we re-purposed an object detection model from deep
  learning literature, developing along the way YOLO Artificial Retro-game Data
  Synthesizer, or YARDS, which efficiently produces semi-realistic, retro-game sprite
  detection datasets without manual labeling. Provided with sprites, background images,
  and a set of parameters, the package uses sprite frequency spaces to create synthetic
  gameplay images along with their corresponding labels.

Figure 1: Real (left) vs. Synthetic (right) Screenshot from Super Mario Bros. on NES

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons
License Attribution 4.0 International (CC BY 4.0).


                            Introduction

Many videogames employ a visual language which presents a static terrain (with
background and foreground elements) juxtaposed with dynamic, animated, and freely
moving sprites. Sprites are often game characters or objects of interest: potential
threats, powerups, or the player's character. Knowledge about these sprites (e.g.
their type, location, or speed) is vital for AI systems meant to understand games,
especially in systems such as automated game design learning (Osborn, Summerville, and
Mateas 2017) and learning-based level generation (Guzdial and Riedl 2016; Summerville
et al. 2016a) as well as general videogame playing and automated approaches to
accessibility.

   Besides manual image labeling, current methods for obtaining sprite segmentations
generally involve game-specific image processing or deep instrumentation of game
emulators, as in CHARDA (Summerville, Osborn, and Mateas 2017). These techniques can
be difficult to generalize and may be expensive, slow, or potentially fragile (e.g.
the locations of hardware sprites in memory do not correspond exactly to the positions
of human-legible sprites). In this work, we instead generate synthetic images starting
from readily accessible spritesheets and sprite-free background images. We can then
use the synthetic datasets generated from these resources to train sprite detection
models for their corresponding games, which should generalize better than classical
computer vision techniques or emulator instrumentation.

   While our approach still requires that users collect spritesheets and background
images, it eliminates the need for manually labeling images or writing computer vision
code. We have packaged this generator in a Python package called YOLO Artificial
Retro-game Data Synthesizer (YARDS). YARDS is a command-line tool which, given sprites
and background images, generates synthetic training images that mimic patterns found
in real game screenshots. YARDS pre-formats the synthetic data for integration with
YOLO and can generate 1,000 labeled images in 10 seconds—a task which took the authors
10 long hours!

   In this paper, we present our two key contributions. First, we apply recent
progress made in deep learning and computer vision to aid in producing semi-realistic
datasets suitable for training sprite detection models. Second, we introduce a
software package that can rapidly generate large datasets. In the following sections,
we will discuss our approach for synthesizing retro-game screenshots, demonstrate how
YARDS works, and evaluate several models trained on synthetic data against those
trained on manually labeled images.

Figure 2: Examples of Synthetic Images Generated with YARDS


                           Related Works

The intersection of computer vision and games is a growing area of research that can
benefit both designers and players.
In 2018, Luo et al. demonstrated how transfer learning could improve the task of
extracting player experiences directly from gameplay videos. In the same year, Zhang
et al. introduced the problem of content-based retrieval of game moments and presented
a prototype search engine for retrieving such moments based on user-provided game
screenshots (2018). Our research contributes to this area by applying synthetic data
generation techniques to the task of training sprite detection models and by
introducing a software package that supports end-users in synthesizing their own
datasets.

   In our research, we use YOLO (You Only Look Once) to detect sprites in four retro
games: Super Mario Bros. (NES), Earthbound (SNES), Super Street Fighter II (SNES), and
Super Mario World (SNES). YOLO is a well-known object detection model (Redmon et al.
2015) which has seen several iterations (Redmon and Farhadi 2016; 2018; Bochkovskiy,
Wang, and Liao 2020). The original model's architecture is inspired by GoogLeNet
(Szegedy et al. 2014) and has 24 convolutional layers followed by 2 fully connected
layers (see Fig. 3). This end-to-end model allows for efficient training and detection
of objects in both images and videos. The specific YOLO version that we are using is
Ultralytics' YOLOv5 (Ultralytics 2020), a recent implementation of YOLO in PyTorch.

Figure 3: Adapted diagram of the original YOLO architecture (Koylu, Zhao, and Shao 2019)

   The use of synthetic data to train models is a well-known practice in computer
vision. Motivations for generating synthetic datasets include the high cost of
manually labeling real images (Roig et al. 2020), privacy concerns when using real
user data (Triastcyn and Faltings 2018; Shaked and Rokach 2020), and the shortage of
real training examples for rare cases (Beery et al. 2019). Other arguments for using
synthetic data include the lack of extensive datasets in niche application domains
(Wong et al. 2019) and the under-representation of the full target distribution in
small datasets (Lateh et al. 2017). For such reasons, researchers have suggested
numerous approaches to generating synthetic data over the past decade (Nikolenko 2019;
Seib, Lange, and Wirtz 2020).

   One common approach is to first generate a synthetic image and then stylize the
image to be more realistic (Dwibedi, Misra, and Hebert 2017; Georgakis et al. 2017;
Wong et al. 2019). For example, Wang et al. 2019 generated photorealistic synthetic
images using a virtual 3D object-environment reconstruction method and style transfer
techniques. Hinterstoisser et al. 2019 used 3D CAD models and pose curricula to
generate foreground-background compositions, then made those compositions
photorealistic via rendering techniques.

   A significant issue arising from synthetic datasets is the synthetic-to-real domain
gap (Tremblay et al. 2018; Yun et al. 2019b). This gap occurs when the synthetic
images used to train a model are not representative of the target image distribution.
In terms of model performance, research shows that models trained with both real and
synthetic images achieve the best performance, followed by models trained with purely
real images (Rozantsev, Lepetit, and Fua 2015; Dwibedi, Misra, and Hebert 2017;
Georgakis et al. 2017; Rajpura, Bojinov, and Hegde 2017; Yun et al. 2019a). These
studies also demonstrate that training with purely synthetic images seems to detract
from model performance. However, comparable performance is achievable in image
segmentation tasks (Di Cicco et al. 2017), and fine-tuning models trained on synthetic
data with additional real images can yield better performance than mixed training
(Nowruzi et al. 2019).

   Our synthetic data generation approach is most similar to the one presented by
Dwibedi et al. 2017. Since our games of interest use a pixelated art style, we can
skip the step of increasing the realism of the generated images; it is enough to
simply paste the sprites onto the backgrounds in a reasonable distribution. We
therefore focus on developing an efficient technique for pasting sprites onto
background images based on sprite frequency distributions observed in real gameplay
images.
      Synthetic Data Generation Approach
Our approach involves two steps: first, collect the sprites
and background images for a given game; and second, paste
sprites onto background images using their frequency dis-
tributions. Both sprites and background images are easily
obtainable by extracting the data from an emulator, borrow-
ing from archives compiled by fans, or utilizing approaches
proposed by researchers like Summerville et al. 2016b. Ad-
ditional methods for obtaining sprites and background im-
ages include scripting some game-specific image extraction
code or providing the assets directly if the user is the one
developing the game.
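To make the two-step pipeline concrete, the sketch below pastes a single RGBA sprite
onto a sprite-free background with Pillow and writes the matching YOLO-format label
line. This is only an illustration of the idea, not YARDS's internal code; the file
names, the sprite, and the helper name paste_and_label are hypothetical.

    # Illustrative sketch (not the actual YARDS implementation): paste an RGBA
    # sprite onto a sprite-free background and emit one YOLO-format label line
    # ("class x_center y_center width height", all normalized to [0, 1]).
    import random
    from PIL import Image

    def paste_and_label(background_path, sprite_path, class_id, out_image, out_label):
        background = Image.open(background_path).convert("RGBA")
        sprite = Image.open(sprite_path).convert("RGBA")
        bg_w, bg_h = background.size
        sp_w, sp_h = sprite.size

        # Keep the sprite fully on screen here; clipping at the screen edges is
        # handled separately (see "Edge Handling with Transparency Quadrants").
        x = random.randint(0, bg_w - sp_w)
        y = random.randint(0, bg_h - sp_h)

        # The sprite's own alpha channel serves as the paste mask.
        background.paste(sprite, (x, y), sprite)
        background.convert("RGB").save(out_image)

        with open(out_label, "w") as f:
            f.write(f"{class_id} {(x + sp_w / 2) / bg_w:.6f} {(y + sp_h / 2) / bg_h:.6f} "
                    f"{sp_w / bg_w:.6f} {sp_h / bg_h:.6f}\n")

    paste_and_label("background.png", "enemy.png", 1, "synthetic_0000.png", "synthetic_0000.txt")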
   One reason why our approach is so effective in the videogame domain is because we
are working with much smaller image spaces (sets of possible images) than the
real-world image spaces typically used in computer vision tasks. While the immense
complexity of real-world images can depend on virtually anything, from lighting
conditions to object textures, our focus on low-resolution, retro-game screenshots
allows us to synthesize realistic screenshots just by pasting sprites onto background
images.

Sprite Frequency Spaces
Even though we are working with low-resolution images,
synthesizing images that roughly mimic those seen in real
gameplay is a nontrivial task. We generate synthetic images
by providing the sprite frequency space for each class of
sprites to be detected. We define the sprite frequency space
for a given class as the discrete probability distribution over
the frequency of appearances for that class on a given gameplay image. That is, it is
a function mapping the numbers of sprites in a class to the probabilities that those
numbers of sprites actually appear on a screenshot.

   These sprite frequency spaces can either be defined by the user (as in our reported
results) or approximated by feeding pre-labeled images into YARDS. At runtime, YARDS
will approximate the sprite frequency spaces by counting the frequencies of the
desired sprite classes in the image labels. Using these sprite frequency spaces, we
can determine the number of times each sprite class should appear on each generated
image. For example, in Super Mario Bros. (NES), there should almost always be one
player on the screen, while there may be any number of enemy sprites from zero to 16,
each number appearing at a different rate.

   For each output image, we choose the number of times each class of sprites should
appear by sampling the class's corresponding sprite frequency space. We thereby
incorporate the given or estimated frequencies with which our target sprite classes
appear in authentic gameplay images. This allows the YOLO model to train on data that
roughly mimics our target distribution and avoid the previously-discussed domain gap
issues.
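As an illustration of how a sprite frequency space can drive generation, the sketch
below samples a per-image sprite count for each class from a user-defined
distribution. The class names, the example probabilities, and the helper name
sample_sprite_counts are assumptions for illustration, not YARDS's exact API.

    # Illustrative sketch: decide how many sprites of each class to paste onto one
    # synthetic screenshot. Index i of each list is the probability that exactly i
    # sprites of that class appear on a screenshot.
    import random

    frequency_spaces = {
        "player": [0.05, 0.95],                    # 0 sprites 5% of the time, 1 sprite 95%
        "enemy":  [0.10, 0.30, 0.30, 0.20, 0.10],  # 0 to 4 enemies
    }

    def sample_sprite_counts(spaces):
        counts = {}
        for sprite_class, probabilities in spaces.items():
            # Draw one count per class, weighted by that class's distribution.
            counts[sprite_class] = random.choices(
                range(len(probabilities)), weights=probabilities, k=1)[0]
        return counts

    print(sample_sprite_counts(frequency_spaces))   # e.g. {'player': 1, 'enemy': 2}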
Edge Handling with Transparency Quadrants

Figure 4: Example of sprite clipping on worst-case scenario of L-shaped sprite in
bottom left corner of screen. Midpoints are where sprite will be cropped.

Figure 5: Example of top-left, top-right, bottom-left, and bottom-right quadrants in a
clipped sprite

Given that sprites often have transparent pixels, we must ensure that sprites are at
least partially visible in screenshots no matter their shape. For example, an L-shaped
sprite placed in the lower left corner of the screen might have no visible pixels
(i.e., opaque pixels) in the screenshot (see Fig. 4). Training a model would be very
difficult if our training set asserts that background pixels are in fact part of a
character. To handle these edge cases, we propose an approach that relies on
transparency quadrants for determining whether enough of the sprite is visible in the
screenshot.

   For a given sprite, we first determine its transparency quadrants by using the
locations of the first and last visible (non-transparent) pixels along each of its
border axes. The upper-left quadrant (Fig. 5) is determined by the first visible pixel
in the first row of visible pixels and the first visible pixel in the first column of
visible pixels. Corresponding horizontal and vertical lines are drawn from each pixel,
and their intersection defines the quadrant. Similarly, the bottom-right quadrant
(Fig. 5) is determined by the last visible pixel in the last row of visible pixels and
the last visible pixel in the last column of visible pixels. The upper-right and
bottom-left quadrants are formed analogously. In general, the quadrant is defined by
the intersection of the horizontal and vertical lines drawn from each relevant pixel.
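As a sketch of how one of these quadrants can be located from a sprite's alpha channel
(assuming RGBA sprite images and NumPy; the function name and return format are ours,
not part of YARDS):

    # Simplified sketch for the upper-left transparency quadrant. We take (a) the
    # first visible pixel in the first visible row and (b) the first visible pixel
    # in the first visible column; the vertical line through (a) and the horizontal
    # line through (b) meet at the quadrant's inner corner. The other three
    # quadrants are computed analogously from the last visible rows and columns.
    import numpy as np
    from PIL import Image

    def upper_left_quadrant_corner(sprite_path, threshold=0):
        alpha = np.array(Image.open(sprite_path).convert("RGBA"))[:, :, 3]
        rows = np.where(alpha.max(axis=1) > threshold)[0]   # rows containing visible pixels
        cols = np.where(alpha.max(axis=0) > threshold)[0]   # columns containing visible pixels
        first_row, first_col = rows[0], cols[0]
        x_a = np.where(alpha[first_row] > threshold)[0][0]      # (a): column of first visible pixel
        y_b = np.where(alpha[:, first_col] > threshold)[0][0]   # (b): row of first visible pixel
        return int(x_a), int(y_b)                                # inner corner (x, y)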
                                  Table 1: YARDS Configuration Parameters

    game_title               The game's title, which is prepended to each image's filename to avoid naming conflicts.
    num_images               The total number of images to generate.
    train_size               The proportion of total images which should be included in the train set.
    mix_size                 The proportion of total real images which should be included in the train set.
    label_all_classes        Determines whether all classes should be labeled or if only specific classes should be.
    labeled_classes          Determines which classes to label if label_all_classes is false. Useful for focusing
                             attention on a single sprite and introducing noise in the form of other sprites or
                             random images.
    max_sprites_per_class    The maximum number of sprites per class which can appear in any given image. If set to
                             -1, no cap will be set. Provides a means for limiting noise. Useful primarily when
                             setting classification_scheme to random, as it allows for more control of the
                             distribution.
    transform_sprites        Another means for introducing noise. If set to true, transforms sprites by rotating by
                             a multiple of ninety degrees, mirroring, or scaling to twice their original size.
                             Scaling is fixed at a factor of two because pixel art gets distorted by any non-double
                             scaling.
    clip_sprites             Determines whether to keep all sprites entirely on screen or to allow some sprite
                             clipping.
    classification_scheme    Determines the classification scheme by which to place sprites.


   Once we have the sprite's transparency quadrants, we determine where the sprite is
clipped by the boundaries of the screenshot. For example, if it is partially off the
left side of the screenshot, we check the right-most quadrants (Fig. 5). Then, if the
larger width of the two quadrants (i.e., the transparency space) is greater than the
width of the sprite that is visible after clipping (i.e., the clipping space), not
enough of the sprite is within the boundaries of the screenshot. For instance, assume
the transparency space is greater than or equal to the clipping space for a sprite
being clipped off the left of an image. In this case, we average the x-positions of
the inner vertical edges of all four transparency quadrants. Then, we crop the sprite
to be from the resulting average x-position to the sprite's rightmost border and paste
the cropped sprite into the screenshot with the sprite's left border aligned with the
screenshot's left border. Because we use all four transparency edges to determine how
to crop the sprite, we know that enough of the sprite's useful information will appear
on the generated image. We perform an analogous procedure for each screen boundary
that the sprite overlaps.
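A rough sketch of one reading of this left-edge check, assuming each transparency
quadrant is represented by its width and the x-position of its inner vertical edge
(the data structure and function name below are illustrative, not YARDS's code):

    # Hypothetical sketch of the left-edge clipping decision described above.
    # `quadrants` maps quadrant names to dicts with "width" and "inner_edge_x";
    # `visible_width` is how many sprite columns remain on screen after clipping.
    def crop_for_left_clip(sprite, quadrants, visible_width):
        # "Transparency space": the larger width of the two right-most quadrants.
        transparency_space = max(quadrants["upper_right"]["width"],
                                 quadrants["bottom_right"]["width"])
        if transparency_space >= visible_width:
            # Not enough of the sprite would be visible: average the x-positions of
            # the inner vertical edges of all four quadrants and keep only the part
            # of the sprite from that x to its right border (a PIL Image crop).
            avg_x = sum(q["inner_edge_x"] for q in quadrants.values()) / 4
            return sprite.crop((int(avg_x), 0, sprite.width, sprite.height))
        return sprite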
                              Using YARDS

Integrating YARDS into object detection projects is simple. The development pipeline
with YARDS involves three main steps: preprocessing, synthetic data generation, and
model training. In preprocessing, we gather sprite and background images for a given
game and define the corresponding folder locations in the configuration file. During
synthetic data generation, we define the parameters in the rest of the configuration
file and run YARDS via the command line to generate the images. The command-line
package for YARDS takes up to two parameters. The configuration parameter (--config or
-c) defines the path to the configuration file, and the visualize parameter
(--visualize or -v) tells the package to draw bounding boxes for a sample of images.
Finally, in model training, we train the model and validate it in a conventional
machine learning pipeline.

Configuration Parameters

YARDS has multiple configuration parameters that the user must define prior to using
the package. Table 1 summarizes what each parameter does, and the complete
documentation is available in the project's source code repository.

   clip_sprites and classification_scheme are the parameters that control our
synthetic data generation approach. clip_sprites determines whether or not the
synthetic screenshots should have clipped sprites. classification_scheme accepts one
of four keywords that define different methods for characterizing the sprite frequency
space: distribution, mimic-real, random, and discrete. The distribution method takes a
set number of predefined classes such as player, enemy, or item and corresponding
sprite frequency spaces for each class, represented by an array. For instance, player:
[0.20, 0.40, 0.40] means that for the player class, zero sprites should appear twenty
percent of the time, one sprite should appear forty percent of the time, and two
sprites should appear forty percent of the time. The mimic-real method analyzes a set
of pre-labeled images to approximate the sprite distribution in a dataset and takes as
input an array of class numbers, which correspond to the class numbers in the image
labels. It then uses the approximated distributions to generate the images. The random
method samples each class with a uniform distribution, given the maximum number of
sprites for each class. The discrete method takes inspiration from games like Street
Fighter II, where each screen has a constant number of sprites, and takes as input a
constant number of sprites to display on each screenshot.
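As an illustration only, a configuration covering the parameters in Table 1 might be
assembled and saved as follows; the key names, nesting, and on-disk format here are
assumptions, and the real spelling should be taken from the YARDS documentation. The
resulting file would then be passed to the tool through the --config/-c parameter,
with --visualize/-v drawing bounding boxes on a sample of the output.

    # Hypothetical configuration mirroring the parameters in Table 1; the actual
    # format and key names may differ from this sketch.
    import json

    config = {
        "game_title": "super_mario_bros",
        "num_images": 10000,
        "train_size": 0.8,
        "mix_size": 0.0,                 # proportion of real images mixed into the train set
        "label_all_classes": True,
        "labeled_classes": [],
        "max_sprites_per_class": -1,     # -1 means no cap
        "transform_sprites": False,
        "clip_sprites": True,
        "classification_scheme": "distribution",
        # Sprite frequency spaces for the distribution scheme, e.g. the player
        # appears once 80% of the time and not at all 20% of the time.
        "classes": {
            "player": [0.20, 0.80],
            "enemy": [0.10, 0.30, 0.30, 0.20, 0.10],
        },
    }

    with open("config.json", "w") as f:
        json.dump(config, f, indent=2)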
                            Tests and Results

To evaluate our dataset synthesizer, we compare model performance for different
datasets from a single game, train binary classifiers to detect real versus synthetic
data, and demonstrate the generalizability of our approach.
Figure 6: Detection Results of Models Trained on Various Datasets for Super Mario Bros.

Training on Synthetic, Real, and Mixed Datasets

To compare the performance of models trained on synthetic data to those trained on
real data, we trained nine YOLO models on various datasets for Super Mario Bros.
(screenshot dimension: 256×192). Each YOLOv5 model (Ultralytics 2020) was trained for
200 epochs with batch-size 32, and the best weights for each model (i.e. the weights
that yielded the best model performance in training) were validated on 250 real
images. We used mAP@0.5, a mean average precision metric for measuring the performance
of object detection models, as our single-valued evaluation metric.
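For reference, mAP@0.5 combines the standard intersection-over-union (IoU) criterion
with per-class average precision; in the usual notation,

    \mathrm{IoU}(B_p, B_{gt}) = \frac{|B_p \cap B_{gt}|}{|B_p \cup B_{gt}|},
    \qquad
    \mathrm{mAP@0.5} = \frac{1}{C} \sum_{c=1}^{C} \mathrm{AP}_c ,

where a predicted box B_p counts as a true positive when its IoU with a ground-truth
box B_gt is at least 0.5, and AP_c is the area under the precision-recall curve for
class c.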

Table 2: Mean average precision of YOLO models trained on various datasets for Super
Mario Bros. (Better-than-baseline performance is in bold.)

    Training Set Composition                   mAP@0.5
    750 real (baseline)                        0.856
    375 real + 375 synthetic                   0.886
    75 real + 675 synthetic                    0.841
    750 synthetic                              0.677
    750 synthetic + fine-tuning w/ 750 real    0.881
    375 real + 3,375 synthetic                 0.956
    3,750 synthetic                            0.816
    500 real + 9,500 synthetic                 0.963
    10,000 synthetic                           0.935

   Table 2 confirms the results shown by researchers in other image recognition
domains, suggesting that training on mixed datasets of real and synthetic images
yields the best model performance—although the models trained on 3,750 and 10,000
synthetic images show that training with large purely synthetic datasets can also work
well. The model trained on 750 synthetic images also improved significantly after
fine-tuning with 750 real images for 41 epochs. We initialized fine-tuning to train
the model's weights for 200 epochs, but the package fast-forwarded the fine-tuning
process to the last 41 epochs. Based on these results, we recommend using YARDS to
generate either very large synthetic datasets or smaller supplementary synthetic
datasets to boost manually labeled images.

Binary Classification of Synthetic vs. Real Data

To test whether computer vision models could distinguish between real and synthetic
data, we trained LeNet-5 (Lecun et al. 1998), AlexNet (Krizhevsky, Sutskever, and
Hinton 2012), and ResNet-50 (He et al. 2015)—three classic CNN architectures of
increasing complexity—to classify real versus synthetic images. Each architecture was
modified to take in inputs of 256 × 256 × 3, configured with binary cross-entropy loss
and the Adam optimizer, and trained for 50 epochs with batch-size 64. LeNet-5 was
modified to use max-pooling and ReLU activation. We trained and tested the classifiers
on 4,000-image datasets composed of real and synthetic training images from Super
Mario Bros.
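As a sketch of the kind of modification described here (our own approximation in
PyTorch, not the authors' exact classifier), a LeNet-5-style binary network with ReLU
activations and max-pooling for 256 × 256 × 3 inputs could look like this:

    # Sketch of a LeNet-5-style binary classifier adapted as described above:
    # 256x256x3 inputs, ReLU activations, max-pooling, binary cross-entropy, Adam.
    import torch
    import torch.nn as nn

    class LeNet5Binary(nn.Module):
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 6, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            )
            self.classifier = nn.Sequential(
                nn.Flatten(),
                nn.Linear(16 * 61 * 61, 120), nn.ReLU(),
                nn.Linear(120, 84), nn.ReLU(),
                nn.Linear(84, 1),                  # single logit: real vs. synthetic
            )

        def forward(self, x):
            return self.classifier(self.features(x))

    model = LeNet5Binary()
    criterion = nn.BCEWithLogitsLoss()             # binary cross-entropy on the logit
    optimizer = torch.optim.Adam(model.parameters())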
Table 3: Accuracy of binary classifiers trained to classify synthetic versus real
Super Mario Bros. gameplay images

    Model        Accuracy
    LeNet-5      0.5000
    AlexNet      0.8490
    ResNet-50    0.9690

   Contrary to our original hypothesis that each model would achieve roughly a 50%
accuracy, Table 3 suggests that architectures with many weights (e.g. AlexNet and
ResNet-50) are able to distinguish between real and synthetic images, while smaller
ones like LeNet-5 are not. We therefore need to develop a training strategy to account
for the discrepancy between synthetic and real data; in the future, we may be able to
leverage these binary classifiers to guide further improvements to our synthetic data
generation approach.

Generalization

Table 4: Mean average precision of datasets containing a mixture of generated data
from Super Mario Bros., Super Street Fighter II, and Earthbound

    Dataset                    mAP@0.5
    60k imgs w/ clipping       0.9525
    60k imgs w/o clipping      0.9635

We trained models on two very large datasets composed of synthetic data for three
separate games: Super Mario World, Super Street Fighter II, and Earthbound (screenshot
dimension: 256 × 224). We generated 20,000 images for each game, combining them into a
total dataset of 60,000 images. We split the dataset at a 0.8 train-test ratio and
generated two variants: one with clipping and one without. The results can be seen in
Table 4.
Model performance dropped for games outside of those the model was trained on, which
may be due to the lack of sprites representing the larger sphere of NES/SNES games.
Given a wider variety of sprites and games, however, we believe in the possibility of
training a general detection model for most NES/SNES games.

Figure 7: Detection Results of Models Trained on Large Dataset with Clipping for Super
Mario Bros., Super Street Fighter II, and Earthbound


                              Discussion

Although models trained on purely synthetic datasets do not perform as well as those
trained on purely real datasets, our results suggest that training models with mixed
synthetic-and-real datasets can increase overall sprite detection performance.
Furthermore, synthesizing artificial data is much more efficient than collecting real
gameplay screenshots and accelerates object-detection model development, ultimately
allowing for better-performing models. Additionally, the benefits of large synthetic
datasets may outweigh the advantages of using real images and make sprite detection
tools more accessible.

   Training videogame object detectors on mixed datasets can be useful for many
applications. For example, it may accelerate research in automated game design
learning and help relax the requirement of deep visibility into the inner workings of
emulated game hardware for distinguishing game sprites from the level geometry. We
also envision methods for game developers to improve the accessibility of their games
by verifying that a trained model recognizes sprites and their labels in a way which
is consistent with a designer's intention (e.g. that enemies "read" as enemies, that a
character is not easy to misinterpret as background texture, etc.). Such a model could
help predict whether future players would be able to make the same assumptions and
easily identify the playable parts of the game.

   Synthetic data generation could also be useful as a feature extraction tool for
reinforcement learning or other general game-playing agents. By training models that
can accurately identify features of sprites belonging to classes like helpful,
harmful, item, enemy, etc. (perhaps borrowed from an affordance grammar like that of
Bentley and Osborn 2019), sprite detection models may help reinforcement learning
agents train faster and generalize more effectively.

   That being said, there are numerous ways to improve our approach. First, we suggest
using model visualization techniques (e.g. class activation maps, occlusion
sensitivity, gradient ascent) to visualize what the models see when trained with
synthetic versus real data. Second, an addition that could greatly improve our current
approach would be to define spatial curves in addition to sprite frequency spaces.
Using the spatial curves to paste sprites into regions where they would appear in real
gameplay images can serve as a way of increasing the realism of the synthetic images.
Third, trying out existing approaches, such as domain randomization (Liu, Liu, and Luo
2020; Borrego et al. 2018; Tremblay et al. 2018), increasing the accuracy of images in
relation to natural data (Liu, Liu, and Luo 2020), using generative models or GANs
(Goodfellow et al. 2014; Liu, Liu, and Luo 2020; Bailo, Ham, and Shin 2019; Triastcyn
and Faltings 2018), and procedural content generation (Nikolenko 2019), may provide
further insight into how to refine our approach.

   Our YARDS implementation can also benefit from additional features. First,
incorporating multiprocessing would greatly increase the speed of synthetic data
generation. Our current package generates 80 images per second on a single core with
no GPU acceleration for Super Mario Bros., and parallelizing this task would increase
the package's efficiency. Second, adding basic image rendering and filtering functions
such as blurring or pixelating sprites may be useful for videogames that do not use
the pixelated style and resolution common to the four games we examined in this work.
Third, color filtering functions may help the object detection models learn the
sprites' essential features and avoid overfitting to their color patterns. Fourth, we
would like to see added support for games in a wider variety of gameplay styles and
genres. Finally, adding text detection functions may help with including basic
user-interface elements.

   In summary, this paper has introduced an application of existing synthetic data
generation research to the problem of sprite detection and a software package that
enables an end-user to rapidly generate large, synthetic training images based on
sprite frequency spaces and edge-handling. An open-source and working prototype of
YARDS is available at https://github.com/faimSD/yards. We hope that our paper and
software package will inspire further research in sprite detection and in computer
vision and games.
                              References

Bailo, O.; Ham, D.; and Shin, Y. M. 2019. Red blood cell image generation for data
augmentation using conditional generative adversarial networks.

Beery, S.; Liu, Y.; Morris, D.; Piavis, J.; Kapoor, A.; Meister, M.; Joshi, N.; and
Perona, P. 2019. Synthetic examples improve generalization for rare classes.

Bentley, G. R., and Osborn, J. C. 2019. The videogame affordances corpus. In 2019
Experimental AI in Games Workshop.

Bochkovskiy, A.; Wang, C.-Y.; and Liao, H.-Y. M. 2020. YOLOv4: Optimal speed and
accuracy of object detection.

Borrego, J.; Dehban, A.; Figueiredo, R.; Moreno, P.; Bernardino, A.; and
Santos-Victor, J. 2018. Applying domain randomization to synthetic data for object
category detection.

Di Cicco, M.; Potena, C.; Grisetti, G.; and Pretto, A. 2017. Automatic model based
dataset generation for fast and accurate crop and weeds detection. 2017 IEEE/RSJ
International Conference on Intelligent Robots and Systems (IROS).

Dwibedi, D.; Misra, I.; and Hebert, M. 2017. Cut, paste and learn: Surprisingly easy
synthesis for instance detection.

Georgakis, G.; Mousavian, A.; Berg, A. C.; and Kosecka, J. 2017. Synthesizing training
data for object detection in indoor scenes.

Goodfellow, I. J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.;
Courville, A.; and Bengio, Y. 2014. Generative adversarial networks.

Guzdial, M., and Riedl, M. 2016. Toward game level generation from gameplay videos.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2015. Deep residual learning for image
recognition. CoRR abs/1512.03385.

Hinterstoisser, S.; Pauly, O.; Heibel, H.; Marek, M.; and Bokeloh, M. 2019. An
annotation saved is an annotation earned: Using fully synthetic training for object
instance detection.

Koylu, C.; Zhao, C.; and Shao, W. 2019. Deep neural networks and kernel density
estimation for detecting human activity patterns from geo-tagged images: A case study
of birdwatching on Flickr. ISPRS International Journal of Geo-Information 8(1).

Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with
deep convolutional neural networks. In Pereira, F.; Burges, C. J. C.; Bottou, L.; and
Weinberger, K. Q., eds., Advances in Neural Information Processing Systems 25. Curran
Associates, Inc. 1097–1105.

Lateh, M. A.; Muda, A. K.; Yusof, Z. I. M.; Muda, N. A.; and Azmi, M. S. 2017.
Handling a small dataset problem in prediction model by employ artificial data
generation approach: A review. Journal of Physics: Conference Series 892:012016.

Lecun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning
applied to document recognition. In Proceedings of the IEEE, 2278–2324.

Liu, W.; Liu, J.; and Luo, B. 2020. Can synthetic data improve object detection
results for remote sensing images?

Luo, Z.; Guzdial, M.; Liao, N.; and Riedl, M. 2018. Player experience extraction from
gameplay video. CoRR abs/1809.06201.

Nikolenko, S. I. 2019. Synthetic data for deep learning.

Nowruzi, F. E.; Kapoor, P.; Kolhatkar, D.; Hassanat, F. A.; Laganiere, R.; and Rebut,
J. 2019. How much real data do we actually need: Analyzing object detection
performance using synthetic and real data.

Osborn, J. C.; Summerville, A.; and Mateas, M. 2017. Automated game design learning.

Rajpura, P. S.; Bojinov, H.; and Hegde, R. S. 2017. Object detection using deep CNNs
trained on synthetic images.

Redmon, J., and Farhadi, A. 2016. YOLO9000: Better, faster, stronger.

Redmon, J., and Farhadi, A. 2018. YOLOv3: An incremental improvement.

Redmon, J.; Divvala, S.; Girshick, R.; and Farhadi, A. 2015. You only look once:
Unified, real-time object detection.

Roig, C.; Varas, D.; Masuda, I.; Riveiro, J. C.; and Bou-Balust, E. 2020. Unsupervised
multi-label dataset generation from web data.

Rozantsev, A.; Lepetit, V.; and Fua, P. 2015. On rendering synthetic images for
training an object detector. Computer Vision and Image Understanding 137:24–37.

Seib, V.; Lange, B.; and Wirtz, S. 2020. Mixing real and synthetic data to enhance
neural network training – a review of current approaches.

Shaked, S., and Rokach, L. 2020. PrivGen: Preserving privacy of sequences through data
generation.

Summerville, A.; Guzdial, M.; Mateas, M.; and Riedl, M. 2016a. Learning player
tailored content from observation: Platformer level generation from video traces using
LSTMs. In AAAI Conference on Artificial Intelligence and Interactive Digital
Entertainment.

Summerville, A. J.; Snodgrass, S.; Mateas, M.; and Ontanón, S. 2016b. The VGLC: The
video game level corpus. arXiv preprint arXiv:1606.07487.

Summerville, A.; Osborn, J.; and Mateas, M. 2017. CHARDA: Causal hybrid automata
recovery via dynamic analysis. arXiv preprint arXiv:1707.03336.

Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.;
Vanhoucke, V.; and Rabinovich, A. 2014. Going deeper with convolutions.

Tremblay, J.; Prakash, A.; Acuna, D.; Brophy, M.; Jampani, V.; Anil, C.; To, T.;
Cameracci, E.; Boochoon, S.; and Birchfield, S. 2018. Training deep networks with
synthetic data: Bridging the reality gap by domain randomization.

Triastcyn, A., and Faltings, B. 2018. Generating artificial data for private deep
learning.

Ultralytics. 2020. YOLOv5.

Wong, M. Z.; Kunii, K.; Baylis, M.; Ong, W. H.; Kroupa, P.; and Koller, S. 2019.
Synthetic dataset generation for object-to-model deep learning in industrial
applications.

Yun, K.; Nguyen, L.; Nguyen, T.; Kim, D.; Eldin, S.; Huyen, A.; Lu, T.; and Chow, E.
2019a. Small target detection for search and rescue operations using distributed deep
learning and synthetic data generation.

Yun, W.; Lee, J.; Kim, J.; and Kim, J. 2019b. Balancing domain gap for object instance
detection.

Zhang, X.; Zhan, Z.; Holtz, M.; and Smith, A. M. 2018. Crawling, indexing, and
retrieving moments in videogames. In Proceedings of the 13th International Conference
on the Foundations of Digital Games, FDG '18. New York, NY, USA: Association for
Computing Machinery.