<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Minecraft via Disentangled Representation Learning Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tim Merino</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yifan Zhang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Julian Togelius</string-name>
          <email>julian.togelius@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Joint AIIDE Workshop on Experimental Artificial Intelligence in Games and Intelligent Narrative Technologies</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>New York University, Stern School of Business</institution>
          ,
          <addr-line>44 W 4th St, New York, NY 10012</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>New York University, Tandon School of Engineering, 6 MetroTech Center</institution>
          ,
          <addr-line>Brooklyn, NY 11201</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Procedural Content Generation, Controllable Generative AI, Disentangled Representation Learning</institution>
          ,
          <addr-line>Human-in-</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <fpage>2</fpage>
      <lpage>16</lpage>
      <abstract>
        <p>Learning disentangled representations of data is a well-studied problem in machine learning. Disentangled representations offer many advantages for downstream generative tasks, enabling interpretability and granular control by leveraging learned representation spaces with desirable properties. In modern generative pipelines, these representation learning models are typically just a stepping stone for downstream models, such as the popular diffusion model. In this paper, we seek to explore how these representation learning models can be utilized as stand-alone generation and editing models for Procedural Content Generation. We start by extending recent work in Disentangled Representation Learning to a unique domain - the discrete, categorical 3D world of Minecraft. We present a method for learning discrete, disentangled representations of Minecraft terrain by leveraging domain knowledge to incorporate strong inductive biases in our model. We first assume a dual-factor decomposition of Minecraft terrain into “style” and “structure” generative factors. Based on this assumption, we design a dual-codebook Factorized Quantization Variational Autoencoder, using codebook sizes significantly smaller than standard practice. We demonstrate how our model learns disentangled and human-interpretable representations of these generative factors, and attempt to quantify disentanglement using intervention-based metrics. Finally, we show how our learned representation enables human-level editing and generation of Minecraft terrain features via human latent authoring.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Learning disentangled representations of data is an active area of research in machine learning. The
appeal of disentangled representations can be understood from various lenses. One can view
Disentangled Representation Learning (DRL) as an effort to build models that think like humans do, building
a compositional semantic representation of our observations [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ]. Properly disentangled
representations can also provide interpretable, explainable features. As Artificial Intelligence increasingly
pervades aspects of everyday life, the need for explainable AI becomes quite pressing and clear. Finally,
and most relevant to this work, DRL provides an alternate method for controllable generation, without
the arduous requirement of labeled data [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>The appeals of DRL have attracted researchers across a variety of fields. We largely take inspiration
from work in the field of image generation, specifically the sub-field of neural style transfer.</p>
      <p>
        The concept of disentanglement is often cited as lacking a formalized definition
[
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ], and standard metrics with which to measure its presence [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. One accepted definition states
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]: “Disentangled representation should separate the distinct, independent and informative generative
factors of variation in the data. Single latent variables are sensitive to changes in single underlying
generative factors, while being relatively invariant to changes in other factors.” The core idea, that a set
of explanatory factors of variation underlies real-world data, seems particularly true in procedurally
generated game worlds.
      </p>
      <p>
        Minecraft is an open world, 3D voxel-based survival game that is extremely popular in the gaming,
education, and AI research communities [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref8 ref9">8, 9, 10, 11, 12</xref>
        ]. Core to the game loop is a robust Procedural
Content Generation (PCG) system that creates nearly infinite worlds for players to explore. These
generated worlds contain a variety of biomes and terrain features, including mountains, caves, rivers,
oceans, and deserts. This existing content generator provides an attractive data source for training
generative models, easily providing infinite samples to train from. However, at first glance, there is
no clear motivation for a Procedural Content Generation via Machine Learning (PCGML) approach. If
we already have a system that perfectly satisfies the needs of the game it operates in, why do we need
another generator? This “fundamental tension of PCGML” has a simple answer: control.
      </p>
      <p>
        The field of Generative AI has been making massive leaps in controllable generative models. While
text-to-image models such as DALL-E [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and Stable Diffusion [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] offer undeniably impressive control
over realistic images, these models require extensively labeled datasets, something in short supply
for video games. Instead of the text-to-asset task, we focus on an alternate method of control, using
semantically meaningful representations of data to manipulate it in desired ways. By introducing
controllability on top of existing PCG systems, we enable new and creative methods of interaction for both
design and play in simulated worlds.
      </p>
      <p>
        Building upon recent advances in discrete representation learning, we attempt to answer the call put
forth by [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. First, we introduce a simple inductive bias via a two-stage decoder network, introducing a
self-supervised objective for categorical voxel data. Second, we demonstrate the concrete, practical
benefits of enforcing a specific notion of style-content disentanglement on the learned representations.
Specifically, we demonstrate the potential for utilizing the learned representation for human-in-the-loop
generation and editing.
      </p>
      <p>Our contributions can be summarized as follows:
• We introduce a novel Factorized Quantization Variational Autoencoder that learns a disentangled
representation of Minecraft voxel terrain data using considerably smaller codebook sizes by
leveraging a two-stage architecture.
• We demonstrate that our model learns a meaningful and semantically rich representation of style
and structural features, assigning human-interpretable labels to each codebook vector.
• We demonstrate the potential applications of this model for controllable, interactive Procedural
Content Generation via Machine Learning (PCGML).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background and Related Works</title>
      <sec id="sec-2-1">
        <title>2.1. Disentanglement</title>
        <p>
          InfoGAN [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] is an early application of unsupervised DRL in GAN-based models, using a mutual
information regularization objective between the latent variable and observation to encourage
disentanglement. Using this approach, they learn interpretable latents that control digit type, rotation, and
width of digits on MNIST, as well as latents for controlling features such as hair style and emotion in
images of celebrity faces.
        </p>
        <p>
          Later works in image generative models narrow the scope of disentanglement, aiming specifically
to disentangle concepts of “style” from “content” in images [
          <xref ref-type="bibr" rid="ref16 ref17 ref18 ref19 ref20">16, 17, 18, 19, 20</xref>
          ]. This enables Neural
Style Transfer, where learned representations are used to render the content of one image in the
style of another. While the concept of style and content disentanglement is well-understood and well
explored in image generation, there has been comparatively little exploration of this paradigm in the
context of games. [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] applies CycleGAN for cross-domain image-style transfer to render videos of the
game Fortnite in the style of another game, PlayerUnknown’s Battlegrounds. Other work evaluates
style/content disentanglement in pretrained Vision Transformer representations across a variety of video
games, aiming to find domain-general embeddings of content [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ].
        </p>
        <p>
          While there has been at least one exploration into 3D neural style transfer [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ], they limit their
domain to binary voxels, using a geometric concept of style (e.g., “cube” + “bunny”). We seek to explore
content and style disentanglement in a new domain — the 3D categorical voxel world of Minecraft —
using analogous concepts of “style” to those found in traditional image style transfer applications.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. VQ Models</title>
        <p>
          Vector Quantized Variational AutoEncoders (VQVAEs) are discrete representation learning models
that extend the classic Variational AutoEncoder [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ]. Rather than learning a continuous, compressed
representation of input data, VQVAEs encode data using codes from a fixed-size codebook of vectors.
        </p>
        <p>VQVAEs consist of at least two trainable networks: an encoder network E and a decoder network
G. Additionally, the model learns a codebook C, which contains a finite number (|C|) of n-dimensional
codebook vectors. The encoder takes as input x, which is passed through the encoder to produce an
encoding z_e = E(x). This encoding is then quantized by looking up the nearest codebook vector
z_q = argmin_{c ∈ C} ||z_e − c|| in the quantization step. This codebook vector is then used as input to the decoder
network, which generates a reconstruction x̂ = G(z_q). A straight-through estimator allows gradients to
flow through the non-differentiable quantization process.</p>
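        <p>As an illustrative sketch of this quantization step (not the implementation used in any specific work discussed here), the nearest-codebook lookup and straight-through estimator can be written in a few lines of PyTorch; the function and variable names below (quantize, z_e, codebook) are our own.</p>
        <preformat>
import torch

def quantize(z_e, codebook):
    """Nearest-codebook lookup with a straight-through estimator.

    z_e: encoder outputs flattened to shape (N, n_dim);
    codebook: codebook vectors of shape (num_codes, n_dim).
    """
    dists = torch.cdist(z_e, codebook)      # pairwise distances, (N, num_codes)
    indices = dists.argmin(dim=1)           # index of the nearest code per encoding
    z_q = codebook[indices]                 # quantized vectors, (N, n_dim)
    # Straight-through estimator: the forward pass uses z_q, but gradients
    # flow back to z_e as if quantization were the identity function.
    z_q_st = z_e + (z_q - z_e).detach()
    return z_q_st, indices
        </preformat>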
        <p>
          Semantic interpretability of codebook vectors is sometimes cited as a benefit of Vector Quantized
models such as VQVAEs. However, without disentanglement-specific regularization or inductive
biases, there is no guarantee that these latent vectors correlate with semantically relevant features of
the data distribution [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ] experimentally evaluate interpretability of VQ models in model-based
Reinforcement Learning. Using Grad-CAM [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ], they measure conceptual similarity between patches of
the input image that are salient to each code. They find that the codes have no guarantee of uniqueness
and limited impact on concept disentanglement, with codes rarely exceeding the embedding similarity
of randomly cropped image patches.
        </p>
        <p>
          In practice, Vector Quantization models are often used as a first stage for downstream generative
tasks. Both Latent Diffusion Models and autoregressive approaches have achieved state-of-the-art image
generation performance by utilizing discretized intermediate representations [
          <xref ref-type="bibr" rid="ref27 ref28 ref29">27, 28, 29</xref>
          ]. Codebook
sizes for these use-cases can be extreme, exceeding sizes that are feasibly human-interpretable even if
codes are semantically meaningful.
        </p>
        <p>
          To counter the issues introduced by large codebook sizes, [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ] propose Factorized Quantization
Generative Adversarial Networks (FQGAN). Rather than using a single large codebook, they learn
multiple smaller sub-codebooks. Using a disentanglement regularization term, they ensure their
sub-codes contain unique and complementary information about the input. We directly build upon their
work, motivated by a need for small and interpretable conceptual codebooks.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. PCGML in Minecraft</title>
        <p>
          This work is most closely related to existing PCGML research in Minecraft. Existing approaches largely
leverage Generative Adversarial Networks (GANs) for terrain and structure generation [
          <xref ref-type="bibr" rid="ref12 ref31 ref32">31, 12, 32</xref>
          ].
Additionally, the Generative Design in Minecraft (GDMC) competition [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ] has challenged participants
to create AI agents that produce functional and aesthetically pleasing settlements in Minecraft. This
work diverges in both scope and goal — we omit structures from our dataset, and focus on a new method
of controllable generation via manual latent intervention.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <sec id="sec-3-1">
        <title>3.1. Dataset</title>
        <p>We create a training dataset from Minecraft’s built-in procedural world generation system. We generate
a full Minecraft world from a random seed, using the default world generation settings as of Minecraft
version 1.12.2. We disable the “generate structures” setting, preventing houses and villages from being
generated.</p>
        <p>
          Using the Evocraft API [
          <xref ref-type="bibr" rid="ref34">34</xref>
          ], we collect cubes of terrain with a side length of 24 voxels by traversing the world
in a spiral along the X-Z plane (in Minecraft, the X and Z dimensions represent the horizontal plane,
and Y represents the vertical position). We stride our sampling window by 36 blocks each step to improve
biome diversity. We collect each sample from the surface of the world, centered on the Y-coordinate of
the highest non-air block. We collect 11119 samples of terrain to use as our dataset, which includes 42
naturally occurring block types. We one-hot encode this data and apply rotation augmentations around
the X and Y axes to improve our model’s generalization ability. The final shape of each terrain chunk
after processing is 24 × 24 × 24 × 42.
        </p>
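        <p>As a sketch of this preprocessing (with helper names and the mapping of array axes to in-game axes chosen by us, and assuming block types are encoded as integer ids), the one-hot encoding and rotation augmentation might look as follows.</p>
        <preformat>
import numpy as np

NUM_BLOCK_TYPES = 42  # naturally occurring block types in our dataset

def one_hot_chunk(chunk):
    """One-hot encode a (24, 24, 24) array of integer block ids into
    a (24, 24, 24, 42) float array."""
    return np.eye(NUM_BLOCK_TYPES, dtype=np.float32)[chunk]

def rotation_augmentations(chunk):
    """Yield rotated copies of a block-id chunk. Which array axes correspond
    to the in-game X and Y axes is an assumption of this sketch."""
    for k in range(4):
        yield np.rot90(chunk, k=k, axes=(0, 2))  # rotations about the vertical axis
    for k in range(1, 4):
        yield np.rot90(chunk, k=k, axes=(1, 2))  # rotations about a horizontal axis
        </preformat>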
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Disentanglement</title>
        <p>We start by defining the two types of information (or generative factors) we want to disentangle from
our data. In this work, we choose two factors commonly used in image generation: style and content.</p>
        <p>
          The specific factors of variation these terms describe vary greatly by dataset and application. To
make clear the difference between typical image content and the 3D voxel world, we call our factors style
and structure, returning to terms used by [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]. We define style as describing the categorical block types
used in a patch of Minecraft voxels. Minecraft has over one thousand such block types, all rendered
using different colors, textures, and occasionally shapes (with the exception of air, which describes an
empty voxel). Rather than encoding the exact block types used at each spatial location, we conceptualize
style as a palette, or subset, of block types. For example, a “beach” style would describe a collection
of sand and water blocks.
        </p>
        <p>
          As in [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ], we define structure as “the underlying geometry of the scene”. In a given patch of voxels,
structure describes which spatial locations have blocks present. We can obtain the structural factor of
an area by setting all non-air blocks to 1, and all air blocks to 0. This results in a binary matrix
that describes which blocks are “on” and which are “off”. It is worth noting that this factorization
diverges from most DRL research by discarding the factor independence assumption [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
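        <p>As a minimal sketch of this binarization (assuming block types are integer ids and that the id of air, here called AIR_ID, is known), the structural factor can be computed as follows.</p>
        <preformat>
import numpy as np

AIR_ID = 0  # assumed integer id of the "air" block type

def structure_factor(chunk):
    """Binary structural factor: 1 where any block is present, 0 where the
    voxel is air. `chunk` is a (24, 24, 24) array of integer block ids."""
    return (chunk != AIR_ID).astype(np.float32)
        </preformat>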
        <p>
          We note that structure and style (as we have defined them) have some causal relations. For example,
the structure of a tree is closely tied to the block types leaf and log. So while our chosen factors
represent different semantic concepts, they are not statistically independent. As noted by [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], this
independence condition is hard to satisfy in real-world datasets. While we may not learn a perfectly
disentangled representation of these two factors, this factorization allows for intuitive ideas of style
transfer and structure transfer, which would be beneficial features for level designers, players, and
developers.
        </p>
        <p>
          Unlike other works that explore similar notions of style and concept disentanglement, we do not
compress these generative factors into scalar values [
          <xref ref-type="bibr" rid="ref35 ref36">35, 36</xref>
          ]. We utilize a vector-wise representation
structure [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], maintaining the spatial relationships of our data. This leverages the spatial inductive bias of
convolutional networks, enabling granular and spatially-localized control in our latent representations.
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. FQVAE</title>
        <p>
          We design a novel Factorized Quantization Variational Autoencoder model to encode our dataset of
terrain chunks into a discrete, disentangled representation. We start with a dual-codebook FQVAE,
based on the implementation in [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ]. Rather than using representation learning objectives based on
frozen pretrained models, we encourage disentanglement via additional loss objectives and architectural
design.
        </p>
        <sec id="sec-3-2-1">
          <title>3.3.1. Encoder Network</title>
          <p>
            Our encoder network serves as a base feature extractor, learning general voxel features and outputting
a sequence of feature vectors of size n. This network resembles the encoder network of a standard VQVAE,
using 3D convolutional residual blocks and gradually downsampling the spatial dimensions of our data.
Our encoder has a downsampling factor of 4, with self-attention performed at the lowest resolution.
This yields a 6 × 6 × 6 × n feature representation. Following this feature-extractor base, we add two
feature adapter heads, E_sty and E_str, to translate base voxel features into style and structure features,
as done in [
            <xref ref-type="bibr" rid="ref30">30</xref>
            ].
          </p>
          <p>Each feature adapter head contains two more 3D convolutional residual blocks, which transform the
encoder’s features into codebook-specific feature vectors f_sty and f_str. Each encoded representation
has dimensions 6 × 6 × 6 × n, where n is the codebook vector size.</p>
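          <p>A simplified sketch of this encoder topology is shown below; plain convolutions stand in for the residual blocks and self-attention of the actual model, and the channel widths, kernel sizes, and module names are illustrative choices of our own.</p>
          <preformat>
import torch.nn as nn

class FactorizedEncoder(nn.Module):
    """Base voxel feature extractor with style and structure adapter heads
    (a structural sketch, not the exact configuration used in this work)."""
    def __init__(self, num_block_types=42, base_ch=64, code_dim=32):
        super().__init__()
        # Downsampling factor of 4 (spatial size 24 to 6) via two strided convolutions.
        self.base = nn.Sequential(
            nn.Conv3d(num_block_types, base_ch, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv3d(base_ch, base_ch, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv3d(base_ch, base_ch, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        def adapter():
            # Feature adapter head: translates base features into
            # codebook-specific feature vectors.
            return nn.Sequential(
                nn.Conv3d(base_ch, base_ch, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.Conv3d(base_ch, code_dim, kernel_size=3, padding=1),
            )
        self.style_head = adapter()
        self.structure_head = adapter()

    def forward(self, x):
        # x: one-hot voxels of shape (batch, 42, 24, 24, 24)
        h = self.base(x)                                   # (batch, base_ch, 6, 6, 6)
        return self.style_head(h), self.structure_head(h)  # f_sty, f_str
          </preformat>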
        </sec>
        <sec id="sec-3-2-2">
          <title>3.3.2. Quantizer and Codebooks</title>
          <p>
We define two separate codebooks for style and structure, C_sty and C_str respectively. For quantization,
we use exponential moving average updates (EMA) for both codebooks, which we find leads to much
higher codebook utilization than updating via the original VQVAE loss function [
            <xref ref-type="bibr" rid="ref37">37</xref>
            ].
          </p>
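          <p>For reference, a sketch of the standard exponential-moving-average codebook update (the EMA variant of VQVAE training in general, not our exact code) is given below; the tensor names, decay value, and smoothing constant are illustrative.</p>
          <preformat>
import torch

def ema_codebook_update(codebook, cluster_size, embed_avg, z_e, indices,
                        decay=0.99, eps=1e-5):
    """One EMA codebook update step.

    codebook:     (num_codes, n_dim) current codebook vectors
    cluster_size: (num_codes,) running count of assignments per code
    embed_avg:    (num_codes, n_dim) running sum of assigned encodings
    z_e:          (N, n_dim) encoder outputs; indices: (N,) assigned code ids
    """
    one_hot = torch.nn.functional.one_hot(indices, codebook.shape[0]).type_as(z_e)
    # Update running statistics with the current batch.
    cluster_size.mul_(decay).add_(one_hot.sum(dim=0), alpha=1 - decay)
    embed_avg.mul_(decay).add_(one_hot.t() @ z_e, alpha=1 - decay)
    # Laplace-smoothed normalization, then move codes toward assigned encodings.
    n = cluster_size.sum()
    smoothed = (cluster_size + eps) / (n + codebook.shape[0] * eps) * n
    codebook.copy_(embed_avg / smoothed.unsqueeze(1))
    return codebook
          </preformat>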
          <p>
            The size of our codebooks is significantly smaller than codebooks typically used in image VQVAE
models, such as the 2886 codes used in VQ-Diffusion [
            <xref ref-type="bibr" rid="ref38">38</xref>
            ] or the 8192 codes used in DALL-E [
            <xref ref-type="bibr" rid="ref39">39</xref>
            ]. We
limit our codebook sizes for three main reasons. First, as discussed by [
            <xref ref-type="bibr" rid="ref30">30</xref>
            ], two smaller codebooks
provide a combinatorially large “conceptual codebook” by considering unique combinations of codes.
Thus, despite the difference in size, our representational capacity is not as low as it may appear. Second,
we are motivated by findings in [
            <xref ref-type="bibr" rid="ref25">25</xref>
            ] that the majority of codes do not encode semantically similar
concepts. We hypothesize that by restricting the number of available codes for each concept codebook,
we can encourage semantically consistent representations by necessity.
          </p>
          <p>
            Third, and most importantly, small codebook sizes are needed for human-facing interactive systems.
The goal of this work is to enable human-in-the-loop generation and editing, using labeled codebook
vectors. With large codebook sizes, it is infeasible for a human to parse and understand thousands
of labeled codes. Further, large codebooks introduce user fatigue, as it becomes tedious to navigate
through and select the appropriate code to induce a user’s desired modification to the output. For our
final model, we set |C_sty| = 16, with |C_str| kept similarly small.
          </p>
        </sec>
        <sec id="sec-3-3-3">
          <title>3.3.3. Decoder</title>
          <p>
            Despite the impressive results achieved by [
            <xref ref-type="bibr" rid="ref30">30</xref>
            ] using pretrained vision models to create a
disentanglement objective, models that can serve a similar purpose are, to our knowledge, non-existent
for 3D Minecraft voxel data. Without the representation learning objective, our model maintains only the
considerably weaker disentanglement objective ℒ_ort = (1/N) Σ_i,j ((z_sty,i)⊤ z_str,j)². This pushes the
style and structure codes involved towards orthogonality, but says nothing of the types of information
encoded in each codebook.
          </p>
          <p>To fill this gap, we create a two-stage decoder network. By reconstructing the structure and style
information separately and sequentially, we encourage our model to encode meaningful semantic
information into the respective codebooks. Our decoder can be thought of as two separate networks,
G_1 and G_2, which function at different stages of reconstruction.</p>
          <p>In the first stage, G_1 takes as input only the structure codebook’s latent representation z_str. This
stage attempts to reconstruct a binarized representation of input x, referred to as x_b, which has an easily
computable ground truth via the binarization process described in Section 3.2. To return to the original
spatial dimensions of x, G_1 uses transposed 3D convolutional residual layers, giving x̂_b = G_1(z_str). We use a
binary reconstruction loss, ℒ_bin(x_b, x̂_b), to train the first stage of our decoder network.</p>
          <p>This self-supervised binary reconstruction objective teaches our model to first construct the
“scaffolding” of the reconstruction, providing blocks to the second stage to be “painted” with block styles.</p>
          <p>Before entering the second stage, we up-sample our style codebook vectors z_sty to match the
spatial dimensions of x̂_b. We then channel-wise concatenate the up-sampled style codes and the binary
reconstruction. This concatenated tensor is passed as input to the second stage, where G_2 attempts to
generate the fully reconstructed input x̂ using further 3D convolutional layers: x̂ = G_2(concat(x̂_b, up(z_sty))).</p>
          <p>We compute a categorical cross-entropy reconstruction
loss between x̂ and the input terrain chunk x. We use a stop-gradient operation to prevent the gradient
from the full reconstruction from flowing through the first stage of the decoder, which should only
learn the binary reconstruction task: ℒ_rec = ℒ_CE(x, x̂).</p>
          <p>This ensures that our style codes contain the
additional information needed to paint our binarized reconstruction with the appropriate block types
to achieve full reconstruction of  .</p>
          <p>
            We omit the adversarial and perceptual loss terms of [
            <xref ref-type="bibr" rid="ref30">30</xref>
            ], finding the former detrimental to model
performance, and the latter impossible due to the lack of suitable models. This gives us an FQVAE, rather
than an FQGAN. Our final training objective becomes ℒ_total = λ_1 ℒ_rec + λ_2 ℒ_bin + λ_3 ℒ_ort.
          </p>
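          <p>For concreteness, the sketch below summarizes the two-stage decoding pass and the combined training objective; the module interfaces (g1, g2), the use of binary cross-entropy with logits, the nearest-neighbor up-sampling, and the default loss weights are illustrative assumptions rather than our exact implementation.</p>
          <preformat>
import torch
import torch.nn.functional as F

def two_stage_decode_and_losses(g1, g2, z_str, z_sty, x, x_bin,
                                lambdas=(1.0, 1.0, 1.0)):
    """Sketch of the two-stage decoding pass and training objective.

    g1, g2 are the two decoder stages; z_str, z_sty are the quantized
    structure/style latents (B, code_dim, 6, 6, 6); x is the one-hot input
    (B, 42, 24, 24, 24) and x_bin its binarized structure (B, 1, 24, 24, 24).
    """
    # Stage 1: reconstruct the binary "scaffolding" from structure codes only.
    x_bin_hat = g1(z_str)                                    # (B, 1, 24, 24, 24) logits
    loss_bin = F.binary_cross_entropy_with_logits(x_bin_hat, x_bin)

    # Stage 2: upsample style codes, concatenate with the (detached) scaffold,
    # and paint block types. The stop-gradient keeps stage 1 focused on structure.
    z_sty_up = F.interpolate(z_sty, scale_factor=4, mode="nearest")
    x_hat = g2(torch.cat([torch.sigmoid(x_bin_hat).detach(), z_sty_up], dim=1))

    # Categorical cross-entropy over block types for the full reconstruction.
    loss_rec = F.cross_entropy(x_hat, x.argmax(dim=1))

    # Orthogonality-style disentanglement term between the two latents.
    flat_sty = z_sty.flatten(2).transpose(1, 2)              # (B, 216, code_dim)
    flat_str = z_str.flatten(2).transpose(1, 2)
    loss_ort = (torch.bmm(flat_sty, flat_str.transpose(1, 2)) ** 2).mean()

    l1, l2, l3 = lambdas
    return l1 * loss_rec + l2 * loss_bin + l3 * loss_ort
          </preformat>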
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.4. Training</title>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Interpretability</title>
      <p>One stated benefit of DRL is interpretability of learned representations. For our goal of controllable,
human-interactive systems, this is a requirement: a designer must understand the semantic meaning of
their selected code in order to produce an expected change in the decoded output. To enable our model
for human-in-the-loop interactive pipelines, we leverage this interpretability to label our latent codes
in a way that a human can understand.</p>
      <sec id="sec-4-1">
        <title>4.1. Labeling Codes</title>
        <p>Both the encoder and decoder networks use self-attention layers, providing global context to each code.
As a consequence, there is no single “meaning” of each code, as each realization in voxel space will be at
least partially informed by other codes in the latent representation. Instead, we attempt to assign labels
by computing informative metrics for the relevant generative factor over voxel-space representations.</p>
        <p>We compute these metrics with respect to the reconstructed outputs of our decoder, x̂_b and x̂, as this is
the representation we can control with the latent representations.</p>
        <p>For our metrics, we make the simplifying assumption that each codebook vector is solely responsible
for the corresponding 4 × 4 × 4 area in x̂, due to the spatial downsampling factor of our encoder. For
example, the code at latent position [0, 0, 0] has corresponding structural information in x̂[0:4, 0:4, 0:4].</p>
        <p>The labeled codebooks used for experiments are found in Tables 1 and 2 of the Appendix.</p>
        <p>For editing and generation via latent intervention, a useful label for structural codebooks is a
“blueprint”. This blueprint is a binary 4 × 4 × 4 array, which conveys the expected arrangement of non-air
blocks in x̂_b when this code is used.</p>
        <p>To compute this, we first construct a dictionary of size |C_str|. For each structural code index, we
collect all of the corresponding 4 × 4 × 4 chunks in x̂_b across our entire encoded dataset. We visualize
the top 5 most frequent patterns for each code, along with their frequency, to act as the label for each
structure code (Table 2).</p>
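        <p>A sketch of this blueprint-labeling procedure is given below; the function name and assumed data layout are illustrative.</p>
        <preformat>
import numpy as np
from collections import Counter, defaultdict

def structure_blueprints(structure_indices, binary_recons, top_k=5):
    """Collect the most frequent 4x4x4 binary patterns per structure code.

    structure_indices: (N, 6, 6, 6) structure code indices per sample;
    binary_recons:     (N, 24, 24, 24) binarized reconstructions from stage 1.
    Returns {code: [(pattern, frequency), ...]} used as "blueprint" labels.
    """
    patterns = defaultdict(Counter)
    for codes, recon in zip(structure_indices, binary_recons):
        for i in range(6):
            for j in range(6):
                for k in range(6):
                    patch = recon[4*i:4*i+4, 4*j:4*j+4, 4*k:4*k+4]
                    patterns[int(codes[i, j, k])][patch.astype(np.uint8).tobytes()] += 1
    blueprints = {}
    for code, counter in patterns.items():
        blueprints[code] = [
            (np.frombuffer(raw, dtype=np.uint8).reshape(4, 4, 4), count)
            for raw, count in counter.most_common(top_k)
        ]
    return blueprints
        </preformat>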
        <p>Labeling the style codes follows a similar process, constructing a dictionary of style code indices and
corresponding 4 × 4 × 4 areas from the full reconstruction x̂.</p>
        <p>We compute the block-type distribution across all corresponding chunks. Based on this distribution, we
manually assign a text label that describes the expected block types for this style code. Where possible,
we use labels corresponding to Minecraft’s biome system. For example, codes with a majority of water
blocks are assigned the label “ocean”, while codes with a majority of sand blocks are assigned “desert”.</p>
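        <p>The corresponding sketch for style codes accumulates block-type histograms per code, from which a human then assigns the final text label; again, the names and data layouts are illustrative.</p>
        <preformat>
import numpy as np
from collections import defaultdict

def style_block_distributions(style_indices, full_recons, num_block_types=42):
    """Per style code, accumulate the distribution of decoded block types over
    all corresponding 4x4x4 regions.

    style_indices: (N, 6, 6, 6) style code indices;
    full_recons:   (N, 24, 24, 24) decoded block-id arrays.
    """
    counts = defaultdict(lambda: np.zeros(num_block_types, dtype=np.int64))
    for codes, recon in zip(style_indices, full_recons):
        for i in range(6):
            for j in range(6):
                for k in range(6):
                    patch = recon[4*i:4*i+4, 4*j:4*j+4, 4*k:4*k+4]
                    counts[int(codes[i, j, k])] += np.bincount(
                        patch.ravel(), minlength=num_block_types)
    # Normalize to probability distributions per code.
    return {c: v / v.sum() for c, v in counts.items()}
        </preformat>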
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experiments</title>
      <p>We first explore whether the inductive bias of our model design succeeds in learning a disentangled
representation. Fundamentally, we aim to learn a representation where style and structure are completely
disentangled. Ideally, the effects of modifying the two latent representations are fully disjoint: editing
the style codes causes a change in block types, but does not affect the topography, while changes to
structure codes change the terrain’s shape but leave block types unchanged.</p>
      <p>
        The lack of ground truth generative factors in our dataset prevents us from easily leveraging
supervised disentanglement metrics. We instead use unsupervised information-based disentanglement
metrics using mutual information to evaluate the modularity of our representations [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <sec id="sec-5-1">
        <title>5.1. Mutual Information</title>
        <p>
          Normalized Mutual Information (NMI) has been used in the field of DRL as a basis for both supervised
and unsupervised measures of disentanglement [
          <xref ref-type="bibr" rid="ref35 ref36">36, 35</xref>
          ].
        </p>
        <p>In the supervised setting, mutual information typically measures shared information between latents
and source generative factors. Due to the lack of ground truth factors, we rely on a simpler (and perhaps
less informative) formulation, computing the normalized mutual information between our two codebooks,
NMI(C_sty, C_str).</p>
        <p>A normalized mutual information score of 0 indicates that the two codebooks are completely
disentangled, while a 1 represents complete entanglement. As noted in Section 3.2, there are inherent
relationships between block types and their common topographies that make eliminating mutual
information impossible.</p>
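        <p>As a sketch, one way to compute this score is to treat the co-located style and structure code assignments across the encoded dataset as paired discrete labels and apply an off-the-shelf NMI implementation; the helper below, which uses scikit-learn, is illustrative rather than our exact evaluation code.</p>
        <preformat>
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def codebook_nmi(style_indices, structure_indices):
    """Normalized mutual information between co-located style and structure
    code assignments (0 = fully disentangled, 1 = fully entangled)."""
    sty = np.asarray(style_indices).ravel()
    strc = np.asarray(structure_indices).ravel()
    return normalized_mutual_info_score(sty, strc)
        </preformat>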
        <p>Our best model achieves an NMI of 0.1933. While this is low, indicating mostly disentangled features, it
is not zero, indicating some persistent entanglement between the features. Exploratory analysis indicates
this is largely due to a small set of highly entangled style and structure codes, often corresponding to
causal relationships inherent to Minecraft. For example, structure codes for solid chunks of blocks have
high mutual information with the “Stone” style, as the bottom half of our dataset largely consists of
solid chunks of stone blocks.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Qualitative analysis</title>
        <p>Figures 3 and 4 show the result of “style swapping” on chunks of Minecraft terrain, demonstrating
the disentanglement of style and structural information. By modifying the underlying style latent
representation of existing maps, we are able to realize an intuitive style swap behavior in Minecraft
voxels. We show that these latent representations are disjoint: modifications to style largely maintain
the geometric information of the original map.</p>
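        <p>As an illustrative sketch of such a latent-space edit (assuming an encode/decode interface on the trained model that is a simplification of our actual pipeline), style swapping can be expressed as follows.</p>
        <preformat>
import torch

def style_swap(model, source_chunk, donor_chunk):
    """Render the geometry of source_chunk with the block palette of
    donor_chunk by swapping style latents. The model is assumed to expose
    encode(x) returning (style_codes, structure_codes) and a matching
    decode(style_codes, structure_codes); these names are illustrative.
    """
    with torch.no_grad():
        _, structure_codes = model.encode(source_chunk)
        donor_style_codes, _ = model.encode(donor_chunk)
        # Decode the mixed latent: donor palette on the source geometry.
        return model.decode(donor_style_codes, structure_codes)
        </preformat>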
        <p>Figure 3 demonstrates an extreme case, using a single style code for the entirety of the map. We see
predictable results that correspond to our labeled style codebook, with the “beach” style code generating
sand blocks below water. While some style codes generate implausible results (like mountains made
out of leaves), it demonstrates that our model has learned a generalizable way of applying style codes
to unrelated voxel shapes, despite these homogeneous style encodings not appearing in the dataset.</p>
        <p>Figure 4 demonstrates a more fine-grained application, closer to the detailed editing capabilities we
wish to enable. We apply each style code in a spiral pattern, resulting in the map’s original style codes
appearing between unrelated style codes. Despite this, our model decodes these mixed latents in the
desired way, even generating spirals of land in the ocean.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Human-in-the-loop experiments</title>
        <p>We demonstrate how our learned representations enable human-in-the-loop interaction through a
series of small-scale human experiments. Figure 1 illustrates the concept used in these experiments; by
editing a map’s latent representation, we can make large visual or structural changes to a map with
just a few steps. By combining solid chunks (Structure code 12) and sloped chunks (Structure code
31) with the “grass” style code, we create a grassy mountain. Next, we stack tree-like structure codes
(Structure code 19) with the “tree” style to create a tall tree. We extend this concept into “design tasks”
and “generative tasks” to demonstrate the capabilities of this class of models.</p>
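        <p>The following sketch illustrates this kind of latent authoring; the structure code indices (12, 31, 19) follow the example above, while the style code indices, the grid coordinates, and the decode_from_indices interface are hypothetical placeholders rather than values from our labeled codebooks.</p>
        <preformat>
import torch

# "Building in miniature": manually authoring a 6x6x6 latent grid from labeled
# codes and decoding it into a 24x24x24 terrain chunk.
SOLID, SLOPE, TREE_TRUNK = 12, 31, 19   # structure codes named in the text
GRASS_STYLE, TREE_STYLE = 3, 7          # hypothetical style code indices

def author_grassy_mountain(model):
    structure = torch.full((1, 6, 6, 6), SOLID, dtype=torch.long)
    style = torch.full((1, 6, 6, 6), GRASS_STYLE, dtype=torch.long)
    structure[0, :, 4:, :] = SLOPE        # sloped chunks near the top (assumed axis order)
    structure[0, 5, 5, 3] = TREE_TRUNK    # stack a tree on the summit
    style[0, 5, 5, 3] = TREE_STYLE
    with torch.no_grad():
        return model.decode_from_indices(style, structure)
        </preformat>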
        <sec id="sec-5-3-1">
          <title>5.3.1. Design Tasks</title>
          <p>We showcase three tasks, designed to demonstrate the capabilities of the model for editing existing
Minecraft terrain via both style and structural modifications. These tasks were selected to represent
plausible edits to an existing map that a human designer (e.g. a developer or player) may wish to make
to an existing piece of the Minecraft game world.</p>
          <p>The first task, “Dig a tunnel”, represents a structure-based task. Figure 5 demonstrates how this can be
accomplished using only structure latent edits, while Figure 6 extends this to create a secret mountain
base by incorporating additional style edits.</p>
          <p>The second design task, “Add a river”, represents a style-based task. Figure 7 demonstrates how this
task can be accomplished using only style-latent edits, while Figure 8 shows how the task
can be creatively extended using a combination of style and structure edits to create a flowing waterfall.</p>
          <p>Our final design task is “Create a floating island”. This task requires both style and structure edits, and
also tests our model’s generalization, as floating islands are typically rare in generated Minecraft worlds.
Figure 9 shows two variations using the same starting map, where we edit the latent representation to
create a floating platform of land with a tree on top. We show a grass and desert variant of the floating
island.</p>
        </sec>
        <sec id="sec-5-3-2">
          <title>5.3.2. Generative Task</title>
          <p>While the design tasks showcase the ability for our model to edit existing Minecraft terrain (using
its encoded representations), we seek to push this further and generate entire maps from scratch. To
demonstrate this, we attempt to recreate existing terrain “blind” (i.e., not looking at the model’s true
latent representation of the map), using only our labeled style and structure codebooks and the visual
features of the map as reference. We first select two maps from the dataset that contain interesting
structure and style features, which serve as a ground truth for the experiment. Then, we attempt
to recreate these maps using entirely human-authored latent representations. We allow for multiple
refinement passes to get as close as possible to the target map in voxel space. Figures 10 and 11 show the results
of this experiment.</p>
          <p>In both cases, our manually constructed latents result in reconstructions that are close to the original
sample. To check whether this is just the result of memorization, we compare our constructed latent
to the map’s ground truth representation. The first map uses 14 unique style codes and 24 unique
structure codes, and the second terrain chunk uses 15 unique style codes and 28 unique structure codes.
In comparison, our constructed latents use 4 unique structure and 5 unique style codes for the first map,
and 4 unique structure and 3 unique style codes for the second.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <p>We proposed a Factorized Quantization VAE model for disentangling style and structure in 3D categorical
voxel worlds. Using simple self-supervised reconstruction objectives, our model learns a suitable
disentanglement of these concepts in the discrete learned representation space. Despite discarding
the typical assumption of statistical independence in the disentangled factors, we show that it is still
possible to learn a meaningful disentanglement for downstream applications.</p>
      <p>Unlike most VQVAE models, we achieve style-structure disentanglement with a comparatively tiny
codebook size. This enables a semantic labeling approach to construct a small, human-interpretable
codebook. Our assigned labels correspond to expected changes in the reconstructed output, allowing
for direct human intervention in the representation space to modify encoded maps. We demonstrate
that the labeled codebook we construct is interpretable enough to fully realize a creative vision by
interacting solely with the latent space. By essentially “building in miniature”, we are able to turn
a much more compact representation into a full 24³ chunk of Minecraft terrain, which can easily be
spawned into game worlds to enhance the gameplay experience. This human-centered application of
VQVAE models — generating content using humans to author the latent representation from scratch —
is to our knowledge not explored in existing research.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Limitations</title>
      <p>
        While we successfully demonstrate some use-cases, our small-scale qualitative results do not constitute
a robust exploration of our model’s generative ability. Early analysis of unsupervised DRL cast some
doubt on the usefulness of disentanglement for downstream learning tasks [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. While later work
presents a more optimistic perspective of disentangled representations, specifically on the axes of
explainability and generalization, we do not present any empirical results that prove our system’s
benefit for downstream generative tasks. A promising area of future work is to apply this disentangled
representation to a generative Latent Diffusion Model.
      </p>
      <p>
        Despite achieving good reconstruction quality, our model struggles with low-frequency blocks,
which often appear as decoration. Our dataset contains 42 unique block types, but our
style codes only broadly capture the small subset that most frequently appears in the game world. We
experiment with weighting lower-frequency block types, but this did not completely fix the issue. This
may become more problematic when training on datasets with many more block categories, such as
ones that capture the full set of Minecraft blocks. Existing work has explored Minecraft-specific block
embedding methods [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], which may provide a solution to this issue, though we leave experimentation
with embedding methods to future work.
      </p>
      <sec id="sec-7-1">
        <title>Index</title>
      </sec>
      <sec id="sec-7-2">
        <title>Declaration on Generative AI</title>
        <p>The author(s) have not employed any Generative AI tools.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Y</given-names>
            <surname>.-F. Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ahn</surname>
          </string-name>
          ,
          <article-title>Neural language of thought models</article-title>
          ,
          <source>arXiv preprint arXiv:2402.01203</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          , W. Zhu,
          <article-title>Disentangled representation learning</article-title>
          ,
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Edelman</surname>
          </string-name>
          , Representation is representation of similarities,
          <source>Behavioral and brain sciences 21</source>
          (
          <year>1998</year>
          )
          <fpage>449</fpage>
          -
          <lpage>467</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Elgammal</surname>
          </string-name>
          ,
          <article-title>Pivqgan: Posture and identity disentangled image-to-image translation via vector quantization (????).</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>F.</given-names>
            <surname>Locatello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lucic</surname>
          </string-name>
          , G. Raetsch,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Schölkopf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Bachem</surname>
          </string-name>
          ,
          <article-title>Challenging common assumptions in the unsupervised learning of disentangled representations</article-title>
          ,
          <source>in: international conference on machine learning, PMLR</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>4114</fpage>
          -
          <lpage>4124</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.-A.</given-names>
            <surname>Carbonneau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zaidi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Boilard</surname>
          </string-name>
          , G. Gagnon,
          <article-title>Measuring disentanglement: A review of metrics</article-title>
          ,
          <source>IEEE transactions on neural networks and learning systems 35</source>
          (
          <year>2022</year>
          )
          <fpage>8747</fpage>
          -
          <lpage>8761</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Courville</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Vincent</surname>
          </string-name>
          ,
          <article-title>Representation learning: A review and new perspectives</article-title>
          ,
          <source>IEEE transactions on pattern analysis and machine intelligence</source>
          <volume>35</volume>
          (
          <year>2013</year>
          )
          <fpage>1798</fpage>
          -
          <lpage>1828</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mandlekar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Anandkumar</surname>
          </string-name>
          ,
          <article-title>Voyager: An open-ended embodied agent with large language models</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2305.16291. arXiv:
          <volume>2305</volume>
          .
          <fpage>16291</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>W. H.</given-names>
            <surname>Guss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Houghton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Topin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Codel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Veloso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          ,
          <article-title>Minerl: a large-scale dataset of minecraft demonstrations</article-title>
          ,
          <source>in: Proceedings of the 28th International Joint Conference on Artificial Intelligence, IJCAI'19</source>
          , AAAI Press,
          <year>2019</year>
          , p.
          <fpage>2442</fpage>
          -
          <lpage>2448</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Earle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Kokkinos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Nie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Togelius</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Raileanu</surname>
          </string-name>
          , Dreamcraft:
          <article-title>Text-guided generation of functional 3d environments in minecraft, 2024</article-title>
          . URL: https://arxiv.org/abs/2404.15538. arXiv:
          <volume>2404</volume>
          .
          <fpage>15538</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>T.</given-names>
            <surname>Merino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Charity</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Togelius</surname>
          </string-name>
          ,
          <article-title>Interactive latent variable evolution for the generation of minecraft structures</article-title>
          ,
          <source>in: Proceedings of the 18th International Conference on the Foundations of Digital Games</source>
          , FDG '23,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA,
          <year>2023</year>
          . URL: https://doi.org/10.1145/3582437.3587208. doi:
          <volume>10</volume>
          .1145/3582437.3587208.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Awiszus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Schubert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Rosenhahn</surname>
          </string-name>
          , World-gan:
          <article-title>a generative model for minecraft worlds</article-title>
          ,
          <source>2021 IEEE Conference on Games (CoG)</source>
          (
          <year>2021</year>
          )
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          . URL: https://api.semanticscholar.org/CorpusID: 235485259.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nichol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Hierarchical text-conditional image generation with clip latents</article-title>
          ,
          <year>2022</year>
          . URL: https://arxiv.org/abs/2204.06125. arXiv:
          <volume>2204</volume>
          .
          <fpage>06125</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>R.</given-names>
            <surname>Rombach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Blattmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lorenz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Esser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ommer</surname>
          </string-name>
          ,
          <article-title>High-resolution image synthesis with latent diffusion models</article-title>
          ,
          <year>2022</year>
          . URL: https://arxiv.org/abs/2112.10752. arXiv:
          <volume>2112</volume>
          .
          <fpage>10752</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Duan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Houthooft</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schulman</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Sutskever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Abbeel</surname>
          </string-name>
          , Infogan:
          <article-title>Interpretable representation learning by information maximizing generative adversarial nets</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>29</volume>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Gatys</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Ecker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bethge</surname>
          </string-name>
          ,
          <article-title>A neural algorithm of artistic style</article-title>
          ,
          <source>arXiv preprint arXiv:1508.06576</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>T.</given-names>
            <surname>Karras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Laine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Aila</surname>
          </string-name>
          ,
          <article-title>A style-based generator architecture for generative adversarial networks</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>4401</fpage>
          -
          <lpage>4410</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>H.</given-names>
            <surname>Kazemi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Iranmanesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Nasrabadi</surname>
          </string-name>
          ,
          <article-title>Style and content disentanglement in generative adversarial networks</article-title>
          ,
          <source>in: 2019 IEEE Winter Conference on Applications of Computer Vision</source>
          (WACV), IEEE,
          <year>2019</year>
          , pp.
          <fpage>848</fpage>
          -
          <lpage>856</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kotovenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sanakoyeu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ommer</surname>
          </string-name>
          ,
          <article-title>Content and style disentanglement for artistic style transfer</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF international conference on computer vision</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>4422</fpage>
          -
          <lpage>4431</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <article-title>Generative image modeling using style and structure adversarial networks</article-title>
          ,
          <source>in: European conference on computer vision</source>
          , Springer,
          <year>2016</year>
          , pp.
          <fpage>318</fpage>
          -
          <lpage>335</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>C.</given-names>
            <surname>Trivedi</surname>
          </string-name>
          ,
          <article-title>Turning fortnite into pubg with deep learning (cyclegan)</article-title>
          ,
          <source>Towards Data Science</source>
          , available at: https://towardsdatascience.com/turning-fortnite-into-pubg-with-deep-learning-cyclegan2f9d339dcdb0 (Jun. 18, 2018) (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Trivedi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Makantasis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Liapis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. N.</given-names>
            <surname>Yannakakis</surname>
          </string-name>
          ,
          <article-title>Towards general game representations: Decomposing games pixels into content and style</article-title>
          ,
          <source>arXiv preprint arXiv:2307.11141</source>
          (
          <year>2023</year>
          ). URL: https://api.semanticscholar.org/CorpusID:260091544.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>T.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hammer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Menzel</surname>
          </string-name>
          ,
          <article-title>Voxel-based three-dimensional neural style transfer</article-title>
          ,
          <source>in: Advances in Computational Intelligence: 16th International Work-Conference on Artificial Neural Networks, IWANN 2021, Virtual Event, June 16-18, 2021, Proceedings, Part I 16</source>
          , Springer,
          <year>2021</year>
          , pp.
          <fpage>334</fpage>
          -
          <lpage>346</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>A.</given-names>
            <surname>Van Den Oord</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          , et al.,
          <article-title>Neural discrete representation learning</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>K.</given-names>
            <surname>Eaton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Balloch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Riedl</surname>
          </string-name>
          ,
          <article-title>The interpretability of codebooks in model-based reinforcement learning is limited</article-title>
          ,
          <source>arXiv preprint arXiv:2407.19532</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>R. R.</given-names>
            <surname>Selvaraju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Vedantam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cogswell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Parikh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Batra</surname>
          </string-name>
          ,
          <article-title>Grad-cam: Why did you say that? visual explanations from deep networks via gradient-based localization</article-title>
          ,
          <source>CoRR abs/1610.02391</source>
          (
          <year>2016</year>
          ). URL: http://arxiv.org/abs/1610.02391. arXiv:1610.02391.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>P.</given-names>
            <surname>Esser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rombach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ommer</surname>
          </string-name>
          ,
          <article-title>Taming transformers for high-resolution image synthesis</article-title>
          ,
          <year>2021</year>
          . URL: https://arxiv.org/abs/2012.09841. arXiv:2012.09841.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bond-Taylor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Hessey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sasaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. P.</given-names>
            <surname>Breckon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. G.</given-names>
            <surname>Willcocks</surname>
          </string-name>
          ,
          <article-title>Unleashing transformers: Parallel token prediction with discrete absorbing diffusion for fast high-resolution image generation from vector-quantized codes</article-title>
          ,
          <year>2021</year>
          . URL: https://arxiv.org/abs/2111.12701. arXiv:2111.12701.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>P.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <article-title>Autoregressive model beats diffusion: Llama for scalable image generation</article-title>
          ,
          <source>arXiv preprint arXiv:2406.06525</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Z.</given-names>
            <surname>Shou</surname>
          </string-name>
          ,
          <article-title>Factorized visual tokenization and generation</article-title>
          ,
          <source>arXiv preprint arXiv:2411.16681</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>A.</given-names>
            <surname>Staaij</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Preuss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Salge</surname>
          </string-name>
          ,
          <article-title>Terrain-adaptive pcgml in minecraft</article-title>
          ,
          <year>2024</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          . doi:10.1109/CoG60054.2024.10645652.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>T.</given-names>
            <surname>Merino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Charity</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Togelius</surname>
          </string-name>
          ,
          <article-title>Interactive latent variable evolution for the generation of minecraft structures</article-title>
          ,
          <source>in: Proceedings of the 18th International Conference on the Foundations of Digital Games</source>
          , FDG '23, Association for Computing Machinery, New York, NY, USA,
          <year>2023</year>
          . URL: https://doi.org/10.1145/3582437.3587208. doi:10.1145/3582437.3587208.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>C.</given-names>
            <surname>Salge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. C.</given-names>
            <surname>Green</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Canaan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Togelius</surname>
          </string-name>
          ,
          <article-title>Generative design in minecraft (gdmc) settlement generation competition</article-title>
          ,
          <source>in: Proceedings of the 13th International Conference on the Foundations of Digital Games</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>D.</given-names>
            <surname>Grbic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. B.</given-names>
            <surname>Palm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Najarro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Glanois</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Risi</surname>
          </string-name>
          ,
          <article-title>Evocraft: A new challenge for open-endedness</article-title>
          ,
          <source>in: Applications of Evolutionary Computation: 24th International Conference, EvoApplications 2021, Held as Part of EvoStar 2021, Virtual Event, April 7-9, 2021, Proceedings 24</source>
          , Springer,
          <year>2021</year>
          , pp.
          <fpage>325</fpage>
          -
          <lpage>340</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>G.</given-names>
            <surname>Baykal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kandemir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Unal</surname>
          </string-name>
          ,
          <article-title>Disentanglement with factor quantized variational autoencoders</article-title>
          ,
          <source>arXiv preprint arXiv:2409.14851</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>K.</given-names>
            <surname>Hsu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Dorrell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Whittington</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Finn</surname>
          </string-name>
          ,
          <article-title>Disentanglement via latent quantization</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>36</volume>
          (
          <year>2023</year>
          )
          <fpage>45463</fpage>
          -
          <lpage>45488</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>A.</given-names>
            <surname>Van Den Oord</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          , et al.,
          <article-title>Neural discrete representation learning</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>S.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <article-title>Vector quantized diffusion model for text-to-image synthesis</article-title>
          ,
          <source>in: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          , volume
          <volume>2</volume>
          ,
          <year>2021</year>
          , p.
          <fpage>4</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pavlov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Goh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Voss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Sutskever</surname>
          </string-name>
          ,
          <article-title>Zero-shot text-to-image generation</article-title>
          ,
          <year>2021</year>
          . URL: https://arxiv.org/abs/2102.12092. arXiv:2102.12092.
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          <string-name>
            <given-names>Most</given-names>
            <surname>Frequent Blocks + Frequency</surname>
          </string-name>
          (&gt;
          <volume>10</volume>
          %) Stone 1 Stone (
          <volume>95</volume>
          .07%) Beach Sand (
          <volume>68</volume>
          .48%),
          <source>Water (10</source>
          .6%) Ocean Water (
          <volume>95</volume>
          .71%) Stone 2 Stone (
          <volume>98</volume>
          .78%) Desert Sandstone (
          <volume>54</volume>
          .33%),
          <source>Sand</source>
          (
          <volume>32</volume>
          .81%) Gravel Stone Water Stone (
          <volume>54</volume>
          .68%),
          <source>Gravel</source>
          (
          <volume>22</volume>
          .77%),
          <source>Water</source>
          (
          <volume>14</volume>
          .81%) Plains Dirt (
          <volume>41</volume>
          .21%),
          <source>Grass</source>
          (
          <volume>36</volume>
          .38%),
          <source>Sand</source>
          (
          <volume>12</volume>
          .85%) Tree Leaves (
          <volume>53</volume>
          .49%),
          <source>Stone</source>
          (
          <volume>11</volume>
          .70%) Pond Water (
          <volume>65</volume>
          .75%),
          <source>Dirt</source>
          (
          <volume>18</volume>
          .32%),
          <source>Trees, and Sand Leaves</source>
          (
          <volume>23</volume>
          .62%),
          <source>Water</source>
          (
          <volume>18</volume>
          .70%),
          <source>Log</source>
          (
          <volume>18</volume>
          .44%),
          <source>Sand (18.42%) Stone 3 Stone</source>
          (
          <volume>92</volume>
          .70%) Dirt/Grass Dirt (
          <volume>67</volume>
          .68%),
          <source>Grass</source>
          (
          <volume>13</volume>
          .60%),
          <source>Stone</source>
          (
          <volume>11</volume>
          .28%) Stone/Dirt/Grass Stone (
          <volume>43</volume>
          .79%),
          <source>Dirt</source>
          (
          <volume>35</volume>
          .31%),
          <source>Grass</source>
          (
          <volume>17</volume>
          .05%) Stone/Dirt Stone (
          <volume>57</volume>
          .07%),
          <source>Dirt (35.66%) Stone 4 Stone</source>
          (
          <volume>95</volume>
          .07%) Stone / Sandstone Stone (
          <volume>64</volume>
          .24%),
          <source>Sandstone</source>
          (
          <volume>64</volume>
          .24%)
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>