<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Complex: Extending the Generative Capabilities of Attribute-Based Latent Space Regularization through AR-VAE-Diffusion</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Stephen James Krol</string-name>
          <email>stephen.krol@monash.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Abhinav Sood</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maria Teresa Llano</string-name>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Creativity Support Tool, Latent Space Regularisation, Diffusion Model</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>SensiLab, Monash University</institution>
          ,
          <addr-line>Caulfield East, Victoria 3145</addr-line>
          ,
          <country country="AU">Australia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Progress in deep learning has driven the development of diverse creativity support tools (CST) capable of producing a range of creative artefacts. However, deep generative models are not inherently controllable, posing challenges in their guidance and prompting research focused on incorporating control mechanisms into models. One such method, Attribute-Based Latent Space Regularisation (ALSR), has demonstrated notable controllability when implemented within an Attribute-Regularised Variational Autoencoder (AR-VAE) for music and simple image generation. However, ALSR's effectiveness is constrained by the generative capabilities of the AR-VAE, and it is unable to control generations of high-fidelity images. In this work, we add a Denoising Diffusion Probabilistic Model (DDPM) to the AR-VAE and demonstrate that the resulting AR-VAE-Diffusion model is capable of generating and controlling high-fidelity images, thus broadening the applicability of ALSR and providing a new pathway for introducing controllability into future deep learning CSTs.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        Deep generative modelling has seen significant improvements over the past 5 years, with
systems now producing realistic images from text [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], music with long-term structure [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and
poetry [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Although some of these systems have been designed to create independent of
human input [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], many have been designed as tools to aid in the creative process [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. These
tools are referred to as AI-based Creativity Support Tools (AI-CST) and have demonstrated
capabilities in various creative fields and in different stages of the creative process [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. However,
while deep learning has allowed machines to produce more complex generations, many early
deep generative models are limited by their lack of controllability [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ]. Although one could
argue that controllability is not an essential component for an effective Creativity Support Tool
(CST) (take, for example, Brian Eno's Oblique Strategies [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]), incorporating controllability into a
system can contribute to more tailored creative outcomes. This has led to research focused on
incorporating control mechanisms into deep learning models, enabling control over various
artifacts like images and music [
        <xref ref-type="bibr" rid="ref10 ref11 ref9">9, 10, 11</xref>
        ].
      </p>
      <p>
        One such technique is Attribute-Based Latent Space Regularization (ALSR) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] which is a
method for latent space disentanglement utilised in the training of a Variational Autoencoder
(VAE), a deep generative model capable of generating outputs across diverse domains [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
Compared to other similar techniques [
        <xref ref-type="bibr" rid="ref10 ref9">10, 9</xref>
        ], ALSR provides a simple formulation that introduces
controllability into a VAE by associating individual dimensions of the latent vector with specific
features of the generation without adversarial training. Additionally, ALSR provides more
flexibility in the type of attribute functions that can be encoded into the latent space. VAEs that
utilize this regularization during training are referred to as Attribute-Regularized Variational
Autoencoders (AR-VAE) and have demonstrated notable results in controlling the generations
of music [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and simple images [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. However, while the AR-VAE demonstrates impressive
controllability, it is limited by its generative capabilities for complex images, struggling to
produce detailed outputs and often generating blurry artifacts, a common limitation of
VAE-based models [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. This restricts the type of images that can be controlled using ALSR and thus
the scope of its application.
      </p>
      <p>
        In this work, we address this gap and demonstrate that ALSR can be used to control the
generations of complex images by incorporating a Denoising Diffusion Probabilistic Model
(DDPM) [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] into the AR-VAE architecture. We build upon work on VAE-Diffusion models
[
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] and demonstrate that controllability is maintained even with the incorporation of the
diffusion model. To showcase this, we utilise two different datasets. The first is the Curl Noise
dataset, which was created for this project and contains abstract flow-field images generated by
an agent-based line drawing system; the second is the Kaggle abstract art dataset [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] which
contains images of abstract paintings. Both datasets were selected to showcase how ALSR can
be utilised in a more artistic setting. This work widens the applicable scope of ALSR, making it
a plausible method for incorporating controllability of high fidelity images in AI-CSTs.
      </p>
    </sec>
    <sec id="sec-3">
      <title>2. Background</title>
      <sec id="sec-3-1">
        <title>2.1. AI-based Creativity Support Tools</title>
        <p>
          With the rise of novel AI algorithms, a new generation of AI-based Creativity Support Tools
(AI-CST) has been introduced to aid their use. Although the power of these tools has opened
up many possibilities for artistic creation, users often struggle with different aspects of the
interaction. Of relevance to this work is the difficulty of handling unpredictable outputs [
          <xref ref-type="bibr" rid="ref18 ref19">18, 19</xref>
          ]
and the lack of capabilities to explore the design space [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
        <p>
          AI-CSTs can vary in the type of tasks they perform and the domain of application, with tools
used for automating difficult or time-consuming tasks [
          <xref ref-type="bibr" rid="ref20 ref21">20, 21</xref>
          ], for aiding ideation [
          <xref ref-type="bibr" rid="ref22 ref23 ref24">22, 23, 24</xref>
          ] and
for content editing (for instance the in-painting or out-painting capabilities of Text-To-Image
(TTI) systems). Although these tools enable interactions with AI models, usually through
high-level representations of the output, the models behind them remain black boxes [
          <xref ref-type="bibr" rid="ref11 ref25">11, 25</xref>
          ].
This results in limited exploration capabilities, restricting the user’s ability to further develop
ideas and express their artistic intentions [
          <xref ref-type="bibr" rid="ref26 ref5">26, 5</xref>
          ].
        </p>
        <p>
          Although generative AI technologies exhibit creativity largely due to their unpredictable
nature, users often struggle to build upon the models’ outputs, as these tools fail to acknowledge creative
practice as a reflective process [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ]. While some creatives have found ways to work around
these limitations [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ], users still face difficulties using these systems [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ]. One path to enhancing the
use of, and interaction with, state-of-the-art generative AI models is the development of mechanisms
that allow exploration of the design space. We show in this work that the use of ALSR on the
VAE-Diffusion model provides a controllable mechanism at a more granular level (i.e. focusing
on specific dimensions of the latent space) while maintaining the quality of complex images.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>2.2. Deep Generative Modelling For Images</title>
        <p>
          Denoising Diffusion Probabilistic Models (DDPM) [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] are generative models that have recently
demonstrated impressive results in generating various artefacts [
          <xref ref-type="bibr" rid="ref1 ref30 ref31">1, 30, 31</xref>
          ]. The training of
a diffusion model involves a diffusion process, which successively adds noise to an input,
and a reverse-diffusion process, in which a model is trained to predict how much noise
should be removed at each step. During inference, noise is sampled and passed through the
reverse-diffusion process to produce an output.
        </p>
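        <p>The forward (noise-adding) half of this procedure can be sketched in a few lines. The sketch below is illustrative only: it assumes the commonly used closed-form DDPM forward step with a linear noise schedule, and the schedule end points, the number of steps and all function names are assumptions made for this example rather than settings taken from this work. The reverse process would train a network to predict the returned noise from the noised input.</p>
        <preformat><![CDATA[
```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, linearly spaced (assumed schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t: running product of (1 - beta_s) for all s up to t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def forward_noise(x0, t, alpha_bars, rng):
    """Sample x_t from x_0 in closed form; returns (x_t, noise)."""
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, noise)]
    return xt, noise

rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
x0 = [0.5, -0.25, 1.0]  # a toy "image" of three pixel values
xt, noise = forward_noise(x0, 999, alpha_bars, rng)  # almost pure noise at the last step
```
]]></preformat>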
        <p>
          Recent advances in language modelling [
          <xref ref-type="bibr" rid="ref1 ref32">32, 1</xref>
          ] have demonstrated that controllability can be added to diffusion
models through textual interfacing, commonly referred to as ‘prompting’. While
this method proves popular and effective for the general guidance of the generative process,
arguments have been made regarding the constraints of language as an interface and how it
limits creation, particularly in the realms of abstract art or technical designs [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ].
        </p>
        <p>
          Variational Autoencoders (VAE) [34] are generative models that are trained to compress
data into a probability distribution over latent vectors and to reconstruct the data from it. This, paired with a Kullback-Leibler (KL)
Divergence [35] penalty, allows one to sample latent vectors from a probability distribution and generate
outputs using the VAE’s decoder. A VAE is trained by passing an input image into the network
and using the reconstructed image to calculate a reconstruction loss, i.e., how well the model
could compress then recreate an artefact. Additionally, a KL-Divergence penalty is added to the
loss function to ensure that the latent vectors of the model, i.e., the compressed representations,
are aligned with a specified probability distribution. This allows users to sample from this
distribution and use the model to generate new artefacts. Compared to other generative models,
VAEs are easier to train and have a regularised latent space that allows for interpolations
between generations and the addition of control vectors. This approach has proven useful in
various creative domains, as seen with MusicVAE [36], which enabled users to navigate its
latent space to explore different melodies or drum beats. However, when compared to diffusion
models or generative adversarial networks (GANs) [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], VAEs often lack the generative capability
to produce highly detailed complex imagery. To address this, work has been done on combining
the VAE with a DDPM to take advantage of the VAE’s low-dimensional, interpretable latent
space, while still maintaining high-quality generations. This was first done in [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], where the
authors built a DiffuseVAE and showcased its generative performance on the CelebA-HQ 256
dataset [37]. In this work, we enhance the DiffuseVAE model by integrating it with an AR-VAE,
obtaining a more controllable model while maintaining the quality of the generated images.
        </p>
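        <p>As a rough illustration of the VAE objective described above, the following sketch combines a mean-squared-error reconstruction term with the closed-form KL divergence between a diagonal Gaussian posterior and a standard normal prior. It operates on plain lists for readability; the function names and the weighting parameter <monospace>beta</monospace> are assumptions for this example and do not reproduce the training code of this work.</p>
        <preformat><![CDATA[
```python
import math

def mse(x, x_recon):
    """Mean squared error between an input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_recon)) / len(x)

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))

def vae_loss(x, x_recon, mu, logvar, beta=1.0):
    """Reconstruction loss plus a weighted KL penalty (beta is an assumed weight)."""
    return mse(x, x_recon) + beta * kl_to_standard_normal(mu, logvar)
```
]]></preformat>
        <p>A posterior that already matches the prior (zero mean, zero log-variance) contributes no KL penalty, so the loss then reduces to the reconstruction error alone.</p>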
      </sec>
      <sec id="sec-3-3">
        <title>2.3. Disentangling the Latent Space for Controllability</title>
        <p>
          FaderNetworks [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] disentangle the latent space of an encoder-decoder model from specific
attributes of the images to produce controllability. Attributes are then applied using a conditional
vector of binary categorical attributes ranging from 0 to 1. Despite demonstrating controllability,
FaderNetworks were limited to categorical attributes that had clear upper and lower bounds,
making it difficult to apply this regularization to strictly continuous variables. Additionally,
FaderNetworks incorporated adversarial learning with a discriminator to disentangle the latent
space, adding an extra layer of complexity to training.
        </p>
        <p>[Figure: AR-VAE-Diffusion training. Stage 1 training: training image, VAE, latent vector z, attribute values, VAE reconstruction. Stage 2 training: training image, forward process to t steps, training image with noise, VAE reconstruction, reverse process U-Net.]</p>
        <p>
          Another way to add controllability to deep generative networks is to apply constraints to
the latent space. One method to do this is Attribute-Based Latent Space Regularization (ALSR)
presented in [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. Here, the authors add an extra regularization term to the loss function
of a VAE so that specific dimensions of the model’s latent vector correlate with attributes
of the generations. Increasing or decreasing the values of these dimensions would result in
the generations having more or less of these respective attributes. The authors refer to this
model as an Attribute-Regularized Variational Autoencoder (AR-VAE) and demonstrate that it
adds significant controllability to generations of the MNIST dataset [38] as well as generations
of monophonic measures of music [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. Additionally, compared to Geodesic Latent Space
Regularization (GLSR) [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], ALSR works with non-differentiable attribute functions, expanding
the possibilities of regularized attributes. However, despite the impressive controllability, these
models are restricted by the generative limitations of the VAE where outputted images are often
blurry and lack the fine resolution of other generative models [
          <xref ref-type="bibr" rid="ref15 ref6">6, 15</xref>
          ], justifying our approach
for an AR-VAE-Diffusion model.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. AR-VAE-Diffusion Model</title>
      <p>
        Our AR-VAE-Diffusion model is an extension of the DiffuseVAE [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], wherein the VAE now
incorporates a regularisation term that corresponds to the AR-VAE [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. This simple modification
allows for the introduction of attributes into a system with high image quality. We refer to the
introduced term <inline-formula><tex-math>\mathcal{L}_{\text{attr}}</tex-math></inline-formula> as the attribute loss, which is computed as per Algorithm 1.
      </p>
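      <p>Thus the modified training objective of the VAE in the VAE-Diffusion model is of the form:
        <disp-formula><tex-math>\mathcal{L}_{\text{AR-VAE}} = \mathcal{L}_{\text{recon}} + \beta \, \mathcal{L}_{\text{KLD}} + \gamma \, \mathcal{L}_{\text{attr}}</tex-math></disp-formula>
Here, <inline-formula><tex-math>\mathcal{L}_{\text{recon}}</tex-math></inline-formula> is the reconstruction loss, for which we use the mean squared error, <inline-formula><tex-math>\mathcal{L}_{\text{KLD}}</tex-math></inline-formula> is the
KL-Divergence between the VAE’s latent distributions and a standard normal distribution, and
<inline-formula><tex-math>\mathcal{L}_{\text{attr}}</tex-math></inline-formula> is the attribute-based regularization loss. More details regarding ALSR are described in the
AR-VAE paper [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. The use of ALSR introduces two new hyper-parameters during training, <inline-formula><tex-math>\gamma</tex-math></inline-formula>
and <inline-formula><tex-math>\delta</tex-math></inline-formula> (in Algorithm 1). <inline-formula><tex-math>\gamma</tex-math></inline-formula> can be used to specify the strength of the regularisation, and <inline-formula><tex-math>\delta</tex-math></inline-formula> is a
tunable hyperparameter to control the spread of the posterior.
      </p>
      <p>Algorithm 1 Computation of attribute loss for one mini-batch with m training examples</p>
      <preformat>Input:  Training examples x_1 … x_m
        Attribute values for the given training examples,
        where m is the size of our mini-batch and k is the number of attributes
Output: Attribute loss for the mini-batch

forEach attribute a_i do
    Z_i   ← [z_i ⋯ z_i]      ▷ stack z_i m times
    A_i   ← [a_i ⋯ a_i]      ▷ stack a_i m times
    D_z,i ← Z_i − Z_iᵀ
    D_a,i ← A_i − A_iᵀ
end forEach
L_attr ← ∑_i MAE(tanh(δ D_z,i) − sgn(D_a,i))</preformat>
      <p>Here z_i denotes the values of the latent dimension regularised for attribute a_i across the mini-batch, sgn is the sign function and MAE is the mean absolute error.</p>
      <p>
        In our model, only the reverse process of the DDPM is conditioned on the VAE
reconstructions. This is consistent with the first formulation presented in the original DiffuseVAE
paper [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>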
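      <p>For readers who prefer code to pseudocode, the attribute loss can be sketched as below. This is a minimal pure-Python illustration of the MAE(tanh(δ·D_z) − sgn(D_a)) form that appears in Algorithm 1, computed over all pairs in a mini-batch. The function names, and the convention that attribute k is tied to latent dimension k, are assumptions made for this example; the reference implementation lives in the project's code base.</p>
      <preformat><![CDATA[
```python
import math

def sign(v):
    """Sign function: -1, 0 or 1."""
    return (v > 0) - (v < 0)

def attribute_loss(latent_values, attribute_values, delta=1.0):
    """MAE(tanh(delta * D_z) - sgn(D_a)) over all pairs (i, j) in the mini-batch.

    latent_values:    value of the regularised latent dimension for each example
    attribute_values: value of the corresponding attribute for each example
    """
    m = len(latent_values)
    total = 0.0
    for i in range(m):
        for j in range(m):
            d_z = latent_values[i] - latent_values[j]
            d_a = attribute_values[i] - attribute_values[j]
            total += abs(math.tanh(delta * d_z) - sign(d_a))
    return total / (m * m)

def total_attribute_loss(latents, attributes, delta=1.0):
    """Sum of per-attribute losses; attribute k is tied to latent dimension k (assumed)."""
    num_attributes = len(attributes[0])
    return sum(
        attribute_loss([z[k] for z in latents], [a[k] for a in attributes], delta)
        for k in range(num_attributes)
    )
```
]]></preformat>
      <p>The loss is small when the ordering of the regularised latent dimension matches the ordering of the attribute across the batch, and large when the two orderings disagree.</p>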
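      <p>
        During inference, the process to control a generated output can be described as follows:
Let <inline-formula><tex-math>z \sim \mathcal{N}(0, I_n)</tex-math></inline-formula> be a vector sampled from a standard Gaussian distribution with <inline-formula><tex-math>n</tex-math></inline-formula> dimensions.
For each dimension <inline-formula><tex-math>r</tex-math></inline-formula> corresponding to some attribute <inline-formula><tex-math>a_r</tex-math></inline-formula>, the value in the vector <inline-formula><tex-math>z</tex-math></inline-formula> is modified as
        <disp-formula><tex-math>z[r] = z[r] + \Delta</tex-math></disp-formula>
where <inline-formula><tex-math>\Delta \in \mathbb{R}</tex-math></inline-formula> represents the change desired for the specific attribute. In practice, we found that
constraining the domain of <inline-formula><tex-math>\Delta</tex-math></inline-formula> yielded the best results. The domain used for each dataset was
determined through trial and error, by identifying the values of <inline-formula><tex-math>\Delta</tex-math></inline-formula> that resulted in
generated images that were drastically different from the original input. These values and other
hyperparameters are recorded in the code base and are based on suggestions from the original
DiffuseVAE paper [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] as well as trial and error. The modified vector <inline-formula><tex-math>\hat{z}</tex-math></inline-formula> is then fed into the VAE
decoder to generate an image. This can be defined as:
        <disp-formula><tex-math>\hat{x} = \mathrm{Decoder}(\hat{z})</tex-math></disp-formula>
      </p>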
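      <p>Subsequently, the reverse process of the DDPM is applied to generate the final output. This
involves conditioning on the VAE-generated image. More precisely, we get the generation <inline-formula><tex-math>x</tex-math></inline-formula> as:
        <disp-formula><tex-math>x = \mathrm{DDPM}(x_T \mid \hat{x}) \quad \text{where} \quad x_T \sim \mathcal{N}(0, I_{(h,w)})</tex-math></disp-formula>
Here, <inline-formula><tex-math>h</tex-math></inline-formula> and <inline-formula><tex-math>w</tex-math></inline-formula> represent the size of the images in the dataset the DDPM is trained on. The code from
this project is available here.<sup>1</sup></p>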
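      <p>The latent manipulation step of this inference procedure can be sketched as follows. This is a hedged illustration rather than the project's implementation: the function names, the 8-dimensional latent vector and the clamping range of (-3, 3) for Δ are assumptions chosen for the example, whereas the ranges actually used are recorded in the project's code base. Decoding with the VAE and the conditioned DDPM reverse process are omitted.</p>
      <preformat><![CDATA[
```python
import random

def sample_latent(num_dims, rng):
    """Draw z from a standard Gaussian with num_dims dimensions."""
    return [rng.gauss(0.0, 1.0) for _ in range(num_dims)]

def shift_attribute(z, dim, delta, delta_range=(-3.0, 3.0)):
    """Return a copy of z with z[dim] shifted by delta, clamped to delta_range.

    delta_range is a hypothetical domain for delta; the source determines
    the usable domain per dataset by trial and error.
    """
    low, high = delta_range
    clamped = max(low, min(high, delta))
    z_hat = list(z)
    z_hat[dim] += clamped
    return z_hat

rng = random.Random(0)
z = sample_latent(8, rng)
z_hat = shift_attribute(z, dim=0, delta=1.5)  # increase the attribute tied to dimension 0
# z_hat would next be decoded by the VAE, and the decoded image would
# condition the DDPM reverse process that produces the final output.
```
]]></preformat>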
    </sec>
    <sec id="sec-5">
      <title>4. Experiments</title>
      <p>To evaluate the AR-VAE-Diffusion model, two separate models were trained on two different
datasets to assess its capability in generating and controlling complex images. This section
details the datasets and image attributes used for training and provides an overview of the
metrics employed to evaluate disentanglement.</p>
      <sec id="sec-5-1">
        <title>4.1. Datasets</title>
        <p>The previous datasets used to evaluate ALSR were MNIST [38] and 2d-sprites [39], both of
which contain simple images and can be seen in the first two columns of Figure 2. Since the
AR-VAE has already proven its capability to control these basic images, we utilised two different
datasets to evaluate the performance of the AR-VAE-Diffusion model. Both of the datasets
were selected due to their complexity, which we define as the degree of intricacy and richness
present within an image, and which encompasses a range of factors such as details, patterns, textures
and colour variations. Both datasets are also abstract in order to simulate a more artistic model
and to test the model’s capability on attributes that are challenging to define. Additionally, the
attributes embedded within the latent space were chosen by the authors for their perceived
potential to produce interesting variations on generated outputs. A comparison of our datasets
vs the previous datasets can be seen in Figure 2.</p>
        <p><sup>1</sup> Project code: https://github.com/SensiLab/AR-VAE-Diffusion</p>
        <sec id="sec-5-1-1">
          <title>4.1.1. Curl Noise</title>
          <p>The Curl Noise dataset is a novel image dataset generated from an agent-based line drawing
system, named Curl Noise [40]. Curl Noise utilises 14 parameters that are used to produce
abstract complex images based on flow fields. We used the Curl Noise system to generate a
dataset of approximately 90000 designs of resolution 512x512.<sup>2</sup> After removing blank and
highly faded generations, we were left with approximately 68000 images for training and testing.
This dataset is available here [41].</p>
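          <p>The image attributes used to test controllability of generations from the Curl Noise dataset
are described below:</p>
          <p>Pixel density: defined as the quantity and intensity of pixels present in the image, and was
calculated as follows:
            <disp-formula><tex-math>\text{pixel density} = \frac{1}{N} \sum_{i=1}^{N} p_i</tex-math></disp-formula>
where <inline-formula><tex-math>p_i</tex-math></inline-formula> represents each pixel value, and <inline-formula><tex-math>N</tex-math></inline-formula> is the number of pixels in the image.</p>
          <p>Size: defined as the minimum enclosing circle of a thresholded, dilated and eroded image, and was
calculated as follows:
            <disp-formula><tex-math>\min r \quad \text{s.t.} \quad (x_i^2 + y_i^2) \le r \quad \forall\, x_i, y_i \in P</tex-math></disp-formula>
where <inline-formula><tex-math>r</tex-math></inline-formula> is the radius of the circle and <inline-formula><tex-math>P</tex-math></inline-formula> is the set of <inline-formula><tex-math>(x_i, y_i)</tex-math></inline-formula> points in the design. OpenCV was
utilised to both preprocess the image and calculate the size attribute. To ensure both attributes
have equal importance in the regularization of our variational autoencoder’s latent space, we
standardise both attribute values.</p>
          <p><sup>2</sup> This was done with the consent of the creator of the Curl Noise system, who has allowed distribution of this dataset
for non-commercial uses.</p>
        </sec>
        <sec id="sec-5-1-2">
          <title>4.1.2. Kaggle Abstract Art</title>
          <p>
            The Kaggle Abstract Art [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ] dataset contains 28820 512x512 RGB images of abstract art. The
image attributes used to control generations from the abstract art dataset are described below:
          </p>
          <p>Colour Diversity: defined as the approximate number of perceived colors in the image. We
created this metric specifically for analyzing the variety of colors in an abstract art piece. This
attribute is calculated as:
            <disp-formula><tex-math>f(c) = \operatorname{round}\left(\frac{c}{c_{\text{tolerance}}}\right) \times c_{\text{tolerance}}</tex-math></disp-formula>
            <disp-formula><tex-math>\forall (r_i, g_i, b_i) \in \text{image}: \ \text{add } (f(r_i), f(g_i), f(b_i)) \text{ to a set } S</tex-math></disp-formula>
            <disp-formula><tex-math>\text{color diversity}_{\text{image}} = |S|</tex-math></disp-formula>
where <inline-formula><tex-math>r_{\text{tolerance}} := 0.299 \times \text{tolerance}</tex-math></inline-formula>, <inline-formula><tex-math>g_{\text{tolerance}} := 0.587 \times \text{tolerance}</tex-math></inline-formula> and <inline-formula><tex-math>b_{\text{tolerance}} := 0.114 \times \text{tolerance}</tex-math></inline-formula>.</p>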
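          <p>The colour diversity attribute can be illustrated with a short sketch. It assumes the image is supplied as an iterable of (r, g, b) tuples with values in 0 to 255 and uses the tolerance of 35 reported in this section as the default. The function names are hypothetical, the assignment of the three luminance constants to the red, green and blue channels follows the usual BT.601 ordering (an assumption here), and Python's built-in <monospace>round</monospace> (which rounds halves to the nearest even integer) stands in for the rounding used in the project's code.</p>
          <preformat><![CDATA[
```python
def quantise(value, channel_tolerance):
    """Snap a channel value to the nearest multiple of its tolerance."""
    return round(value / channel_tolerance) * channel_tolerance

def colour_diversity(pixels, tolerance=35):
    """Count distinct quantised (r, g, b) triples in an iterable of pixels."""
    r_tol = 0.299 * tolerance  # per-channel tolerances scaled by luminance weights
    g_tol = 0.587 * tolerance
    b_tol = 0.114 * tolerance
    seen = set()
    for r, g, b in pixels:
        seen.add((quantise(r, r_tol), quantise(g, g_tol), quantise(b, b_tol)))
    return len(seen)

# Two near-identical dark pixels collapse into one colour; white stays separate.
pixels = [(0, 0, 0), (1, 1, 1), (255, 255, 255)]
num_colours = colour_diversity(pixels)
```
]]></preformat>
          <p>A larger tolerance merges more nearby colours into one group and therefore lowers the count; a very small tolerance approaches counting exact distinct pixel values.</p>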
          <p>Tolerance is a hyper-parameter that determines how close colors have to be for them to
be grouped as one. We set a tolerance of 35 to effectively group colors, determined through
trial and error and validated by visual inspection. The specific tolerance constants (0.299,
0.587, and 0.114) are borrowed from constants used in luminance calculation, following ITU-R
Recommendation BT.601 [42]. These are used to account for the perceived brightness of different
colors to the human eye. While alternative constants (0.2126, 0.7152, and 0.0722) as per
ITU-R Recommendation BT.709 were tested [43], they failed to produce as visually accurate outcomes.</p>
          <p>Structural Complexity: defined similarly to [44], structural complexity attempts to
measure how structurally complex an image is, modified to act as a proxy for the aesthetic
complexity of abstract art images [45]. Intuitively, this is measured through image compression:
an image that compresses to a smaller file is perceived as less structurally complex than
one that compresses to a larger file. More concretely, for a given image:</p>
          <list list-type="order">
            <list-item><p>Divide the image into patches.</p></list-item>
            <list-item><p>Bin the patches into 4 values based on mean intensity.</p></list-item>
            <list-item><p>Compute a compression ratio between the binned form and the original grayscale form
of the image.</p></list-item>
          </list>
          <p>A larger compression ratio means that the image is more difficult to compress, and thus is more
visually complex and vice versa. The specific implementation of this attribute, based on [40], is
available in our code.</p>
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>4.2. Disentanglement Metrics</title>
        <p>
          To investigate whether the AR-VAE on its own offers better controllability and disentanglement
of the latent space, along with a visual examination of our results, we utilize a variety of
disentanglement metrics as used in the AR-VAE paper [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] which have demonstrated practical value
in both the image and music domains. We summarize these metrics below. The implementation
of these metrics has been borrowed from [46]. For all the metrics, except interpretability, we
compute the mean across the attributes. We hold-out 20% of the dataset to compute the
disentanglement metrics for the Curl Noise dataset. For the Abstract Art dataset, as the number of
training images is already very limited, we compute the disentanglement metrics on the entire
dataset.
        </p>
        <p>Interpretability: Interpretability measures the existence of a simple linear probabilistic
relationship between a specified attribute and the latent space [47].</p>
        <p>Mutual Information Gap (MIG): Ideally, each attribute should only depend on one latent
dimension. The Mutual Information Gap (MIG) helps us assess this property by computing the
difference between the top two latent dimensions that have maximal mutual information with
respect to a given attribute [48].</p>
        <p>Modularity: Modularity measures if each latent dimension encodes information on only a
single attribute. This is done by calculating the deviation from an idealized scenario where each
latent dimension has high mutual information with one attribute and zero mutual information
with respect to all other attributes [49]. High deviations imply that the latent space is not very
modular.</p>
        <p>[Table 1: Disentanglement metrics for the Beta-VAE and the AR-VAE. Rows: Interpretability (Pixel Density, Size, Structural Complexity, Color Diversity), Modularity, Mutual Information Gap, Separated Attribute Predictability, Spearman Correlation Coefficient.]</p>
        <p>Separated Attribute Predictability (SAP): Much like MIG, SAP computes the difference
between the top two latent dimensions that have a maximal R² score (for continuous attributes)
with respect to a given attribute [50].</p>
        <p>Spearman Correlation Coefficient (SCC) Score: The Spearman Correlation Coefficient
represents the degree to which the relationship between two variables can be explained by a
monotonic function. The maximum value of the Spearman Correlation Coefficient between an
attribute and each of the latent dimensions is the SCC score [50].</p>
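        <p>To make the SCC score concrete, the sketch below computes Spearman correlations from scratch (ranking with averaged ties, then Pearson correlation of the ranks) and takes the maximum over latent dimensions. It is an illustrative stand-in, not the borrowed metric implementation used in this work: the function names are hypothetical, and taking the absolute value of each correlation before the maximum is an assumption of this example.</p>
        <preformat><![CDATA[
```python
def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            result[order[k]] = mean_rank
        start = end + 1
    return result

def pearson(a, b):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / (var_a * var_b) ** 0.5

def spearman(a, b):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))

def scc_score(latents, attribute):
    """Maximum absolute Spearman correlation between an attribute and any latent dimension."""
    num_dims = len(latents[0])
    return max(abs(spearman([z[d] for z in latents], attribute)) for d in range(num_dims))
```
]]></preformat>
        <p>A score close to 1 indicates that at least one latent dimension varies monotonically with the attribute across the evaluated examples.</p>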
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Results</title>
      <sec id="sec-6-1">
        <title>5.1. Disentanglement Metrics</title>
        <p>The disentanglement metrics for both a standard Beta-VAE and our AR-VAE can be seen in
Table 1. For the Curl Noise dataset, our AR-VAE outperforms the standard Beta-VAE in all
disentanglement metrics. There is a noticeable difference between the controllability of pixel
density and size as seen when comparing the two interpretability metrics. This can also be seen
visually when comparing the controllability of Figures 4a and 4b.</p>
        <p>For the Abstract Art Dataset, the interpretability metric for the AR-VAE is higher than
that of the Beta-VAE. Between the two attributes, structural complexity displays a higher
interpretability value. Among the other metrics, the results between the two models are very
similar; this is discussed further in Section 6.3. For Separated Attribute Predictability
and the Spearman Correlation Coefficient, the Beta-VAE displays slightly larger disentanglement
scores, whereas the AR-VAE outperforms the Beta-VAE on Modularity and Mutual Information Gap.</p>
      </sec>
      <sec id="sec-6-2">
        <title>5.2. Controllability of Generations</title>
        <p>Generations from the VAE-Diffusion model can be seen in Figure 3. As expected, when
comparing the generations of the AR-VAE (top) to the generations of the VAE-Diffusion model (bottom),
the generations of the VAE-Diffusion model are of higher quality, with images containing
significantly more detail. Moreover, Figure 4 illustrates that even with the incorporation of
the diffusion model, image attributes still remain controllable, with changes in attributes still
maintaining the original essence of the image. More examples can be found here.<sup>3</sup> The level of
controllability varies noticeably across different attributes. For instance, the controllability of
pixel density is visually more apparent than that of colour diversity. Increasing attributes can
also have an unexpected effect on the resulting design, as shown in Figure 4b, where increasing
size resulted in a hazy exterior being added to the second design.</p>
        <sec id="sec-6-2-1">
          <title>3 See GitHub</title>
        </sec>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>6. Discussion</title>
      <p>In this work, we demonstrated that ALSR can be applied to complex data using an
AR-VAE-Diffusion model, overcoming the original limitation of the AR-VAE model, which was unable to
generate detailed outputs, as seen in the top row of Figure 3, where generations are blurry and
undefined. We showcased the proficiency of the model to not only generate high fidelity images
from two distinct datasets, but also the ability to control the generations using four different
attributes (as shown in Figures 3 and 4). This broadens the scope of ALSR, unlocking the potential
for novel AI-CSTs that empower users to manipulate more complex images. Additionally,
compared to text-based interfaces, this method allows for fine-grained controllability of specific
attributes that can be specified by the developer.</p>
      <p>The AR-VAE-Diffusion model also offers a more straightforward training approach compared
to FaderNetworks, which employ adversarial training to disentangle attributes. This mitigates
potential problems associated with adversarial convergence and mode collapse [51], making
the development of powerful controllable models easier.</p>
      <sec id="sec-7-1">
        <title>6.1. Implications for AI-CSTs</title>
        <p>
          An important aspect of AI-CSTs is to leverage unpredictable behaviours that can surprise
and inspire the user [
          <xref ref-type="bibr" rid="ref20 ref5">20, 5</xref>
          ]. Although the objective of the AR-VAE-Diffusion model is to add
controllability to the process, the model still retains some agency in the creative process when
applying the attribute manipulations. This is illustrated by the examples in Figure 4, where it
can be seen how the new generations introduce new elements to the images while still keeping
the essence of previous generations. How much the new generations deviate from the previous
ones depends highly on the formulation of the attributes.
        </p>
        <p>Additionally, the flexibility of ALSR gives developers the freedom to explore and
encode different attribute functions, and with them different creative possibilities. For instance,
the formulation of structural complexity followed here is correlated with a measure of aesthetic
complexity identified in [45]; however, this attribute could take other forms. To illustrate,
structural complexity could instead be defined by the amount of white space in an image, the
diversity of shapes within it, or the presence of repeated patterns. This work also has promise
beyond high-fidelity image generation, in fields such as controllable music generation [52] and
3D modelling [53], both of which have shown promise with diffusion models.</p>
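        <p>As a minimal sketch of one of the alternative formulations mentioned above, the following Python function scores an image by how little white space it contains. The function names, the greyscale grid representation and the 0.95 whiteness threshold are illustrative assumptions and are not part of the method described in this paper.</p>
        <preformat>
```python
from typing import Sequence


def white_space_fraction(image: Sequence[Sequence[float]], threshold: float = 0.95) -> float:
    """Fraction of pixels whose intensity is at or above `threshold`.

    `image` is a 2D grid of greyscale intensities in [0, 1]. Both this
    representation and the default threshold are illustrative assumptions.
    """
    total = sum(len(row) for row in image)
    if total == 0:
        raise ValueError("image must contain at least one pixel")
    white = sum(1 for row in image for value in row if value >= threshold)
    return white / total


def complexity_from_white_space(image: Sequence[Sequence[float]], threshold: float = 0.95) -> float:
    """Hypothetical complexity attribute: less white space gives a higher score."""
    return 1.0 - white_space_fraction(image, threshold)
```
        </preformat>
        <p>Any scalar function of this kind could, in principle, be used as an attribute; how well the latent space can be regularised against it would still need to be checked empirically.</p>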
        <p>
          Finally, the nature of the technique allows for formulations of attributes that may not have
been possible with other methods [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], providing more flexibility in the type of attribute
functions that can be encoded.
        </p>
      </sec>
      <sec id="sec-7-2">
        <title>6.2. Considerations for Attribute Selection</title>
        <p>Our experiments revealed that some attributes perform better than others. For example, in Figure
4, changes in pixel density are more obvious than changes in colour diversity, likely due to the
complexity of the colour diversity function. If an AR-VAE-Diffusion model is intended to
offer visually predictable control, its attribute functions must therefore be carefully defined.
For example, in Figure 4a, as the size dimension is increased in the second design, the generations
begin to grow a fuzzy exterior. While this may not have been predictable, it is still consistent
with how size was defined in Section 4.1, namely as the minimum enclosing circle around the
design: a diffuse fringe enlarges that circle just as a larger solid shape would.</p>
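        <p>To make the size example concrete, the sketch below computes the smallest circle enclosing the foreground pixels of a binary mask, using a standard randomised incremental construction. Reporting the radius as the size value, the mask representation and all function names are assumptions made for illustration; Section 4.1 specifies only that size is based on the minimum enclosing circle.</p>
        <preformat>
```python
import math
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Circle = Tuple[float, float, float]  # (centre_x, centre_y, radius)


def _circle_from_two(a: Point, b: Point) -> Circle:
    # Smallest circle through two points: the segment is its diameter.
    return ((a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0, math.dist(a, b) / 2.0)


def _circle_from_three(a: Point, b: Point, c: Point) -> Circle:
    # Circumcircle of a triangle; collinear points fall back to the widest pair.
    d = 2.0 * (a[0] * (b[1] - c[1]) + b[0] * (c[1] - a[1]) + c[0] * (a[1] - b[1]))
    if d == 0.0:
        pairs = [(a, b), (a, c), (b, c)]
        return _circle_from_two(*max(pairs, key=lambda pair: math.dist(*pair)))
    a2, b2, c2 = (p[0] ** 2 + p[1] ** 2 for p in (a, b, c))
    ux = (a2 * (b[1] - c[1]) + b2 * (c[1] - a[1]) + c2 * (a[1] - b[1])) / d
    uy = (a2 * (c[0] - b[0]) + b2 * (a[0] - c[0]) + c2 * (b[0] - a[0])) / d
    return (ux, uy, math.dist((ux, uy), a))


def _contains(circle: Circle, p: Point, eps: float = 1e-9) -> bool:
    return circle[2] + eps >= math.dist((circle[0], circle[1]), p)


def min_enclosing_circle(points: Sequence[Point]) -> Circle:
    """Smallest enclosing circle of a non-empty point set."""
    pts: List[Point] = list(points)
    if not pts:
        raise ValueError("at least one point is required")
    random.Random(0).shuffle(pts)  # fixed seed keeps the sketch deterministic
    circle = (pts[0][0], pts[0][1], 0.0)
    for i, p in enumerate(pts):
        if _contains(circle, p):
            continue
        circle = (p[0], p[1], 0.0)  # p must lie on the boundary
        for j in range(i):
            q = pts[j]
            if _contains(circle, q):
                continue
            circle = _circle_from_two(p, q)  # p and q on the boundary
            for k in range(j):
                if not _contains(circle, pts[k]):
                    circle = _circle_from_three(p, q, pts[k])
    return circle


def size_attribute(mask: Sequence[Sequence[int]]) -> float:
    """Radius of the smallest circle enclosing all non-zero pixels of `mask`."""
    points = [(float(x), float(y))
              for y, row in enumerate(mask)
              for x, value in enumerate(row) if value]
    if not points:
        return 0.0
    return min_enclosing_circle(points)[2]
```
        </preformat>
        <p>Under this definition, a handful of faint pixels far from the main shape enlarges the circle just as much as a solid extension would, which is consistent with the fuzzy exteriors observed when the size dimension is increased.</p>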
        <p>The importance of attribute selection is illustrated by the difference in the disentanglement
metrics (see Table 1) between our two datasets. The attributes used in the Abstract Art dataset
are considerably more complex than those used in the Curl Noise dataset, and the
disentanglement metrics show that the resulting latent
space is less disentangled. Nevertheless, the interpretability metrics indicate that, in contrast
to a Beta-VAE, the AR-VAE still maintains a stronger linear probabilistic relationship between
the attributes of interest and the latent space. Thus, by substantially increasing or decreasing
the values of the latent dimensions that correspond to each attribute, we can control
the attributes even though the latent space is less disentangled. Despite the complexity of the
attributes, training still yields relatively controllable output, as is evident from the visual generations.</p>
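        <p>The manipulation described above, shifting only the latent dimension tied to one attribute while leaving the others unchanged, can be sketched as follows. The function name, the offset values and the idea of passing each shifted vector to a generator are hypothetical; they stand in for whatever interface a concrete AR-VAE-Diffusion implementation exposes.</p>
        <preformat>
```python
from typing import List, Sequence


def traverse_latent_dimension(z: Sequence[float], dim: int,
                              offsets: Sequence[float]) -> List[List[float]]:
    """Return one copy of `z` per offset, with the offset added to coordinate `dim`.

    `dim` is assumed to be the latent dimension regularised against the
    attribute of interest; all other coordinates are left untouched.
    """
    if dim not in range(len(z)):
        raise IndexError("dim must index a coordinate of z")
    variants = []
    for offset in offsets:
        shifted = list(z)  # copy so the original latent code is not modified
        shifted[dim] += offset
        variants.append(shifted)
    return variants


# Example: push a (hypothetical) size dimension strongly down and up.
latent = [0.2, -0.1, 0.7, 0.0]
candidates = traverse_latent_dimension(latent, dim=2, offsets=[-4.0, 0.0, 4.0])
```
        </preformat>
        <p>Each vector in the resulting list would then be passed through the generative part of the model to obtain one image per attribute setting. Large offsets reflect the observation above that the values need to be changed substantially when the latent space is less disentangled.</p>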
      </sec>
      <sec id="sec-7-3">
        <title>6.3. Limitations of the Approach</title>
        <p>Although the AR-VAE-Diffusion model significantly improves the generative capabilities of the
AR-VAE, it adds computational cost in both training and inference, potentially limiting
its applicability in real-world applications for users with limited computational
resources. However, as demonstrated in Figure 3, without the addition of the diffusion model
the AR-VAE is unable to produce high-fidelity generations of complex images, which justifies the
increased computational cost of the AR-VAE-Diffusion model. Additionally, many modern
creative tools, such as Photoshop or Ableton, already require substantial compute resources,
making the computational demands of the AR-VAE-Diffusion model less prohibitive within
certain professional contexts.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>7. Conclusion</title>
      <p>In this work, we demonstrated that ALSR can be applied to more complex images through the
use of an AR-VAE-Diffusion model. This extends the applicable scope of ALSR, making it
a plausible method for controlling the generation of high-fidelity images. Additionally, the
flexibility of ALSR provides new opportunities for developers to build deep-learning-based
AI-CSTs that offer control over a wide range of attributes. Future work will involve
testing how different forms of attribute regularisation in the AR-VAE-Diffusion model affect
the level of controllability in the model.</p>
      <p>prompts really making art?, in: Artificial Intelligence in Music, Sound, Art and Design: 12th International Conference, EvoMUSART 2023, Held as Part of EvoStar 2023, Brno, Czech Republic, April 12–14, 2023, Proceedings, Springer, 2023, pp. 196–211.</p>
      <p>[34] D. P. Kingma, M. Welling, Auto-encoding variational bayes, 2022. arXiv:1312.6114.</p>
      <p>[35] S. Kullback, R. A. Leibler, On information and sufficiency, The annals of mathematical statistics 22 (1951) 79–86.</p>
      <p>[36] A. Roberts, J. Engel, C. Raffel, C. Hawthorne, D. Eck, A hierarchical latent vector model for learning long-term structure in music, in: International Conference on Machine Learning (ICML), 2018. URL: http://proceedings.mlr.press/v80/roberts18a.html.</p>
      <p>[37] T. Karras, T. Aila, S. Laine, J. Lehtinen, Progressive growing of gans for improved quality, stability, and variation, arXiv preprint arXiv:1710.10196 (2017).</p>
      <p>[38] L. Deng, The mnist database of handwritten digit images for machine learning research, IEEE Signal Processing Magazine 29 (2012) 141–142.</p>
      <p>[39] L. Matthey, I. Higgins, D. Hassabis, A. Lerchner, dsprites: Disentanglement testing sprites dataset, https://github.com/deepmind/dsprites-dataset/, 2017.</p>
      <p>[40] J. McCormack, C. Cruz Gambardella, S. Krol, Creative discovery using quality-diversity search, in: Proceedings of the Companion Conference on Genetic and Evolutionary Computation, 2023, pp. 747–750.</p>
      <p>[41] S. Krol, J. McCormack, A. Sood, Curl noise 90000 dataset, 2024. URL: https://bridges.monash.edu/articles/dataset/CURL_NOISE_90000_DATASET/26868943. doi:10.26180/26868943.</p>
      <p>[42] ITU-R, Recommendation itu-r bt.601-7, 2011. https://www.itu.int/dms_pubrec/itu-r/rec/bt/R-REC-BT.601-7-201103-I!!PDF-E.pdf.</p>
      <p>[43] ITU-R, Recommendation itu-r bt.709-6, 2011. https://www.itu.int/dms_pubrec/itu-r/rec/bt/R-REC-BT.709-6-201506-I!!PDF-E.pdf.</p>
      <p>[44] S. Lakhal, A. Darmon, J.-P. Bouchaud, M. Benzaquen, Beauty and structural complexity, Phys. Rev. Res. 2 (2020) 022058. URL: https://link.aps.org/doi/10.1103/PhysRevResearch.2.022058. doi:10.1103/PhysRevResearch.2.022058.</p>
      <p>[45] J. McCormack, C. Cruz Gambardella, Quality-diversity for aesthetic evolution, in: International Conference on Computational Intelligence in Music, Sound, Art and Design (Part of EvoStar), Springer, 2022, pp. 369–384.</p>
      <p>[46] A. Pati, A. Lerch, Is disentanglement enough? on latent representations for controllable music generation, in: 22nd International Society for Music Information Retrieval Conference (ISMIR), Online, 2021.</p>
      <p>[47] T. Adel, Z. Ghahramani, A. Weller, Discovering interpretable representations for both deep generative and discriminative models, in: J. Dy, A. Krause (Eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, PMLR, 2018, pp. 50–59. URL: https://proceedings.mlr.press/v80/adel18a.html.</p>
      <p>[48] R. T. Q. Chen, X. Li, R. B. Grosse, D. K. Duvenaud, Isolating sources of disentanglement in variational autoencoders, in: S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 31, Curran Associates, Inc., 2018. URL: https://proceedings.neurips.cc/paper/2018/file/1ee3dfcd8a0645a25a35977997223d22-Paper.pdf.</p>
      <p>[49] K. Ridgeway, M. C. Mozer, Learning deep disentangled embeddings with the f-statistic loss, in: S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 31, Curran Associates, Inc., 2018. URL: https://proceedings.neurips.cc/paper/2018/file/2b24d495052a8ce66358eb576b8912c8-Paper.pdf.</p>
      <p>[50] A. Kumar, P. Sattigeri, A. Balakrishnan, Variational inference of disentangled latent concepts from unlabeled observations, in: 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, OpenReview.net, 2018. URL: https://openreview.net/forum?id=H1kG7GZAW.</p>
      <p>[51] D. Saxena, J. Cao, Generative adversarial networks (gans) challenges, solutions, and future directions, ACM Computing Surveys (CSUR) 54 (2021) 1–42.</p>
      <p>[52] G. Mittal, J. Engel, C. G.-M. Hawthorne, I. Simon, Symbolic music generation with diffusion models, 2021. URL: https://arxiv.org/abs/2103.16091.</p>
      <p>[53] S. Luo, W. Hu, Diffusion probabilistic models for 3d point cloud generation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 2837–2845.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nichol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Hierarchical text-conditional image generation with clip latents</article-title>
          , arXiv.org (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.-Z. A.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hawthorne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Hoffman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Eck</surname>
          </string-name>
          ,
          <article-title>Music transformer: Generating music with long-term structure</article-title>
          , arXiv preprint arXiv:1809.04281 (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Rajcic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>McCormack</surname>
          </string-name>
          ,
          <article-title>Mirror ritual: An affective interface for emotional self-reflection</article-title>
          ,
          <source>in: Conference on Human Factors in Computing Systems - Proceedings, CHI '20</source>
          , ACM, Ithaca,
          <year>2020</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>13</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>McCormack</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gifford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Hutchings</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Llano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yee-King</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>d'Inverno</surname>
          </string-name>
          ,
          <article-title>In a silent way: Communication between ai and improvising musicians beyond sound</article-title>
          , ACM, New York, NY,
          <year>2019</year>
          . doi:https://doi.org/10.1145/3290605.3300268, paper No.
          <volume>38</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A. H.-C.</given-names>
            <surname>Hwang</surname>
          </string-name>
          ,
          <article-title>Too late to be creative? ai-empowered tools in creative processes</article-title>
          ,
          <source>in: Extended Abstracts of the 2022 CHI Conference on Human Factors in Computing Systems, CHI EA '22</source>
          ,
          Association for Computing Machinery,
          <year>2022</year>
          . doi:10.1145/3491101.3503549.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>I.</given-names>
            <surname>Goodfellow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pouget-Abadie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mirza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Warde-Farley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ozair</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Courville</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <article-title>Generative adversarial networks</article-title>
          ,
          <source>Communications of the ACM</source>
          <volume>63</volume>
          (
          <year>2020</year>
          )
          <fpage>139</fpage>
          -
          <lpage>144</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Welling</surname>
          </string-name>
          ,
          <article-title>Auto-encoding variational bayes</article-title>
          ,
          <year>2013</year>
          . URL: https://arxiv.org/abs/1312.6114. doi:10.48550/ARXIV.1312.6114.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B.</given-names>
            <surname>Eno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <article-title>Oblique strategies</article-title>
          , Opal. (Limited edition, boxed set of cards.) [rMAB] (
          <year>1975</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Lample</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Zeghidour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Usunier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bordes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Denoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ranzato</surname>
          </string-name>
          ,
          <article-title>Fader networks: Manipulating images by sliding attributes</article-title>
          ,
          <year>2017</year>
          . URL: https://arxiv.org/abs/1706.00409. doi:10.48550/ARXIV.1706.00409.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>G.</given-names>
            <surname>Hadjeres</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Nielsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Pachet</surname>
          </string-name>
          ,
          <article-title>Glsr-vae: Geodesic latent space regularization for variational autoencoder architectures</article-title>
          ,
          <year>2017</year>
          . URL: https://arxiv.org/abs/1707.04588. doi:10.48550/ARXIV.1707.04588.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>N.</given-names>
            <surname>Bryan-Kinns</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Banar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. N.</given-names>
            <surname>Reed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Colton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Armitage</surname>
          </string-name>
          ,
          <article-title>Exploring XAI for the arts: Explaining latent space in generative music</article-title>
          ,
          <source>in: eXplainable AI approaches for debugging and diagnosis</source>
          ,
          <year>2021</year>
          . URL: https://openreview.net/forum?id=GLhY_0xMLZr.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Pati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lerch</surname>
          </string-name>
          ,
          <article-title>Attribute-based Regularization of Latent Spaces for Variational AutoEncoders</article-title>
          ,
          <source>Neural Computing and Applications</source>
          (
          <year>2020</year>
          ). URL: https://doi.org/10.1007/s00521-020-05270-2. doi:10.1007/s00521-020-05270-2.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Welling</surname>
          </string-name>
          ,
          <article-title>Auto-encoding variational bayes</article-title>
          ,
          <source>arXiv preprint arXiv:1312.6114</source>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Mobin</surname>
          </string-name>
          ,
          <article-title>A comparative study on variational autoencoders and generative adversarial networks</article-title>
          ,
          <source>in: 2019 International Conference of Artificial Intelligence and Information Technology (ICAIIT)</source>
          , IEEE,
          <year>2019</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Abbeel</surname>
          </string-name>
          ,
          <article-title>Denoising diffusion probabilistic models</article-title>
          ,
          <year>2020</year>
          . URL: https://arxiv.org/abs/2006.11239. doi:10.48550/ARXIV.2006.11239.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>K.</given-names>
            <surname>Pandey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mukherjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>Diffusevae: Efficient, controllable and high-fidelity generation from low-dimensional latents</article-title>
          ,
          <year>2022</year>
          . URL: https://arxiv.org/abs/2201.00308. doi:10.48550/ARXIV.2201.00308.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>G.</given-names>
            <surname>Ogden</surname>
          </string-name>
          ,
          <article-title>Abstract art</article-title>
          ,
          <year>2022</year>
          . URL: https://www.kaggle.com/datasets/goprogram/abstract-art/data.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>G.</given-names>
            <surname>Dove</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Halskov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Forlizzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zimmerman</surname>
          </string-name>
          ,
          <article-title>Ux design innovation: Challenges for working with machine learning as a design material</article-title>
          ,
          <source>in: Proceedings of the 2017 chi conference on human factors in computing systems</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>278</fpage>
          -
          <lpage>288</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>L. E.</given-names>
            <surname>Holmquist</surname>
          </string-name>
          ,
          <article-title>Intelligence on tap: artificial intelligence as a new design material</article-title>
          ,
          <source>interactions</source>
          <volume>24</volume>
          (
          <year>2017</year>
          )
          <fpage>28</fpage>
          -
          <lpage>33</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>J. J. Y.</given-names>
            <surname>Chung</surname>
          </string-name>
          ,
          <article-title>Artistic user expressions in ai-powered creativity support tools</article-title>
          ,
          <source>in: Adjunct Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology, UIST '22 Adjunct</source>
          , Association for Computing Machinery,
          <year>2022</year>
          . URL: https://doi.org/10.1145/3526114.3558531. doi:10.1145/3526114.3558531.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>C.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. J. Y.</given-names>
            <surname>Chung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kiheon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gingold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Adar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Hong</surname>
          </string-name>
          ,
          <article-title>Flatmagic: Improving flat colorization through ai-driven design for digital comic professionals</article-title>
          ,
          <source>in: Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>17</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Jaiswal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Badr</surname>
          </string-name>
          ,
          <article-title>Towards an artificial intelligence aided design approach: application to anime faces with generative adversarial networks</article-title>
          ,
          <source>Procedia Computer Science</source>
          <volume>168</volume>
          (
          <year>2020</year>
          )
          <fpage>57</fpage>
          -
          <lpage>64</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>F.</given-names>
            <surname>Ibarrola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Bown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Grace</surname>
          </string-name>
          ,
          <article-title>Towards co-creative drawing based on contrastive language-image models</article-title>
          ,
          <source>in: The 13th International Conference on Computational Creativity (ICCC'22)</source>
          , volume
          <volume>10</volume>
          ,
          <year>2022</year>
          , p.
          <fpage>2</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zammit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Liapis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. N.</given-names>
            <surname>Yannakakis</surname>
          </string-name>
          ,
          <article-title>Seeding diversity into ai art</article-title>
          ,
          <source>Proceedings of the Thirteen International Conference on Computational Creativity</source>
          , ICCC'22 (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>C.</given-names>
            <surname>Rudin</surname>
          </string-name>
          ,
          <article-title>Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead</article-title>
          ,
          <source>Nature machine intelligence</source>
          <volume>1</volume>
          (
          <year>2019</year>
          )
          <fpage>206</fpage>
          -
          <lpage>215</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Steinfeld</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Rosé</surname>
          </string-name>
          , J. Zimmerman,
          <article-title>Re-examining whether, why, and how human-ai interaction is uniquely dificult to design</article-title>
          , in
          <source>: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, CHI '20</source>
          ,
          Association for Computing Machinery,
          <year>2020</year>
          , p.
          <fpage>1</fpage>
          -
          <lpage>13</lpage>
          . URL: https://doi.org/10.1145/3313831.3376301. doi:10.1145/3313831.3376301.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Schön</surname>
          </string-name>
          ,
          <article-title>The reflective practitioner: How professionals think in action</article-title>
          , volume
          <volume>5126</volume>
          ,
          Basic Books
          ,
          <year>1984</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>N.</given-names>
            <surname>Collins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ruzicka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Grierson</surname>
          </string-name>
          ,
          <article-title>Remixing AIs: mind swaps, hybrainity, and splicing musical models</article-title>
          , in:
          <source>Proceedings of the Joint Conference on AI Music Creativity</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>R.</given-names>
            <surname>Louie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Coenen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. Z.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Terry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <article-title>Novice-AI music co-creation via AI-steering tools for deep generative models</article-title>
          , in:
          <source>Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, CHI '20</source>
          ,
          Association for Computing Machinery, New York, NY, USA,
          <year>2020</year>
          , p.
          <fpage>1</fpage>
          -
          <lpage>13</lpage>
          . URL: https://doi.org/10.1145/3313831.3376739. doi:10.1145/3313831.3376739.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>U.</given-names>
            <surname>Singer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Polyak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hayes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>An</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Ashual</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Gafni</surname>
          </string-name>
          , et al.,
          <article-title>Make-A-Video: Text-to-video generation without text-video data</article-title>
          ,
          <source>arXiv preprint arXiv:2209.14792</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nichol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Point-E: A system for generating 3D point clouds from complex prompts</article-title>
          ,
          <year>2022</year>
          . URL: https://arxiv.org/abs/2212.08751. doi:10.48550/ARXIV.2212.08751.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>J.</given-names>
            <surname>Achiam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Adler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Akkaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. L.</given-names>
            <surname>Aleman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Altenschmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Altman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Anadkat</surname>
          </string-name>
          , et al.,
          <article-title>GPT-4 technical report</article-title>
          ,
          <source>arXiv preprint arXiv:2303.08774</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>J.</given-names>
            <surname>McCormack</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cruz Gambardella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Rajcic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Krol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Llano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yang</surname>
          </string-name>
          , Is writing
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>