<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Stable Walk: An interactive environment for exploring Stable Diffusion outputs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mattias Rost</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sebastian Andreasson</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Gothenburg</institution>
          ,
          <addr-line>Gothenburg</addr-line>
          ,
          <country country="SE">Sweden</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <abstract>
        <p>The past year saw rapid advances in text-to-image models. Several models were released, along with services that let users generate images from them. These have become popular because, without special training, the models can generate images from a simple text prompt. However, the parameter space of these models goes beyond the text prompt, and skilled users can fine-tune the output of the models using these parameters. In this paper we present ongoing work on a tool for exploring the parameter space of Stable Diffusion. The aim of the tool is to make it possible to explore this parameter space visually. In particular, we present a novel way of exploring the text embedding space by allowing users to combine several prompts.</p>
      </abstract>
      <kwd-group>
        <kwd>ui</kwd>
        <kwd>interactive tool</kwd>
        <kwd>stable diffusion</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The advances in text-to-image models have led to an explosion of creative uses of such models. In 2022, Google announced Imagen [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
and OpenAI announced Dall-E 2 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. A few
months later Midjourney announced an open beta
of their model (https://midjourney.com/), and
StabilityAI announced Stable Diffusion [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
While Imagen is not publicly available, Dall-E 2 is available through an API for a small cost, and Midjourney can be used through Discord for a monthly fee (although a freemium model allows some use for free). Stable Diffusion, on the other hand, is freely available to run locally on one's own machine. This has resulted in a number of online versions of Stable Diffusion with varying pricing models and features (e.g. DreamStudio).
      </p>
      <p>The power of these models comes from the fact that they are able to generate realistic images from a single text prompt. Entering the typical example prompt “an astronaut riding on a horse” will generate an endless stream of variations of astronauts riding on horses. Crafting prompts that render good-looking images has become an art in itself, such that it is now possible to sell and buy them (https://promptbase.com/).</p>
      <p>It is not only possible to steer the output through text: the algorithms used also allow for further parameterisation. The initial seed, guidance scale, and number of inference steps are all parameters of the algorithm that affect the output in different ways. For new users, however, it is not immediately clear how these and other parameters affect the output.</p>
      <p>In this position paper we present a web-based tool that allows users to explore the parameter space of Stable Diffusion. In particular, it opens up the text embedding space of Stable Diffusion for exploration. The text embeddings are the vector representations of the prompts that form the actual input used to condition the output of the diffusion algorithm. The tool lets users visually map a small part of the text embedding space and generate images from within this space. The aim of the tool is to allow exploration of the different inputs and configurations of Stable Diffusion in an active and engaging way. Other tools allow the user to set parameters, but our tool makes this exploration simpler.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Text-to-image models are not the first type of algorithm used to create content in a seemingly automatic fashion. For instance, procedural content generators have been around in computer graphics for a long time. Procedural content generators typically involve mathematical formulas to generate e.g. trees and landscapes, using fractals or Perlin noise. These techniques are commonly used in game production.</p>
      <p>
        It has been argued that the use of such techniques is not widely understood, but rather seen as magic. To combat this, researchers developed Danesh [
        <xref ref-type="bibr" rid="ref4">4</xref>
]. It allows users of Unity to explore the procedural generator’s distribution space, as well as “automatically searching the parameter space for configurations that produce a specific outcome” (ibid). It makes it easier for humans to co-create with the procedural generator.
      </p>
      <p>
        The use of deep learning techniques is, however, more recent. GANs (generative adversarial networks) have dominated as the main technique for image generation. Introduced in 2014, they have been shown to be able to generate realistic images resembling a training dataset, such as faces [
        <xref ref-type="bibr" rid="ref5">5</xref>
]. At last year’s workshop on HAI-GEN, Grabe et al. presented a framework for co-creativity using GANs [6]. They described four interaction patterns that applications implementing GANs allow when used in co-creation between humans and the GAN.
      </p>
      <p>Today’s diffusion-based text-to-image models differ from the typical GAN in important ways. GANs are trained on a dataset and then sampled from to generate images similar to those in the training dataset. While there are conditional GANs, and GANs can be parameterised in other ways to allow some control, they do not allow the same kind of control as e.g. SD. SD is instead pretrained on a wide range of images, from which images can be sampled by describing the desired image in text (conditioning). This means that the use of these models in co-creation differs in important ways from using GANs as described by Grabe et al. Of the four interaction patterns, we would argue that the typical use of SD involves no curation, but instead a lot of exploring. As we understand their pattern of conditioning, it refers to a stricter form of conditioning than how SD is conditioned. The use of diffusion models is better described as a combination of exploring and conditioning. Our tool is one way of opening up the means for this exploring and conditioning.</p>
    </sec>
    <sec id="sec-3">
      <title>2.1. Stable Diffusion UIs</title>
      <p>There are now several online services available that let users generate images using different versions of SD. DreamStudio (dreamstudio.ai), Stable Diffusion Online (stablediffusionweb.com), as well as Hugging Face (https://huggingface.co/spaces/stabilityai/stable-diffusion), offer ways for users to try SD for free. They all take a text prompt and render images. They also let users customize other parameters, such as cfg scale and steps (explained later). These services make the model available for anyone to use.</p>
      <p>It is also possible to run SD locally by downloading the model, and Hugging Face has integrated it into their libraries for ease of use. Developers have also implemented UIs on top of these libraries, of which the most popular is Stable Diffusion Web UI (WebUI) (https://github.com/AUTOMATIC1111/stable-diffusion-webui). WebUI aims to make all possible parameters available to users in an all-encompassing UI. As can be seen in Figure 1, the UI is composed of text boxes, sliders, and radio buttons. While this makes the options available, such controls do not make them intuitive. This is not necessarily negative, since WebUI does not aim to create an intuitive interface for using Stable Diffusion. However, to explore e.g. the effect of a parameter, the user has to manually change the parameter and generate a new set of images with that setting.</p>
      <p>While it is possible to create an X/Y plot varying two parameter values, it forces the user to generate a large array of images in one go, rendering the exploration process less active: the analysis of changes becomes retrospective rather than proactive.</p>
    </sec>
    <sec id="sec-4">
      <title>2.2. Inner Workings of Stable Diffusion</title>
      <p>Next we will describe some of the inner workings of Stable Diffusion. The purpose here is not to describe it in detail, but to make some of the components visible such that they can be referred to in later sections.</p>
      <p>
        Stable Diffusion (SD) is a latent text-to-image diffusion model [
        <xref ref-type="bibr" rid="ref3">3</xref>
]. Essentially, a model is trained to predict the noise in an image, such that the noise can be removed from the noisy image. The result is that an image can be generated from pure noise, by gradually removing more and more noise until an image appears. The model is conditioned and guided with a text prompt such that the resulting image can be described by the text prompt.
      </p>
      <p>In order to accomplish this, SD consists of three components: a variational autoencoder (VAE), a noise prediction model (a U-Net), and a text encoder. The VAE is used to convert a high-dimensional image into a lower-dimensional latent space, and to convert from the latent space back into image space. The text encoder is used to turn a text prompt into a text embedding. As a text encoder, version one of SD uses CLIP from OpenAI [7], whereas version two uses OpenCLIP [8]. These encoders are trained such that the embedding of a text description is close to the embedding of an image described by that description.</p>
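      <p>As an illustration, the following sketch shows how a prompt can be turned into such a text embedding using the Hugging Face diffusers library; the model identifier and exact calls are assumptions about one common setup, not a description of our implementation.</p>
      <preformat>
# Minimal sketch: obtaining the CLIP text embedding that conditions SD v1.
# Assumes the Hugging Face diffusers pipeline; model id and calls may differ
# between library versions.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")

prompt = "an astronaut riding on a horse"
tokens = pipe.tokenizer(
    prompt,
    padding="max_length",
    max_length=pipe.tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt",
)
with torch.no_grad():
    # Shape: (1, 77, 768) for SD v1 -- one 768-d vector per token position.
    text_embedding = pipe.text_encoder(tokens.input_ids)[0]
</preformat>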
      <p>SD is trained with 1000 diffusion steps, but during inference you can choose how many steps to use to denoise the image. More steps generally produce a higher-quality image but take longer than using fewer steps. Users must choose the number of steps to balance speed and quality.</p>
      <p>The whole process can be described by the pseudo code in figure 2. First the text encoding and the initial noise are created. Then, over a number of timesteps, we ask the model to predict both conditioned and unconditioned noise in the latent image. Line 8 implements classifier-free guidance [9]. Line 10 removes the predicted noise from the latent image and adds noise for the current timestep; at the last timestep no noise is added. Finally, the latent image is decoded by the VAE to create the final image.</p>
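      <p>Figure 2 is not reproduced here, but the following sketch, written against the Hugging Face diffusers components, follows the process described above; names, signatures, and the latent size are assumptions that may differ between library versions and from our actual implementation.</p>
      <preformat>
# Sketch of the denoising loop described above (not a reproduction of the
# paper's figure 2). Uses Hugging Face diffusers components.
import torch
from diffusers import LMSDiscreteScheduler, StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
scheduler = LMSDiscreteScheduler.from_config(pipe.scheduler.config)

def generate(text_embedding, uncond_embedding, steps=50, cfg=7.5, seed=0):
    generator = torch.Generator().manual_seed(seed)
    scheduler.set_timesteps(steps)
    # The initial latent "image" is pure noise (4 x 64 x 64 for 512 x 512 output).
    latents = torch.randn((1, 4, 64, 64), generator=generator)
    latents = latents * scheduler.init_noise_sigma

    for t in scheduler.timesteps:
        latent_input = scheduler.scale_model_input(latents, t)
        with torch.no_grad():
            # Predict the noise both conditioned on the prompt and unconditioned.
            noise_cond = pipe.unet(latent_input, t,
                                   encoder_hidden_states=text_embedding).sample
            noise_uncond = pipe.unet(latent_input, t,
                                     encoder_hidden_states=uncond_embedding).sample
        # Classifier-free guidance: push the prediction towards the prompt.
        noise_pred = noise_uncond + cfg * (noise_cond - noise_uncond)
        # Remove the predicted noise for this timestep (LMS update).
        latents = scheduler.step(noise_pred, t, latents).prev_sample

    # Decode the final latents back into image space with the VAE.
    with torch.no_grad():
        image = pipe.vae.decode(latents / 0.18215).sample
    return image
</preformat>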
      <p>The exact implementation of gen_noise, line 10, and the timesteps depend on the algorithm chosen for the reverse diffusion process, e.g. DDPM [10], DDIM [11], or LMS [12]. In this work we use LMS.</p>
      <p>While the text encoder, U-Net, and VAE are pretrained and given by SD, there are certain parameters available for developers and users to set. This paper presents a novel UI for exploring this parameter space in terms of the output of SD.</p>
      <p>From the code, it should be clear that changing the timesteps, cfg, and text_encoding will have an impact, but it is not clear what the impact will be. Our tool helps build that understanding through a visual user interface, which we call Stable Walk.</p>
    </sec>
    <sec id="sec-5">
      <title>3. Stable Walk</title>
      <p>In order to generate images using SD, the most straightforward interface is the text prompt. Users can type in a line of text and the output is an image. However, in order to get better control of the output, the diffusion process takes a number of parameters that users can change. E.g. we can change the number of diffusion steps taken. Picking 100 steps instead of 10 typically generates higher-quality images, but takes 10 times as long. Changing the classifier-free guidance scale changes other aspects of the image. In more advanced tools, these parameters are available as simple inputs. In order to get a better understanding of the impact of these parameters, users must explore the output while varying the parameters.</p>
      <p>We therefore created a tool that lets users explore the parameter space more visually through the outputs of SD. The tool is web-based and interacts with SD through a custom API. In our setup the API runs on a server with an RTX 3090 graphics card that renders an image in just over a second on average.</p>
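      <p>As a purely hypothetical illustration of this setup, a request from the web interface to such a rendering server could look as follows; the endpoint and field names are assumptions, since the actual API is not described here.</p>
      <preformat>
# Hypothetical sketch of the kind of request the web UI could send to a
# rendering server; the paper's actual API is not documented, so the URL and
# field names here are assumptions.
import requests

response = requests.post("http://localhost:8000/generate", json={
    "prompt": "an astronaut riding on a horse",
    "negative_prompt": "",
    "cfg": 7.5,
    "steps": 50,
    "seed": 42,
})
image_bytes = response.content  # the rendered image returned by the server
</preformat>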
      <p>The web interface has two tabs, Grid and Canvas, and under each tab the user can vary parameters in different ways. Both tabs have some controls in common: a base prompt that is added to all prompts; a negative prompt used for negative conditioning; and a seed for the random number generator.</p>
    </sec>
    <sec id="sec-6">
      <title>3.1. Grid</title>
      <p>In the grid view the user can choose to generate images in a grid from a prompt. Positions in the grid determine the values of cfg and steps. The user can choose the number of grid positions, and thus the difference in value between adjacent grid positions. The row in the grid determines steps, and the column determines cfg. The top row is set to 4 steps and the bottom row to 100 steps. The leftmost column sets cfg to 4 and the rightmost to 20. These values were chosen empirically to span a variety of outputs.</p>
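      <p>As a minimal sketch, the mapping from grid positions to parameter values could be computed as follows, assuming evenly spaced values over the ranges given above.</p>
      <preformat>
# Sketch of how grid positions could map to (steps, cfg) values: rows span
# 4..100 steps and columns span cfg 4..20. Even spacing is an illustrative
# assumption about Stable Walk's exact mapping.
import numpy as np

def grid_params(n_rows, n_cols):
    steps_values = np.linspace(4, 100, n_rows).round().astype(int)
    cfg_values = np.linspace(4, 20, n_cols)
    return [
        {"row": r, "col": c, "steps": int(steps_values[r]), "cfg": float(cfg_values[c])}
        for r in range(n_rows) for c in range(n_cols)
    ]

# e.g. a 5x5 grid: the top-left cell uses 4 steps and cfg 4, the bottom-right 100 and 20.
cells = grid_params(5, 5)
</preformat>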
      <p>Initially the grid shows placeholder images. By clicking on an image, a request is sent to the server, which generates the image with the given parameters. To explore how changes in cfg and/or steps affect the output, the user selects images in the grid by taking either small or big steps in the grid. This enables users to actively traverse the parameter space and explore what images the algorithm generates, as shown in figure 3.</p>
    </sec>
    <sec id="sec-6-1">
      <title>3.2. Canvas: exploring the text embedding space</title>
      <p>In the Canvas tab we explore another parameter space, namely that of text embeddings. The UI consists of an infinite, pannable and zoomable canvas. The user starts by adding prompts, which can be placed freely onto the canvas. Once a prompt is placed on the canvas, an image is generated and shown at that position. Once a few prompts have been placed on the canvas, the user can generate images that are combinations of these prompts by clicking on a point between the prompts.</p>
      <p>The combinations are generated by calculating a new text embedding from the text embeddings of the existing text prompts. This is done by taking a linear combination of the embeddings, where the weights correspond to the position of the point on the canvas relative to the prompts. More formally, let the positions of the prompts on the canvas be the 2D coordinates p_i, and let the target point (the 2D coordinate of the mouse cursor) be t. We then find scalar weights w_i such that t = sum(w_i * p_i) and sum(w_i) = 1.</p>
      <p>The target text embedding, e, is then calculated
from the text embeddings of each prompt, pe_i,
according to</p>
      <p>e = sum(w_i * pe_i)</p>
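      <p>A minimal sketch of this computation is shown below. For exactly three prompts the two constraints determine the weights uniquely (barycentric coordinates); with more prompts the system is underdetermined, and the least-squares solve used here is one possible choice rather than a description of our exact implementation.</p>
      <preformat>
# Sketch of the weight computation and embedding combination described above.
# The least-squares solve is an assumption for the case of more than three
# prompts; the paper does not specify this detail.
import numpy as np

def combine_embeddings(points, embeddings, target):
    # points: (n, 2) canvas coordinates p_i of the placed prompts
    # embeddings: (n, 77, 768) text embeddings pe_i of those prompts
    # target: (2,) canvas coordinate t of the clicked point
    n = len(points)
    # Solve t = sum(w_i * p_i) subject to sum(w_i) = 1 by stacking the
    # affine constraint onto the coordinate equations.
    A = np.vstack([np.asarray(points, dtype=float).T, np.ones((1, n))])  # shape (3, n)
    b = np.append(np.asarray(target, dtype=float), 1.0)                  # shape (3,)
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    # e = sum(w_i * pe_i): linear combination of the prompt embeddings.
    e = np.tensordot(w, np.asarray(embeddings), axes=1)
    return w, e
</preformat>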
      <p>By moving the target point closer to a
particular prompt, the intuition is that the resulting
embedding will contain more of that prompt and
less of the others.</p>
      <p>By clicking on the canvas, new images are generated at the clicked position using the weights corresponding to that position; the user thus essentially samples the embedding space between the entered prompts. Zooming in on the canvas makes it easy to make very fine-tuned transitions between weight values.</p>
    </sec>
    <sec id="sec-7">
      <title>4. Discussion and use of Stable Walk</title>
      <p>Using Stable Diffusion and other text-to-image models is certainly Human-AI co-creation. Using prompt engineering, users can craft prompts that guide the model towards an outcome in an iterative way. With more practice, users become better at knowing which prompts lead to more desired outcomes. While it is easy to generate a generic image of a cat, getting a cat in a particular style, in a particular setting, with a particular look and feel, requires more nuance. Further, crafting prompts will only get the user so far. There are more parameters than prompts, such as cfg and step counts, as we have discussed in this paper. By looking inside these algorithms we may also find a bigger parameter space than what the standard tools make available. In this work we have started looking into a UI that allows users to quickly explore cfg and steps, but also to go deeper into the space of text embeddings.</p>
      <p>The grid view enables a quick way of exploring the effect of cfg and steps for a particular prompt. Other tools, such as WebUI, enable this in other ways. In WebUI you can generate a grid of images and choose parameters to vary over rows and columns, similar to Stable Walk. This comes at a cost, since you have to generate images for the entire grid. While it makes for a more exhaustive search, as it enumerates the entire grid, we find that doing this more manually (by tapping each image in a grid) in Stable Walk makes the search an active process, which aids learning. While being given a full grid of images directly might let you find a desired output quickly, it does not let you actively consider how the outputs vary with the parameters. We think that making this search explicit and an active task for the user aids exploration beyond the current prompt.</p>
    </sec>
    <sec id="sec-8">
      <title>4.1. Exploring text embeddings</title>
      <p>One way to interpret the linear combinations of text embeddings is to consider the text embeddings as vector representations of each text prompt. These vector representations are such that prompts with similar semantic meaning will be close to each other and those with different meanings will be further apart. By taking linear combinations of text prompts, we allow ourselves to move between two or more points in this embedding space. The question is how the model will interpret more ambiguous embeddings, such as one described by a point between the prompts “ant” and “horse”. Such an embedding gives much more control than simply prompting “ant horse”, and allows for fine-grained exploration. An example of an image generated from a text embedding between “ant” and “horse” is shown in Figure 5.</p>
      <p>In our own explorations we have found that animals, people, and cities work particularly well. For instance, the combination of cat, owl, and hedgehog builds an expressive space of images from which you can sample (an example of which can be seen in Figure 6). Prompts that are semantically further apart, such as an animal and a person, have interesting properties. We find that there are particular points in the space where there is a sharp shift in outputs. E.g. generating images from text embeddings between “lobster” and “donald trump” mostly outputs lobsters when close to “lobster” and faces when close to “donald trump”, but somewhere in between there is a point around which most of the different images are rendered. This is in contrast to outputs between “cat” and “dog”, which show a smoother transition between cat and dog.</p>
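      <p>The following sketch illustrates this kind of sweep between two prompts; embed() and generate() stand in for the text-encoding and denoising steps sketched earlier and are assumptions, not the tool’s actual code.</p>
      <preformat>
# Sketch of a sweep between two prompt embeddings, looking for the point where
# the output shifts character. embed() and generate() refer to the hypothetical
# helpers sketched in earlier sections; they are assumptions, not the paper's code.
import numpy as np

emb_a = embed("lobster")
emb_b = embed("donald trump")
uncond = embed("")  # unconditional embedding for classifier-free guidance

for w in np.linspace(0.0, 1.0, 11):
    mixed = (1.0 - w) * emb_a + w * emb_b
    image = generate(mixed, uncond, steps=50, cfg=7.5, seed=42)
    # Inspect where along the sweep the outputs flip from lobsters to faces.
</preformat>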
    </sec>
    <sec id="sec-9">
      <title>5. Future Work</title>
      <p>While in this work we have focussed on cfg, steps, and text embeddings, we want to continue this work to incorporate other aspects of the diffusion process as well. E.g. can we control the output by parameterising cfg over each timestep in the process? In this way the user could choose a high cfg at the beginning of the process and lower it as the process comes to completion.</p>
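      <p>A minimal sketch of such a schedule is shown below; the linear ramp is an illustrative assumption and not part of the current tool.</p>
      <preformat>
# Sketch of the future direction described above: letting cfg vary over the
# denoising timesteps instead of being a single constant. The linear schedule
# is an illustrative assumption.
import numpy as np

def cfg_schedule(steps, cfg_start=15.0, cfg_end=5.0):
    # High guidance early (to lock in composition), lower guidance later.
    return np.linspace(cfg_start, cfg_end, steps)

# Inside the denoising loop sketched earlier, the constant cfg would become:
#   noise_pred = noise_uncond + cfg_schedule(steps)[i] * (noise_cond - noise_uncond)
</preformat>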
      <p>The same thing could be done with the weights of the text embedding. E.g. one could start the process from a prompt and then gradually move towards a different prompt by moving through text embedding space over the denoising process. We also intend to look into ways to create a UI on top of the prompt-to-prompt technique, which adds the capability to edit images by modifying prompts directly [13].</p>
    </sec>
    <sec id="sec-10">
      <title>6. Conclusions</title>
      <p>We have presented ongoing work on a web tool that lets users explore the parameter space of Stable Diffusion. There is more to text-to-image models than simply an input prompt. Such models have inner workings that can be played around with to reach desired and creative outputs. While models are improving at a rapid pace, we believe there is still more to be learned about how users may interact with these models beyond prompts.</p>
    </sec>
    <sec id="sec-11">
      <title>7. REFERENCES</title>
      <p>generative adversarial networks. arXiv
eprints." arXiv preprint
arXiv:1812.04948 (2018).
[6] Imke Grabe, Miguel González-Duque,
Sebastian Risi, and Jichen Zhu. "Towards a
Framework for Human-AI Interaction
Patterns in Co-Creative GAN Applications."
(2022).
[7] Alec Radford, Jong Wook Kim, Chris
Hallacy, Aditya Ramesh, Gabriel Goh,
Sandhini Agarwal, Girish Sastry et al.
"Learning transferable visual models from
natural language supervision."
In International Conference on Machine
Learning, pp. 8748-8763. PMLR, 2021.
[8] Mehdi Cherti, Romain Beaumont, Ross
Wightman, Mitchell Wortsman, Gabriel
Ilharco, Cade Gordon, Christoph
Schuhmann, Ludwig Schmidt, and Jenia
Jitsev. "Reproducible scaling laws for
contrastive language-image learning." arXiv
preprint arXiv:2212.07143 (2022).
[9] Jonathan Ho, and Tim Salimans.
"Classifierfree diffusion guidance." arXiv preprint
arXiv:2207.12598 (2022).
[10] Jonathan Ho, Ajay Jain, and Pieter Abbeel.
"Denoising diffusion probabilistic
models." Advances in Neural Information
Processing Systems 33 (2020): 6840-6851.
[11] Jiaming Song, Chenlin Meng, and Stefano
Ermon. "Denoising diffusion implicit
models." arXiv preprint
arXiv:2010.02502 (2020).
[12] Tero Karras, Miika Aittala, Timo Aila, and
Samuli Laine. "Elucidating the Design Space
of Diffusion-Based Generative
Models." arXiv preprint
arXiv:2206.00364 (2022).
[13] Amir Hertz, Ron Mokady, Jay Tenenbaum,
Kfir Aberman, Yael Pritch, and Daniel
Cohen-Or. "Prompt-to-prompt image editing
with cross attention control." arXiv preprint
arXiv:2208.01626 (2022).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Chitwan</given-names>
            <surname>Saharia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>William</given-names>
            <surname>Chan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Saurabh</given-names>
            <surname>Saxena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Lala</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jay</given-names>
            <surname>Whang</surname>
          </string-name>
          , Emily Denton, Seyed Kamyar Seyed Ghasemipour et al.
          <article-title>"Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding."</article-title>
          <source>arXiv preprint arXiv:2205.11487</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Aditya</given-names>
            <surname>Ramesh</surname>
          </string-name>
          , Prafulla Dhariwal, Alex Nichol, Casey Chu, and
          <string-name>
            <given-names>Mark</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <article-title>"Hierarchical text-conditional image generation with clip latents</article-title>
          .
          <source>" arXiv preprint arXiv:2204.06125</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Robin</given-names>
            <surname>Rombach</surname>
          </string-name>
          , Andreas Blattmann, Dominik Lorenz, Patrick Esser, and
          <string-name>
            <given-names>Björn</given-names>
            <surname>Ommer</surname>
          </string-name>
          .
          <article-title>"High-resolution image synthesis with latent diffusion models."</article-title>
          <source>In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          , pp.
          <fpage>10684</fpage>
          -
          <lpage>10695</lpage>
          .
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Michael</given-names>
            <surname>Cook</surname>
          </string-name>
          , Jeremy Gow,
          <string-name>
            <given-names>Gillian</given-names>
            <surname>Smith</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Simon</given-names>
            <surname>Colton</surname>
          </string-name>
          .
          <article-title>"Danesh: Interactive tools for understanding procedural content generators</article-title>
          .
          <source>" IEEE Transactions on Games</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Tero</given-names>
            <surname>Karras</surname>
          </string-name>
          , Samuli Laine, and
          <string-name>
            <given-names>Timo</given-names>
            <surname>Aila</surname>
          </string-name>
          .
          <article-title>"A style-based generator architecture for</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>