=Paper= {{Paper |id=Vol-3359/paper10 |storemode=property |title=Stable Walk: An interactive environment for exploring Stable Diffusion outputs |pdfUrl=https://ceur-ws.org/Vol-3359/paper10.pdf |volume=Vol-3359 |authors=Mattias Rost,Sebastian Andreasson |dblpUrl=https://dblp.org/rec/conf/iui/RostA23 }} ==Stable Walk: An interactive environment for exploring Stable Diffusion outputs== https://ceur-ws.org/Vol-3359/paper10.pdf
Stable Walk: An interactive environment for exploring Stable
Diffusion outputs
Mattias Rost 1 and Sebastian Andreasson 1
1 University of Gothenburg, Gothenburg, Sweden


                                   Abstract
                                   The past year saw rapid advances in text-to-image models. Several models were released, and services were made available that let users generate images. These have become popular because, without special training, users can generate images from a simple text prompt. However, the parameter space of these models goes beyond the text prompt, and skilled users can fine-tune the output of the models using these parameters. In this paper we present ongoing work on a tool for exploring the parameter space of Stable Diffusion. The aim of the tool is to make it possible to explore this parameter space visually. In particular, we present a novel way of exploring the text embedding space by allowing users to combine several prompts.

                                   Keywords 1
                                   ui, interactive tool, stable diffusion


1. Introduction

    The advances in text-to-image models have led to an explosion in creative uses of such models. In 2022, Google announced Imagen [1] and OpenAI announced Dall-E 2 [2]. A few months later, Midjourney announced an open beta of their model (https://midjourney.com/), and StabilityAI announced Stable Diffusion [3]. While Imagen is not publicly available, Dall-E 2 is available through an API for a small cost, and Midjourney can be used through Discord for a monthly fee (although a freemium model allows some use for free). Stable Diffusion, on the other hand, is freely available to run locally on your own machine. This has resulted in a number of online versions of Stable Diffusion with varying pricing models and features (e.g. DreamStudio).
    The power of these models comes from the fact that they are able to generate realistic images from a single text prompt. Entering the typical example prompt “an astronaut riding on a horse” will generate an endless stream of variations of astronauts riding on horses. Crafting prompts that render good-looking images has become an art in itself, such that it is now possible to sell and buy them (https://promptbase.com/).
    But it is not only possible to steer the output through text; the algorithms used also allow for more parameterisation. Initial seed, guidance scale, and inference steps are all parameters to the algorithm that affect the output in different ways. However, it is not immediately clear to new users how these and other parameters affect the output.
    In this position paper we present a web-based tool that allows users to explore the parameter space of Stable Diffusion. In particular, it opens up Stable Diffusion for exploring the text embedding space. The text embeddings are the vector representations of the prompts, and they are the actual input that conditions the output of the diffusion algorithm. The tool lets users visually map a small part of the text embedding space and generate images from within this space. The aim of the tool is to allow exploration of the different inputs and configurations of Stable Diffusion in an active and engaging way. Other tools allow the user to set parameters, but our tool makes this exploration simpler.

Joint Proceedings of the ACM IUI Workshops 2023, March 2023,
Sydney, Australia
EMAIL: mattias.rost@ait.gu.se (A. 1); sebastian.andreasson@ait.gu.se (A. 2)
ORCID: XXXX-XXXX-XXXX-XXXX (A. 1); XXXX-XXXX-XXXX-XXXX (A. 2)
                               Copyright © 2023 for this paper by its authors. Use permitted under Creative
                               Commons License Attribution 4.0 International (CC BY 4.0).
    CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073, http://ceur-ws.org)
2. Related Work

    Text-to-image models are not the first type of algorithm used to create content in a seemingly automatic fashion. For instance, procedural content generators have been around in computer graphics for a long time. Procedural content generators typically involve mathematical formulas to generate e.g. trees and landscapes, using fractals or Perlin noise. These techniques are commonly used in game production.
    It has been argued that the use of such techniques is not widely understood, but rather seen as magic. In order to combat this, researchers developed Danesh [4]. It allows users of Unity to explore a procedural generator's distribution space, as well as “automatically searching the parameter space for configurations that produce a specific outcome” (ibid). It simplifies the ability for humans to co-create with the procedural generator.
    The use of deep learning techniques is however more recent. GANs (generative adversarial networks) have dominated as the main technique for image generation. Developed in 2016, they have been shown to be able to generate realistic images resembling a training dataset, such as faces [5]. At last year's workshop on HAI-GEN, Grabe et al. presented a framework for co-creativity using GANs [6]. They showed four interaction patterns that applications implementing GANs allow when used in co-creation between humans and the GAN.
    Today's diffusion text-to-image models differ from the typical GAN in important ways. GANs are trained on datasets and then sampled from to generate images like the ones in the training dataset. While there are conditional GANs, and GANs are in other ways parameterised to allow some control, they do not allow the same kind of control as e.g. SD. Instead, SD is pre-trained on a wide range of images, from which images can be sampled by describing the image in text (conditioning). This means the use of these models in co-creation differs in important ways from using GANs as described by Grabe et al. Of the four interaction patterns, we would argue that the typical use of SD involves no curation, but instead a lot of exploring. The way we understand their pattern of conditioning, it refers to a stricter form of conditioning than how SD is conditioned. The use of diffusion models is better described as a combination of exploring and conditioning. Our tool is one way of opening up the means for doing this exploring and conditioning.

2.1.    Stable Diffusion UIs

    There are now several online services available that let users generate images using different versions of SD. DreamStudio (dreamstudio.ai), Stable Diffusion Online (stablediffusionweb.com), as well as Hugging Face (https://huggingface.co/spaces/stabilityai/stable-diffusion) offer ways for users to try SD for free. They all take a text prompt and render images. They also let you customize other parameters, such as cfg scale and steps (explained later). These services make the model available for anyone to use.
    It is also possible to run SD locally by downloading the model, and Hugging Face has integrated it into their libraries for ease of use. Developers have also implemented UIs on top of these libraries, of which the most popular one is Stable Diffusion Web UI (WebUI) (https://github.com/AUTOMATIC1111/stable-diffusion-webui). WebUI aims to make all possible parameters available to users in an all-encompassing UI. As can be seen in Figure 1, the UI is composed of text boxes, sliders, and radio buttons. While this makes the options available, such controls do not make them intuitive. This is not necessarily negative, since WebUI does not aim to create an intuitive interface for using Stable Diffusion. However, to explore e.g. the effect of a parameter, the user has to manually change the parameter and generate a new set of images with that setting.
    While it is possible to create an X/Y plot varying two parameter values, it forces the user to generate a large array of images in one go, rendering the exploration process less active, where the process of analysing the change becomes retroactive rather than proactive.
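    To make the point about running SD locally through the Hugging Face libraries concrete, the following is a minimal sketch using the diffusers library. The model id, prompt, and parameter values are illustrative, a CUDA-capable GPU is assumed, and this is not the code of any of the tools discussed here:

    # Minimal sketch: running Stable Diffusion locally with Hugging Face diffusers.
    # Model id, prompt, and parameter values are illustrative.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    )
    pipe = pipe.to("cuda")

    image = pipe(
        "an astronaut riding on a horse",
        guidance_scale=7.5,        # cfg scale
        num_inference_steps=50,    # diffusion steps
        generator=torch.Generator("cuda").manual_seed(42),  # initial seed
    ).images[0]
    image.save("astronaut.png")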
Figure 1. A screenshot of the WebUI main interface, illustrating the layout and components of the UI.


2.2.    Inner Workings of Stable Diffusion

    Next we will describe some of the inner workings of Stable Diffusion. The purpose here is not to describe it in detail, but to make some of the components visible such that they can be referred to in later sections.
    Stable Diffusion (SD) is a latent text-to-image diffusion model [3]. Essentially, a model is trained to predict the noise in a noisy image, such that the noise can be removed from that image. The result is that an image can be generated from pure noise, by gradually removing more and more noise until an image appears. The model is conditioned and guided with a text prompt such that the resulting image can be described by the text prompt.
    In order to accomplish this, SD consists of three components: a variational autoencoder (VAE), a noise prediction model, and a text encoder. The VAE is used to convert a high-dimensional image into a lower-dimensional latent space, and back from the latent space into image space. The text encoder is used to turn a text prompt into a text embedding. As a text encoder, version one of SD uses CLIP from OpenAI [7], whereas version two uses OpenCLIP [8]. These encoders are trained such that the embedding of a text description is close to the embedding of an image matching that description.
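    For illustration, a minimal sketch of how a prompt becomes a text embedding with the CLIP text encoder used by SD version one. The Hugging Face model id and the shapes shown below are assumptions based on that encoder, not part of Stable Walk:

    # Sketch: turning a prompt into the text embedding that conditions the diffusion model.
    import torch
    from transformers import CLIPTokenizer, CLIPTextModel

    tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
    text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

    tokens = tokenizer(
        "an astronaut riding on a horse",
        padding="max_length", max_length=tokenizer.model_max_length,
        truncation=True, return_tensors="pt",
    )
    with torch.no_grad():
        text_embedding = text_encoder(tokens.input_ids)[0]  # last hidden state

    print(text_embedding.shape)  # e.g. torch.Size([1, 77, 768]) for SD v1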
    During training, SD is trained with 1000 diffusion steps, but during inference you can choose how many steps to use when denoising the image. More steps generally produce a higher-quality image but take longer than fewer steps, so users must choose a step count that balances speed and quality.
    The whole process can be described by the pseudo-code in Figure 2. First the text encoding and the initial noise are created. Then, over a number of timesteps, we ask the model to predict both conditioned and unconditioned noise in the latent image.
Line 8 implements classifier-free guidance [9]. Line 10 removes the predicted noise from the latent image and adds noise for the current timestep; at the last timestep no noise is added. Finally, the latent image is decoded by the VAE to create the final image.
    The exact implementation of gen_noise, line 10, and the timesteps depends on the algorithm used for the reverse diffusion process, e.g. DDPM [10], DDIM [11], or LMS [12]. In this work we use LMS.
    While the text encoder, U-Net, and VAE are pretrained and given by SD, there are certain parameters available for developers and users to set. This paper presents a novel UI for exploring this parameter space in terms of output from SD. From the pseudo-code, it should be clear that changing the timesteps, cfg, and text_encoding should have an impact, but it is not clear what that impact will be. Our tool helps with that understanding through a visual user interface that we call Stable Walk.

Figure 2. Pseudo-code listing of generating images using Stable Diffusion.
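    The listing in Figure 2 survives here only as a caption; as an approximate, hedged reconstruction of that procedure, the sketch below uses Hugging Face diffusers components (U-Net, scheduler, VAE) with illustrative names. The classifier-free guidance step and the noise update referred to above as lines 8 and 10 correspond to the commented lines; in diffusers, the scheduler's step function plays the role of gen_noise:

    # Sketch of the SD generation loop (reconstruction of the procedure in Figure 2,
    # written against Hugging Face diffusers components; names are illustrative).
    import torch

    def generate(prompt, unet, scheduler, vae, text_encoder, tokenizer,
                 steps=50, cfg=7.5, seed=42, device="cuda"):
        generator = torch.Generator(device).manual_seed(seed)

        # 1. Text encoding: conditional and unconditional (empty prompt) embeddings.
        def encode(p):
            tokens = tokenizer(p, padding="max_length",
                               max_length=tokenizer.model_max_length,
                               truncation=True, return_tensors="pt").input_ids.to(device)
            return text_encoder(tokens)[0]
        cond, uncond = encode(prompt), encode("")

        # 2. Initial noise in latent space.
        latents = torch.randn((1, unet.config.in_channels, 64, 64),
                              generator=generator, device=device)
        scheduler.set_timesteps(steps)
        latents = latents * scheduler.init_noise_sigma

        # 3. Reverse diffusion over the chosen timesteps.
        for t in scheduler.timesteps:
            latent_in = scheduler.scale_model_input(latents, t)
            noise_uncond = unet(latent_in, t, encoder_hidden_states=uncond).sample
            noise_cond = unet(latent_in, t, encoder_hidden_states=cond).sample
            # Classifier-free guidance ("line 8" in Figure 2).
            noise = noise_uncond + cfg * (noise_cond - noise_uncond)
            # Remove predicted noise and re-noise for the current timestep
            # ("line 10"; scheduler.step plays the role of gen_noise).
            latents = scheduler.step(noise, t, latents).prev_sample

        # 4. Decode latents into image space (conversion to a PIL image omitted).
        image = vae.decode(latents / vae.config.scaling_factor).sample
        return image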
3. Stable Walk

    In order to generate images using SD, the most straightforward interface is the text prompt: users type in a line of text and the output is an image. However, in order to get better control of the output, the diffusion process takes a number of parameters that users can change. For example, we can change the number of diffusion steps taken. Picking 100 steps instead of 10 typically generates higher-quality images, but takes 10 times as long to produce. Changing the classifier-free guidance scale changes other aspects of the image. In more advanced tools, these parameters are available as simple inputs. In order to get a better understanding of the impact of these parameters, users must explore the output while varying the parameters.
    We therefore created a tool that lets users explore the parameter space more visually through the outputs of SD. The tool is web based and interacts with SD through a custom API. In our setup the API is run on a server with an RTX 3090 graphics card, which renders images in just over a second on average.
    The web interface has two tabs, Grid and Canvas, and under each tab you can vary parameters in different ways. Both tabs have some controls in common: a base prompt that is added to all prompts; a negative prompt used for negative conditioning; and a seed for the random number generator.

3.1.    Grid

    In the grid view the user can choose to generate images in a grid from a prompt. Positions in the grid determine the values of cfg and steps. The user can choose the number of grid positions and thus the difference in value between grid positions. The row in the grid determines steps, and the column determines cfg. The top row is set to 4 steps and the bottom row to 100 steps; the leftmost column sets cfg to 4 and the rightmost to 20. These values were chosen empirically to span a variety of outputs.
    Initially the grid shows placeholder images. By clicking on an image, a request is sent to the server, which generates the image with the given parameters. To explore how changes in cfg and/or steps affect the output, the user selects images in the grid by taking either small or big steps in the grid. This enables the user to actively traverse the parameter space and explore what images are generated by the algorithm, as shown in Figure 3.
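    As an illustrative sketch (not the tool's actual code) of how a grid cell could map to parameter values under the ranges described above:

    # Illustrative mapping from a grid cell to (steps, cfg), matching the ranges
    # described above: rows span 4-100 steps, columns span cfg 4-20.
    def grid_to_params(row, col, n_rows, n_cols,
                       steps_range=(4, 100), cfg_range=(4.0, 20.0)):
        steps = round(steps_range[0] + (steps_range[1] - steps_range[0]) * row / (n_rows - 1))
        cfg = cfg_range[0] + (cfg_range[1] - cfg_range[0]) * col / (n_cols - 1)
        return steps, cfg

    # Example with a 5x5 grid: top-left cell is (4 steps, cfg 4), bottom-right (100 steps, cfg 20).
    print(grid_to_params(0, 0, 5, 5))   # (4, 4.0)
    print(grid_to_params(4, 4, 5, 5))   # (100, 20.0)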
3.2.    Canvas - exploring the text embedding space

    In the Canvas tab we explore another parameter space, namely that of text embeddings. The UI consists of an infinitely pannable and zoomable canvas. The user starts by adding prompts that can be placed freely onto the canvas. Once a prompt is placed on the canvas, an image is generated and shown there. Once a few prompts have been placed, the user can generate images that are combinations of these prompts by clicking on a point between the prompts.
Figure 3. Grid view with prompt “Starry night in style of monet”.




Figure 4. The UI of Canvas, showing three prompts “Cat”, “Owl”, and “Hedgehog” laid out in three
corners of a triangle. Seven images have been generated as combinations of those three prompts.
    The combinations are generated by calculating a new text embedding from the text embeddings of the existing prompts. This is done by taking a linear combination of the embeddings, where the weights correspond to the relative position of the clicked point on the canvas in relation to the prompts. More formally, if the positions of the prompts on the canvas are the 2D coordinates p_i, and the target point (the 2D coordinate of the mouse cursor) is t, then we find scalar weights w_i such that:

    t = sum(w_i * p_i), and sum(w_i) = 1.

    The target text embedding, e, is then calculated from the text embeddings of each prompt, pe_i, according to

    e = sum(w_i * pe_i)

    By moving the target point closer to a particular prompt, the intuition is that the resulting embedding will contain more of that prompt and less of the others.
    By clicking on the canvas, new images are generated at the clicked position, with weights corresponding to that position. In other words, by clicking on the canvas the user essentially samples the embedding space between the entered prompts. By zooming in on the canvas, it is easy to make very fine-tuned transitions between weight values.
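    One way to compute such weights and the blended embedding is sketched below. The formulas above only state the two constraints; the least-squares solver used here (which reduces to barycentric coordinates for three prompts) is one possible choice, shown purely for illustration:

    # Sketch: solving for weights w_i with t = sum(w_i * p_i) and sum(w_i) = 1,
    # then blending the prompt embeddings. The solver choice is an assumption.
    import numpy as np

    def blend_embeddings(points, target, embeddings):
        """points: (n, 2) prompt positions, target: (2,) cursor position,
        embeddings: n text embeddings, one per prompt."""
        # Stack the constraints: each coordinate of t, plus the sum-to-one row.
        A = np.vstack([np.asarray(points, dtype=float).T, np.ones(len(points))])  # (3, n)
        b = np.append(np.asarray(target, dtype=float), 1.0)                       # (3,)
        w, *_ = np.linalg.lstsq(A, b, rcond=None)                                 # weights w_i

        flat = np.stack([np.asarray(e).reshape(-1) for e in embeddings])          # (n, d)
        e = (w[:, None] * flat).sum(axis=0)                                       # e = sum(w_i * pe_i)
        return w, e.reshape(np.asarray(embeddings[0]).shape)

    # Example: three prompts at triangle corners, cursor at the centroid -> w ~ [1/3, 1/3, 1/3].
    pts = [(0, 0), (1, 0), (0, 1)]
    embs = [np.random.randn(77, 768) for _ in pts]
    w, e = blend_embeddings(pts, (1/3, 1/3), embs)
    print(np.round(w, 3))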
Figure 5. An image generated from 50% ant, 50% horse.

4. Discussion and use of Stable Walk

    Using Stable Diffusion and other text-to-image models is certainly human-AI co-creation. Using prompt engineering, users can craft prompts that guide the model towards an outcome in an iterative way. With more training, users become better at knowing which prompts lead to more desired outcomes. While it is easy to generate a generic image of a cat, getting a cat in a particular style, in a particular setting, with a particular look and feel, requires more nuance. Further, crafting prompts will only get the user so far. There are more parameters than prompts, such as cfg and step counts, as we have discussed in this paper. By looking inside these algorithms we may also find a bigger parameter space than what the standard tools make available. In this work we have started looking into a UI that allows users to quickly explore cfg and steps, but also to go deeper into the space of text embeddings.
    The grid view enables a quick way of exploring the effect of cfg and steps for a particular prompt. Other tools, such as WebUI, enable this in other ways. In WebUI you can generate a grid of images and choose parameters to vary over rows and columns, similar to Stable Walk. This comes at a cost, since you have to generate images for the entire grid. While it makes for a more exhaustive search, as it enumerates the entire grid, we find that doing this more manually (by tapping each image in a grid) in Stable Walk makes the search an active process, which aids learning. While being given a full grid of images directly might let you find a desired output quickly, it does not let you actively consider how the outputs vary with the parameters. We think that making this search more explicit and an active task for the user aids exploration beyond the current prompt.

4.1.    Exploring text embeddings

    One way to interpret the linear combinations of text embeddings is to consider the text embeddings as vector representations of each text prompt. These vector representations are such that prompts with similar semantic meaning will be close to each other, and those with different meanings will be further apart. By taking linear combinations of text prompts, we allow ourselves to move between two or more points in this embedding space. The question is how the model will interpret more ambiguous embeddings, such as one described by a point between the prompts “ant” and “horse”.
Such an embedding gives much more control than simply prompting “ant horse”, and allows for fine-grained exploration. An example of an image generated from a text embedding between “ant” and “horse” is shown in Figure 5.
    In our own explorations we have found that animals, people, and cities work particularly well. For instance, the combination of cat, owl, and hedgehog builds an expressive space of images from which you can sample (an example can be seen in Figure 6). Prompts that are semantically further apart, such as an animal and a person, have interesting properties. We find that there are particular points in the space where there is a sharp shift in outputs. For example, generating images from text embeddings between “lobster” and “donald trump” mostly outputs lobsters when close to “lobster” and faces when close to “donald trump”, but somewhere in between there is a point around which most of the different images are rendered. This is in contrast to outputs between “cat” and “dog”, which show a smoother transition between cat and dog.

Figure 6. A combination of cat, owl, and hedgehog.

5. Future Work

    While in this work we have focussed on cfg, steps, and text embeddings, we want to continue this work to incorporate other aspects of the diffusion process as well. For example, can we control the output by parameterising cfg over each timestep in the process? In this way the user could choose a high cfg in the beginning of the process and lower it as the process comes to completion. The same thing could be done with the weights of the text embedding: one can start the process from one prompt and then gradually move towards a different prompt by moving through text embedding space over the denoising process. We also intend to look into ways to create a UI on top of the prompt-to-prompt technique, which adds the capability to edit images by modifying prompts directly [13].
                                                       prompts.


6. Conclusions

    We have presented ongoing work on a web tool that lets users explore the parameter space of Stable Diffusion. There is more to text-to-image models than simply an input prompt. Such models have inner workings that can be played around with to reach desired and creative outputs. While the models are improving at a rapid pace, we believe there is still more to be learned about how users may interact with these models beyond prompts.

7. References

[1] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour et al. "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding." arXiv preprint arXiv:2205.11487 (2022).
[2] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. "Hierarchical text-conditional image generation with CLIP latents." arXiv preprint arXiv:2204.06125 (2022).
[3] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. "High-resolution image synthesis with latent diffusion models." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684-10695. 2022.
[4] Michael Cook, Jeremy Gow, Gillian Smith, and Simon Colton. "Danesh: Interactive tools for understanding procedural content generators." IEEE Transactions on Games (2021).
[5] Tero Karras, Samuli Laine, and Timo Aila. "A style-based generator architecture for generative adversarial networks." arXiv preprint arXiv:1812.04948 (2018).
[6] Imke Grabe, Miguel González-Duque, Sebastian Risi, and Jichen Zhu. "Towards a Framework for Human-AI Interaction Patterns in Co-Creative GAN Applications." (2022).
[7] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry et al. "Learning transferable visual models from natural language supervision." In International Conference on Machine Learning, pp. 8748-8763. PMLR, 2021.
[8] Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. "Reproducible scaling laws for contrastive language-image learning." arXiv preprint arXiv:2212.07143 (2022).
[9] Jonathan Ho and Tim Salimans. "Classifier-free diffusion guidance." arXiv preprint arXiv:2207.12598 (2022).
[10] Jonathan Ho, Ajay Jain, and Pieter Abbeel. "Denoising diffusion probabilistic models." Advances in Neural Information Processing Systems 33 (2020): 6840-6851.
[11] Jiaming Song, Chenlin Meng, and Stefano Ermon. "Denoising diffusion implicit models." arXiv preprint arXiv:2010.02502 (2020).
[12] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. "Elucidating the Design Space of Diffusion-Based Generative Models." arXiv preprint arXiv:2206.00364 (2022).
[13] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. "Prompt-to-prompt image editing with cross attention control." arXiv preprint arXiv:2208.01626 (2022).