Stable Walk: An interactive environment for exploring Stable Diffusion outputs

Mattias Rost and Sebastian Andreasson
University of Gothenburg, Gothenburg, Sweden

Abstract
The past year has seen rapid advances in text-to-image models. Several models were released, and services were made available that let users generate images. These have become popular because, without any special training, the models can generate images from a simple text prompt. However, the parameter space of these models goes beyond the text prompt, and skilled users can fine-tune the output of the models using these parameters. In this paper we present ongoing work on a tool for exploring the parameter space of Stable Diffusion. The aim of the tool is to make it possible to explore this parameter space visually. In particular, we present a novel way of exploring the text embedding space by allowing users to combine several prompts.

Keywords
ui, interactive tool, stable diffusion

Joint Proceedings of the ACM IUI Workshops 2023, March 2023, Sydney, Australia
EMAIL: mattias.rost@ait.gu.se; sebastian.andreasson@ait.gu.se

1. Introduction

The advances in text-to-image models have led to an explosion of creative uses of such models. In 2022, Google announced Imagen [1] and OpenAI announced Dall-E 2 [2]. A few months later Midjourney announced an open beta of their model (https://midjourney.com/), and StabilityAI announced Stable Diffusion [3]. While Imagen is not publicly available, Dall-E 2 is available through an API for a small cost, and Midjourney can be used through Discord for a monthly fee (although a freemium model allows some use for free). Stable Diffusion, on the other hand, is freely available to run locally on your own machine. This has resulted in a number of online versions of Stable Diffusion with varying pricing models and features (e.g. DreamStudio).

The power of these models comes from the fact that they are able to generate realistic images from a single text prompt. Entering the typical example prompt “an astronaut riding on a horse” will generate an endless stream of variations of astronauts riding on horses. Crafting prompts that render good-looking images has become an art in itself, such that it is now possible to sell and buy them (https://promptbase.com/).

But it is not only possible to steer the output through text; the algorithms also allow for further parameterisation. Initial seed, guidance scale, and inference steps are all parameters to the algorithm that affect the output in different ways. However, it is not immediately clear to new users how these and other parameters affect the output.

In this position paper we present a web-based tool that allows users to explore the parameter space of Stable Diffusion. In particular, it opens up Stable Diffusion to exploration of the text embedding space. The text embeddings are the vector representations of the prompts that are the actual input used to condition the output of the diffusion algorithm. The tool lets users visually map a small part of the text embedding space, and generate images from within this space. The aim of the tool is to allow exploration of the different inputs and configurations of Stable Diffusion in an active and engaging way. Other tools allow the user to set parameters, but our tool makes this exploration simpler.
2. Related Work

Text-to-image models are not the first type of algorithm used to create content in a seemingly automatic fashion. For instance, procedural content generators have been around in computer graphics for a long time. Procedural content generators typically involve mathematical formulas, such as fractals or Perlin noise, to generate e.g. trees and landscapes. These techniques are commonly used in game production.

It has been argued that the use of such techniques is not widely understood, but rather seen as magic. In order to combat this, researchers developed Danesh [4]. It allows users of Unity to explore a procedural generator's distribution space, as well as "automatically searching the parameter space for configurations that produce a specific outcome" (ibid). It simplifies the ability for humans to co-create with the procedural generator.

The use of deep learning techniques is, however, more recent. GANs (generative adversarial networks) have dominated as the main technique for image generation. First introduced in 2014, they have been shown to be able to generate realistic images, such as faces, from a training dataset [5]. At last year's workshop on HAI-GEN, Grabe et al. presented a framework for co-creativity using GANs [6]. They identified four interaction patterns that applications implementing GANs allow when used in co-creation between humans and the GAN.

Today's diffusion text-to-image models differ from the typical GAN in important ways. GANs are trained on datasets and then sampled from to generate images like those in the training dataset. While there are conditional GANs, and GANs are parameterised in other ways that allow some control, they do not allow the same kind of control as e.g. SD. Instead, SD is pre-trained on a wide range of images, from which images can be sampled by describing the desired image in text (conditioning). This means that the use of these models in co-creation differs in important ways from using GANs as described by Grabe et al. Of their four interaction patterns, we would argue that the typical use of SD involves no curation, but instead a lot of exploring. As we understand their conditioning pattern, it refers to a stricter form of conditioning than how SD is conditioned. The use of diffusion models is better described as a combination of exploring and conditioning. Our tool is one way of opening up the means for doing this exploring and conditioning.

2.1. Stable Diffusion UIs

There are now several online services available that let users generate images using different versions of SD. DreamStudio (dreamstudio.ai), Stable Diffusion Online (stablediffusionweb.com), as well as Hugging Face (https://huggingface.co/spaces/stabilityai/stable-diffusion) offer ways for users to try SD for free. They all take a text prompt and render images. They also let you customise other parameters, such as cfg scale and steps (explained later). These services make the model available for anyone to use.

It is also possible to run SD locally by downloading the model, and Hugging Face has integrated it into their libraries for ease of use. Developers have also implemented UIs on top of these libraries, the most popular of which is Stable Diffusion Web UI (WebUI) (https://github.com/AUTOMATIC1111/stable-diffusion-webui). WebUI aims to make all possible parameters available to users in an all-encompassing UI. As can be seen in Figure 1, the UI is composed of text boxes, sliders, and radio buttons. While this makes the options available, such controls do not make them intuitive. This is not necessarily negative, since WebUI does not aim to create an intuitive interface for using Stable Diffusion. However, to explore e.g.
the effect of a parameter, the user has to manually change the parameter and generate a new set of images with that setting. While it is possible to create an X/Y plot varying two parameter values, this forces the user to generate a large array of images in one go, rendering the exploration process less active, where the process of analysing the change becomes retroactive rather than proactive.

Figure 1. A screenshot of the WebUI main interface, illustrating the layout and components of the UI.

2.2. Inner Works of Stable Diffusion

Next we describe some of the inner workings of Stable Diffusion. The purpose here is not to describe it in detail, but to make some of its components visible so that they can be referred to in later sections.

Stable Diffusion (SD) is a latent text-to-image diffusion model [3]. Essentially, a model is trained to predict the noise in an image, such that this noise can be removed from the noisy image. The result is that an image can be generated from pure noise by gradually removing more and more noise, until an image appears out of the noise. The model is conditioned and guided with a text prompt such that the resulting image can be described by the text prompt.

In order to accomplish this, SD consists of three components: a variational autoencoder (VAE), a noise prediction model, and a text encoder. The VAE is used to convert a high-dimensional image into a lower-dimensional latent space, and to convert the latent representation back into image space. The text encoder is used to turn a text prompt into a text embedding. As its text encoder, version one of SD uses CLIP from OpenAI [7], whereas version two uses OpenCLIP [8]. These encoders are trained such that the embedding of a text description is close to the embedding of an image matching that description.

SD is trained with 1000 diffusion steps, but during inference you can choose how many steps to use to denoise the image. More steps generally produce a higher-quality image, but take longer than fewer steps, so users must choose a step count that balances speed and quality.

The whole process can be described by the pseudo-code in Figure 2. First, the text encoding and the initial noise are created. Then, over a number of timesteps, we ask the model to predict both conditioned and unconditioned noise in the latent image. Line 8 implements classifier-free guidance [9]. Line 10 removes the predicted noise from the latent image and adds noise for the current timestep; at the last timestep it adds no noise. Finally, the latent image is decoded by the VAE to create the final image. The exact implementation of gen_noise, line 10, and the timesteps depends on the algorithm used for the reverse diffusion process, e.g. DDPM [10], DDIM [11], or LMS [12]. In this work we use LMS.

Figure 2. Pseudo-code listing of generating images using Stable Diffusion.
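Figure 2 itself is not reproduced in this text. As a rough stand-in, the listing below sketches the same sampling loop written against Hugging Face's diffusers library; the checkpoint name, default values, and exact component calls are our assumptions for illustration and not the code used by Stable Walk.

# Minimal sketch of the sampling loop described above, using the diffusers
# library. Assumes a Stable Diffusion v1 checkpoint; details are illustrative.
import torch
from diffusers import AutoencoderKL, LMSDiscreteScheduler, UNet2DConditionModel
from transformers import CLIPTextModel, CLIPTokenizer

model_id = "runwayml/stable-diffusion-v1-5"  # assumed checkpoint name
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
scheduler = LMSDiscreteScheduler.from_pretrained(model_id, subfolder="scheduler")

def encode(text):
    # Turn a prompt into a text embedding with the CLIP text encoder.
    tokens = tokenizer(text, padding="max_length",
                       max_length=tokenizer.model_max_length,
                       truncation=True, return_tensors="pt")
    return text_encoder(tokens.input_ids)[0]

@torch.no_grad()
def generate(prompt, steps=50, cfg=7.5, seed=0):
    # Text encoding for the prompt, plus an empty prompt for the
    # unconditioned branch used by classifier-free guidance.
    cond, uncond = encode(prompt), encode("")

    # Initial noise in latent space (64x64 latents for 512x512 images).
    generator = torch.manual_seed(seed)
    latents = torch.randn((1, unet.config.in_channels, 64, 64),
                          generator=generator)
    scheduler.set_timesteps(steps)
    latents = latents * scheduler.init_noise_sigma

    for t in scheduler.timesteps:
        latent_in = scheduler.scale_model_input(latents, t)
        # Predict conditioned and unconditioned noise.
        noise_uncond = unet(latent_in, t, encoder_hidden_states=uncond).sample
        noise_cond = unet(latent_in, t, encoder_hidden_states=cond).sample
        # Classifier-free guidance (cf. line 8 of Figure 2).
        noise = noise_uncond + cfg * (noise_cond - noise_uncond)
        # Remove the predicted noise and re-noise for the next timestep
        # (cf. line 10 of Figure 2); here the scheduler implements LMS.
        latents = scheduler.step(noise, t, latents).prev_sample

    # Decode the final latent with the VAE (0.18215 is SD's latent scale).
    image = vae.decode(latents / 0.18215).sample
    return image

Changing the steps, cfg, seed, or text encoding in this loop is precisely the parameter space that Stable Walk exposes.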
While the text encoder, UNet, and VAE are pretrained and given by SD, there are certain parameters available for developers and users to set. From the pseudo-code it should be clear that changing the timesteps, cfg, and text_encoding will have an impact on the output, but it is not clear what that impact will be. This paper presents a novel UI for exploring this parameter space in terms of the output of SD. Our tool helps build that understanding through a visual user interface, which we call Stable Walk.

3. Stable Walk

In order to generate images using SD, the most straightforward interface is the text prompt. Users type in a line of text and the output is an image. However, to get better control of the output, the diffusion process takes a number of parameters that users can change. For example, we can change the number of diffusion steps taken. Picking 100 steps instead of 10 typically generates higher-quality images, but takes 10 times as long to produce. Changing the classifier-free guidance scale changes other aspects of the image. In more advanced tools, these parameters are available as simple inputs. To get a better understanding of the impact of these parameters, users must explore the output while varying them.

We therefore created a tool that lets users explore the parameter space more visually through the outputs of SD. The tool is web-based and interacts with SD through a custom API. In our setup the API runs on a server with an RTX 3090 graphics card that renders an image in just over a second on average.

The web interface has two tabs, Grid and Canvas, and under each tab you can vary parameters in different ways. Both tabs share some controls: a base prompt that is added to all prompts; a negative prompt used for negative conditioning; and a seed for the random number generator.

3.1. Grid

In the grid view the user can choose to generate images in a grid from a prompt. Positions in the grid determine the values of cfg and steps. The user can choose the number of grid positions, and thus the difference in value between adjacent positions. The row in the grid determines steps, and the column determines cfg: the top row is set to 4 steps and the bottom row to 100 steps, while the leftmost column sets cfg to 4 and the rightmost to 20. These values were chosen empirically to span a variety of outputs.

Initially the grid shows placeholder images. By clicking on an image, a request is sent to the server, which generates the image with the given parameters. To explore how changes in cfg and/or steps affect the output, the user selects images in the grid, taking either small or big steps across it. This enables users to actively traverse the parameter space and explore what images the algorithm generates, as shown in Figure 3.

Figure 3. Grid view with prompt “Starry night in style of monet”.
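To make the mapping from grid position to parameters concrete, a grid cell can be translated into values by linear interpolation over the ranges above. The function below is a sketch under that assumption, not the tool's exact implementation.

def grid_to_params(row, col, n_rows, n_cols,
                   steps_range=(4, 100), cfg_range=(4.0, 20.0)):
    """Map a grid cell to (steps, cfg).

    Rows interpolate the number of inference steps (top row = 4,
    bottom row = 100); columns interpolate the guidance scale
    (leftmost = 4, rightmost = 20). The ranges mirror the values
    reported above; everything else is an illustrative assumption.
    """
    r = row / (n_rows - 1) if n_rows > 1 else 0.0
    c = col / (n_cols - 1) if n_cols > 1 else 0.0
    steps = round(steps_range[0] + r * (steps_range[1] - steps_range[0]))
    cfg = cfg_range[0] + c * (cfg_range[1] - cfg_range[0])
    return steps, cfg

# Example: the centre cell of a 5x5 grid.
print(grid_to_params(2, 2, 5, 5))   # -> (52, 12.0)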
3.2. Canvas - exploring text embedding space

In the Canvas tab we explore another parameter space, namely that of text embeddings. The UI consists of an infinite pannable and zoomable canvas. The user starts by adding prompts, which can be placed freely onto the canvas. Once a prompt is placed on the canvas, an image is generated and shown there. Once a few prompts have been placed, the user can generate images that are combinations of these prompts by clicking on a point between them.

Figure 4. The UI of Canvas, showing three prompts “Cat”, “Owl”, and “Hedgehog” laid out in three corners of a triangle. Seven images have been generated as combinations of those three prompts.

The combinations are generated by calculating a new text embedding from the text embeddings of the existing prompts. This is done by taking a linear combination of the embeddings, where the weights correspond to the relative position of the clicked point on the canvas in relation to the prompts. More formally, if the prompts are placed on the canvas at 2D coordinates p_i, and the target point (the 2D coordinate of the mouse cursor) is t, then we find scalar weights w_i such that

t = sum_i(w_i * p_i), with sum_i(w_i) = 1.

The target text embedding e is then calculated from the text embeddings pe_i of the prompts according to

e = sum_i(w_i * pe_i).

By moving the target point closer to a particular prompt, the intuition is that the resulting embedding will contain more of that prompt and less of the others.

By clicking on the canvas, a new image is generated at that position, using the weights corresponding to the position. In effect, clicking on the canvas samples the embedding space between the entered prompts. Zooming in on the canvas makes it easy to make very fine-tuned transitions between weight values.
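The sketch below shows one way to compute the weights and the combined embedding with NumPy. For three prompts this reduces to barycentric coordinates; the least-squares solve is our illustrative choice rather than the tool's actual solver.

import numpy as np

def combine_embeddings(points, embeddings, target):
    """Combine prompt embeddings according to a point on the canvas.

    points:      (n, 2) array of prompt positions p_i on the canvas
    embeddings:  (n, d) array of text embeddings pe_i
    target:      (2,) canvas coordinate t of the clicked point

    Solves t = sum_i(w_i * p_i) subject to sum_i(w_i) = 1 in a
    least-squares sense, then returns e = sum_i(w_i * pe_i).
    """
    points = np.asarray(points, dtype=float)
    embeddings = np.asarray(embeddings, dtype=float)
    n = len(points)

    # Stack the two position equations and the sum-to-one constraint.
    A = np.vstack([points.T, np.ones(n)])          # shape (3, n)
    b = np.array([target[0], target[1], 1.0])
    w, *_ = np.linalg.lstsq(A, b, rcond=None)      # scalar weights w_i

    return w @ embeddings                          # e = sum_i(w_i * pe_i)

# Example: three prompts at the corners of a triangle, 4-dim dummy embeddings.
p = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
pe = np.eye(3, 4)
print(combine_embeddings(p, pe, target=(0.25, 0.25)))  # -> [0.5 0.25 0.25 0.]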
4. Discussion and use of Stable Walk

Using Stable Diffusion and other text-to-image models is certainly human-AI co-creation. Through prompt engineering, users can craft prompts that guide the model towards an outcome in an iterative way. With more practice, users become better at knowing which prompts lead to more desirable outcomes. While it is easy to generate a generic image of a cat, getting a cat in a particular style, in a particular setting, with a particular look and feel, requires more nuance. Further, crafting prompts will only get the user so far. There are more parameters than prompts, such as cfg and step count, as we have discussed in this paper. By looking inside these algorithms we may also find a bigger parameter space than what the standard tools make available. In this work we have started looking into a UI that allows users to quickly explore cfg and steps, but also to go deeper into the space of text embeddings.

The grid view enables a quick way of exploring the effect of cfg and steps for a particular prompt. Other tools, such as WebUI, enable this in other ways. In WebUI you can generate a grid of images and choose parameters to vary over rows and columns, similar to Stable Walk. This comes at a cost, since you have to generate images for the entire grid. While it makes for a more exhaustive search, as it enumerates the entire grid, we find that doing this more manually (by tapping each image in a grid) in Stable Walk makes the search an active process, which aids learning. While being given a full grid of images directly might let you find a desired output quickly, it does not let you actively consider how the outputs vary with the parameters. We think making this search more explicit, and an active effort on the part of the user, aids exploration beyond the current prompt.

4.1. Exploring text embeddings

One way to interpret the linear combinations of text embeddings is to consider the text embeddings as vector representations of each text prompt. These vector representations are such that prompts with similar semantic meaning are close to each other, while those with different meanings are further apart. By taking linear combinations of text prompts, we allow ourselves to move between two or more points in this embedding space. The question is how the model will interpret more ambiguous embeddings, such as one described by a point between the prompts “ant” and “horse”. Such an embedding gives much more control than simply prompting “ant horse”, and allows for fine-grained exploration. An example of an image generated from a text embedding between “ant” and “horse” is shown in Figure 5.

Figure 5. An image generated from 50% ant, 50% horse.

In our own explorations we have found that animals, people, and cities work particularly well. For instance, the combination of cat, owl, and hedgehog builds an expressive space of images from which to sample (an example can be seen in Figure 6). Prompts that are semantically further apart, such as an animal and a person, have interesting properties. We find that there are particular points in the space where there is a sharp shift in outputs. For example, generating images from text embeddings between “lobster” and “donald trump” mostly outputs lobsters when close to “lobster” and faces when close to “donald trump”, but somewhere in between there is a point around which most of the different images are rendered. This is in contrast to outputs between cat and dog, which show a smoother transition from cat to dog.

Figure 6. A combination of cat, owl, and hedgehog.

5. Future Work

While in this work we have focussed on cfg, steps, and text embeddings, we want to continue this work to incorporate other aspects of the diffusion process as well. For example, can we control the output by parameterising cfg over each timestep in the process? In this way the user could choose a high cfg at the beginning of the process and lower it as the process comes to completion (a rough sketch of this idea is given below). The same thing could be done with the weights of the text embedding: one could start the process from one prompt and then gradually move towards a different prompt by moving through text embedding space over the course of the denoising process. We also intend to look into ways to create a UI on top of the prompt-to-prompt technique, which adds the capability to edit images by modifying prompts directly [13].
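As a rough sketch of the first of these ideas, and assuming the sampling loop sketched earlier, a per-timestep guidance scale could look as follows. This is a speculative illustration, not a feature of Stable Walk.

def cfg_schedule(step, total_steps, cfg_start=15.0, cfg_end=4.0):
    """Linearly decay the guidance scale over the denoising process.

    Starts with a high cfg early on and lowers it as the process comes
    to completion; the start and end values are illustrative only.
    """
    frac = step / max(total_steps - 1, 1)
    return cfg_start + frac * (cfg_end - cfg_start)

# Inside the denoising loop sketched earlier, replace the fixed cfg with:
#   cfg_t = cfg_schedule(i, len(scheduler.timesteps))
#   noise = noise_uncond + cfg_t * (noise_cond - noise_uncond)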
"Towards a Framework for Human-AI Interaction Patterns in Co-Creative GAN Applications." (2022). [7] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry et al. "Learning transferable visual models from natural language supervision." In International Conference on Machine Learning, pp. 8748-8763. PMLR, 2021. [8] Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. "Reproducible scaling laws for contrastive language-image learning." arXiv preprint arXiv:2212.07143 (2022). [9] Jonathan Ho, and Tim Salimans. "Classifier- free diffusion guidance." arXiv preprint arXiv:2207.12598 (2022). [10] Jonathan Ho, Ajay Jain, and Pieter Abbeel. "Denoising diffusion probabilistic models." Advances in Neural Information Processing Systems 33 (2020): 6840-6851. [11] Jiaming Song, Chenlin Meng, and Stefano Ermon. "Denoising diffusion implicit models." arXiv preprint arXiv:2010.02502 (2020). [12] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. "Elucidating the Design Space of Diffusion-Based Generative Models." arXiv preprint arXiv:2206.00364 (2022). [13] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. "Prompt-to-prompt image editing with cross attention control." arXiv preprint arXiv:2208.01626 (2022).