<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>How to Blend Concepts in Diffusion Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lorenzo Olearo</string-name>
          <email>lorenzo.olearo@unimib.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giorgio Longari</string-name>
          <email>giorgio.longari@unimib.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simone Melzi</string-name>
          <email>simone.melzi@unimib.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Raganato</string-name>
          <email>alessandro.raganato@unimib.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rafael Peñaloza</string-name>
          <email>rafael.penaloza@unimib.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Milano-Bicocca</institution>
          ,
          <addr-line>Milan</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>For the last decade, there has been a push to use multi-dimensional (latent) spaces to represent concepts; and yet how to manipulate these concepts or reason with them remains largely unclear. Some recent methods exploit multiple latent representations and their connection, making this research question even more entangled. Our goal is to understand how operations in the latent space affect the underlying concepts. We hence explore the task of concept blending through diffusion models. Diffusion models are based on a connection between a latent representation of textual prompts and a latent space that enables image reconstruction and generation. This task allows us to try different text-based combination strategies, and evaluate them visually. Our conclusion is that concept blending through space manipulation is possible, although the best strategy depends on the context.</p>
      </abstract>
      <kwd-group>
<kwd>Concept blending</kwd>
        <kwd>Generative AI</kwd>
<kwd>Diffusion models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The field of knowledge representation deals with the task of representing the knowledge of a domain in a manner that can be used for intelligent applications [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Over the decades, most of the progress in the field has focused on logic-based knowledge representation languages and their reasoning capabilities. In this setting, concepts—the first-class citizens of any domain representation—are formalised by limiting the interpretations that they can be assigned to, and their connections with other concepts. A different, more implicit approach represents concepts as points (or sometimes volumes) in a multidimensional, so-called latent space. This representation (or embedding) is built considering the semantic similarities and differences between concepts. Although at an abstract level this representation is similar to Gärdenfors's conceptual spaces [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], there are essential differences between the two; most notably, concept composition cannot be achieved through simple set operations. Hence, while the use of the latent space is becoming more common, it remains unclear how to navigate it and how to reason within this representation.
      </p>
      <p>
        Our overarching goal is to understand the properties of the latent space and how different operations within it affect the underlying concepts. It is usually understood that every point in the latent space represents a concept, and thus navigating it has the potential of creating new concepts. In this paper, we focus on the question of concept blending; briefly, the task of creating new concepts by combining (“blending”) the properties of two or more concepts [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] (see Section 2 for more details). We explore the possibility of constructing such blends through (text-to-image) diffusion models starting from textual prompts describing the concepts. This choice is motivated by, first, the easy access to the latent space through the textual prompts and, second, the ability to evaluate the quality of the results visually.
      </p>
      <p>
        We study different strategies for concept blending which exploit the overall architecture of Stable Diffusion (SD) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. None of these methods relies on further training or fine-tuning; they all focus on the topology of the latent space and the SD architecture. In general, visual depictions of concept blends can be automatically generated through these techniques, although their quality may vary. An empirical study was used to evaluate the relative performance of each method. The results suggest that there is no absolute best method; rather, the choice of blending approach depends on the concepts being combined. The task of concept blending considered here is a milestone towards our general goal of understanding the properties of the latent space and how navigating it affects the underlying concepts. This understanding will be useful towards an explainable use of latent spaces and embeddings in general.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Preliminaries</title>
      <p>Human creativity has always been the key to the innovation process, giving us the possibility to imagine things which are yet to be discovered, and diverging scenarios to explore. In recent years, the field of artificial intelligence (AI) has been revolutionized by generative models, which are capable of creating new and original content by exploiting the countless examples these models have been trained on.</p>
      <sec id="sec-2-3">
        <p>Among the multiple variants and possibilities to exploit generative AI, diffusion models like Stable Diffusion [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], DALL-E, or Midjourney produce as output original images based on textual prompts or images given as input to the model. To provide clarity for this work, we introduce the fundamental concepts and components of Stable Diffusion and the notion of concept blending.</p>
        <sec id="sec-2-3-1">
          <title>2.1. Stable Difusion</title>
          <p>
            Stable Diffusion (SD) is a text-to-image generative model developed by Rombach et al. in 2022 [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ], which follows the typical architecture of diffusion models [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ], comprising a forward and a backward process. In the forward process, a clean sample from the data (in this case, an image) is sequentially corrupted by random noise, reaching, after a defined number of steps, pure random noise.
          </p>
          <p>
            In the backward process, a neural network is trained to sequentially remove the noise, thereby restoring the clean data distribution; this is the main phase intervening during image generation. The Stable Diffusion network architecture utilized during the backward phase is principally made up of (i) a Variational Autoencoder (VAE) [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ], (ii) a U-Net [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ], and (iii) an optional text encoder. The VAE characterizes SD as a Latent Diffusion Model, mapping images into a lower-dimensional space through an encoder, followed by a diffusion model to craft the distribution over encoded images. The images are then represented as points in the latent space. Afterwards, a decoder is needed to convert a point back into an image.
          </p>
          <p>The U-Net is composed of an encoder-decoder pair, where the bottleneck contains the compact embedding representation of the images. The encoder maps the input samples, according to the given prompt embedding, into this latent embedding; the decoder then processes this latent embedding together with the prompt embedding to reconstruct a sample that is as close as possible to the original one. The U-Net and text embedding are crucial in conditioning the output generated by the model. At each step of the denoising process, the prompt embedding is injected into the three blocks of the U-Net via a cross-attention mechanism. In this way, the textual prompt conditions the denoising process and in turn the generation of an image. The prompt embedding is generated by the text encoder, following the pipeline of SD 1.4. In our experiments, as text encoder, we adopt a pre-trained CLIP ViT-L/14.</p>
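<p>The conditioning step can be illustrated with a toy cross-attention computation, in which latent image tokens attend to the prompt embedding. This is a numpy sketch only: the dimensions are illustrative and the random projections stand in for the learned attention weights of the real U-Net.</p>

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(latent_tokens, prompt_tokens, d_attn=8, seed=0):
    """Toy cross-attention: image latents (queries) attend to the
    prompt embedding (keys/values), injecting text information.

    latent_tokens: (n_img, d_img) stand-in U-Net activations.
    prompt_tokens: (n_txt, d_txt) stand-in prompt embedding.
    The random projection matrices stand in for learned weights.
    """
    rng = np.random.default_rng(seed)
    W_q = rng.standard_normal((latent_tokens.shape[1], d_attn))
    W_k = rng.standard_normal((prompt_tokens.shape[1], d_attn))
    W_v = rng.standard_normal((prompt_tokens.shape[1], d_attn))
    Q = latent_tokens @ W_q
    K = prompt_tokens @ W_k
    V = prompt_tokens @ W_v
    attn = softmax(Q @ K.T / np.sqrt(d_attn))  # each latent token weighs prompt tokens
    return attn @ V                            # (n_img, d_attn) text-conditioned update

# 16 latent tokens conditioned on a 4-token prompt embedding
out = cross_attention(np.ones((16, 32)), np.ones((4, 20)))
assert out.shape == (16, 8)
```

In SD this operation happens in every attention layer of the encoder, bottleneck, and decoder blocks, once per denoising step.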
          <p>With these details we can now establish the focus of this study through the research question: can diffusion models produce visual blends of two concepts? Identifying each concept through a word, we want to create a new image that simultaneously represents a combination of both, simulating the human capacity for associative thinking. To address this problem, we present various methodologies leveraging SD as the backbone of our experiments. But first, we explain the notion of concept blending.</p>
        </sec>
      </sec>
      <sec id="sec-2-4">
        <sec id="sec-2-4-1">
          <title>2.2. Concept Blending</title>
          <p>
            Blending represents a cognitive mechanism that has been innately exploited to create new abstractions from familiar concepts [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ]. This process is often experienced in our daily interactions, even during a casual conversation. This conceptual framework has been studied over the past three decades [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ], offering a model that incorporates mental spaces and conceptual projection. It examines the dynamic formation of interconnected domains as discourse unfolds, aiming to discover fundamental principles that underlie the intricate logic of everyday interactions. In this context, a mental space is a temporary knowledge structure, which is dynamically created, for the purpose of local understanding, during a social interaction. It is composed of elements (concepts) and their interconnections. It is context-dependent and not necessarily a description of reality [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ]. This general notion can be specified into different notions. For our purpose, we are interested in visual conceptual blending, which combines aspects of conceptual blending and visual blending.
          </p>
          <p>
            Conceptual Blending constructs a partial match between two or more input mental spaces, and projects them into a new “blended” one [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ]. This blended space has common characteristics of the input spaces, allowing a mapping between its elements and their counterparts in each input space. Yet, it also generates a new emergent conceptual structure, which is unpredictable from the input spaces and not originally present in them. Therefore, blending occurs at the conceptual level. Representations of these blends are valuable and frequently employed in advertising [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ] and other domains [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ].
          </p>
          <p>
            The Visual Blending process, instead, is essential to generate new visual representations, such as images, through the fusion of at least two existing ones. There are two primary options for visual blending, according to the rendering style employed: photo-realistic rendering and non-photo-realistic techniques, like drawings. Approaches that focus on text-to-image generation have as their main goal the visual representation of concepts, and, in the case of blending, the typology can be summarized as a set of visual operations, as analyzed by Phillips and McQuarrie [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ]. One of these operators, called fusion, partially depicts and merges the different inputs to create a hybrid image, allowing for a higher coherence between the parts of the object(s), and helping viewers perceive the hybrid object as a unified whole. In replacement, one input concept is present and its sole function is to occupy the usual environment of the other concept, or to have its shape adapted to resemble the other input. Juxtaposition is a technique that involves placing two different elements side by side, to create a harmonious or provoking whole. Good examples of Visual Blending and different approaches to the operations described (and others) can be found in [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ]. Importantly, high-quality blends between concepts require that only some of the main characteristics of the input concepts are taken into account [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ]. Exploiting the three main visual properties of color, silhouette, and internal details helps the creator to obtain a great resulting blend. An image resulting from blending can be evaluated by taking into account the number of dimensions (or visual properties) over which the blend has been applied.
          </p>
          <p>
            Visual Conceptual Blending introduces a model for creating visually blended images grounded in
strong conceptual reasoning. Cunha et al. [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ] argue that visual conceptual blending goes beyond
simply merging two images: it emphasizes the importance of conceptual reasoning as the foundation of
the blending process, resulting in an image and accompanying conceptual elaborations. These blends
have context, are grounded in justifications, and can be named independently of the original concepts.
          </p>
        </sec>
      </sec>
      <sec id="sec-2-5">
        <p>In contrast, standard Visual Blending focuses solely on image merging, and typically involves mapping concepts to objects and integrating them while maintaining recognisability and inferential association.</p>
        <p>We now rephrase our research question as: can Stable Diffusion models merge two semantically distant concepts into a new image, practically performing a Visual (Conceptual) Blending operation? We investigate the efficacy of diffusion models, which are supposed to be able to recreate any image that can be imagined, in generating high-quality blended images. We assess existing approaches to perform blending with Stable Diffusion, and propose novel methods. To the best of our knowledge, this is the first investigation that evaluates the performance of different blending techniques with diffusion models using only textual prompts. We initially operate on the latent space where the textual prompts are embedded, and then explore alternative methods by directly manipulating the specific architecture of the diffusion model; more precisely, the U-Net conditioning phase is manipulated to edit the textual prompt that is injected (Section 3). To evaluate the results, we conducted a user survey where the subjects were asked to rank the outcomes of different blending tasks, divided into multiple categories.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Blending Methods with Stable Diffusion</title>
      <sec id="sec-3-1">
        <p>In this section, we briefly review some of the existing approaches for blending concepts with diffusion models. Some of these methods were already published in previous work [18], while others are available in public implementations, but without a full description of their details. We mention explicitly whenever we are unsure whether our implementation matches exactly the one proposed in the reference.</p>
        <p>
          Experimental setup. We fix the generative network M as Stable Diffusion v1.4 [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] with the UniPCMultistepScheduler [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] set at 25 steps. This version uses a fixed pretrained text encoder (CLIP ViT-L/14 [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]).
        </p>
      </sec>
      <sec id="sec-3-2">
        <p>All images are generated at 512x512 pixels, with the diffusion process carried out in FP16 precision in a latent space downscaled by a factor of 4. The conditioning signal is provided only in the form of textual prompts, and the guidance scale is set to 7.5. We focused on Stable Diffusion as a good trade-off between quality and computational cost; however, the blending methods analyzed can be implemented in other diffusion models with no latent downscaling. Our entire implementation of the blending methods in their respective pipelines, together with some of the generated samples, is openly available.1</p>
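<p>For reference, this experimental configuration corresponds, in the Hugging Face diffusers library, to a setup along the following lines. This is a configuration sketch, not our exact pipeline code; model download and device placement are omitted.</p>

```python
# Configuration sketch of the experimental setup (Hugging Face diffusers).
import torch
from diffusers import StableDiffusionPipeline, UniPCMultistepScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",   # SD v1.4, CLIP ViT-L/14 text encoder
    torch_dtype=torch.float16,         # FP16 diffusion
)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)

image = pipe(
    "a photo of a cat",                # text-only conditioning signal
    num_inference_steps=25,            # 25 scheduler steps
    guidance_scale=7.5,                # classifier-free guidance scale
    height=512, width=512,             # 512x512 output
).images[0]
```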
        <p>An important feature of many generative methods, which allows them to produce varying outputs for the same prompt, is the use of a pseudo-random number generator (and pseudo-random noise) which can be established through a seed. Given an input textual prompt p and a seed s, we denote by I(p, s) = M(p, s) the image generated by the model M given the input prompt p and the seed s. Prompts will usually be denoted with the letter p, sometimes with additional indices to distinguish between them; e.g., p1 and p2 when two different prompts are used simultaneously. Given a prompt p, p* denotes its latent representation; that is, the multi-dimensional vector obtained from the encoding operation. Similarly, p1* and p2* denote the latent representations of p1 and p2, respectively.</p>
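<p>This determinism can be sketched as follows: the starting noise, and therefore the output I(p, s), is a deterministic function of the seed. Here numpy's generator stands in for the pipeline's noise sampler, and the latent shape is illustrative.</p>

```python
import numpy as np

def initial_noise(seed, shape=(4, 64, 64)):
    """Deterministic starting noise for a given seed (latent-shaped stand-in)."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal(shape)

# Same seed, same starting noise: I(p, s) is reproducible for fixed p and s.
a = initial_noise(42)
b = initial_noise(42)
c = initial_noise(43)
assert np.array_equal(a, b)
assert not np.array_equal(a, c)
```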
        <sec id="sec-3-2-1">
          <title>3.1. Blending in the Prompt Latent Space (TEXTUAL)</title>
          <p>
            The first method examined was recently proposed by Melzi et al. [
            <xref ref-type="bibr" rid="ref18">18</xref>
            ]. It exploits the relationship between conceptual blending and vector operations within the prompt latent space. Given the two input prompts p1 and p2, we first compute their latent representations p1* and p2* through the prompt encoder. The blended latent vector is the Euclidean mean between p1* and p2*. The blended image is generated by conditioning SD with the blended latent vector.
          </p>
        </sec>
      </sec>
      <sec id="sec-3-4">
        <p>1. Project repository: https://github.com/LorenzoOlearo/blending-difusion-models</p>
        <p>Importantly, blending in the latent space representing the prompts does not correspond to blending images directly, as in a visual blending process. Instead, it means generating an image representing a specific fusion of the concepts provided as the input textual prompts. Indeed, the Euclidean mean between the two representations is a (potentially unexplored) point of the latent space which intuitively represents the concept that is closest to both input concepts, thus defining an “in-between” characterisation. Although in this paper we only consider the mean of the two latent representations of the input prompts, we highlight that Melzi et al. also consider other linear combinations of p1* and p2* to avoid fully symmetric constructions. A similar technique is implemented in the Compel open source library,2 which performs a weighted blend of two textual prompts.</p>
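<p>The TEXTUAL operation reduces to a linear combination of the two prompt embeddings. The following sketch uses stand-in vectors; in SD, e1 and e2 would be the CLIP encodings p1* and p2*.</p>

```python
import numpy as np

def textual_blend(e1, e2, alpha=0.5):
    """Linear blend of two prompt embeddings.

    alpha = 0.5 gives the Euclidean mean used here; other values give
    the asymmetric combinations considered by Melzi et al.
    """
    return alpha * e1 + (1.0 - alpha) * e2

e1 = np.array([1.0, 0.0, 2.0])   # stand-in for p1*
e2 = np.array([3.0, 4.0, 0.0])   # stand-in for p2*
blended = textual_blend(e1, e2)  # point "in between" the two concepts
assert np.allclose(blended, [2.0, 2.0, 1.0])
```

The resulting vector is then used in place of a single prompt embedding to condition the denoising process.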
        <sec id="sec-3-4-1">
          <title>3.2. Prompt Switching in the Iterative Diffusion Process (SWITCH)</title>
          <p>This blending technique involves switching the textual prompt during the iterative process of the diffusion model. The inference process first starts with a single prompt p1 and then, at a certain iteration, the prompt is switched to p2 until the end of the diffusion process. The generation is thus conditioned on both prompts, leading to an image that, when the switch is executed at the right timestep, blends the two concepts. Intuitively, SWITCH starts by generating the general shape of p1, but then fills out the details based on p2, thus producing a visual blend of the two concepts.</p>
        </sec>
      </sec>
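<p>The switching schedule can be sketched as a small helper mapping each denoising step to the prompt that conditions it. This is an illustrative sketch, not the actual pipeline code; in the implementation, the selected prompt determines which embedding is injected into the U-Net at that step.</p>

```python
def switch_schedule(num_steps, switch_at):
    """Prompt index (1 or 2) conditioning each denoising step under SWITCH.

    The first switch_at steps use prompt 1; the remaining steps use prompt 2.
    """
    return [1] * switch_at + [2] * (num_steps - switch_at)

sched = switch_schedule(num_steps=25, switch_at=10)
assert sched[:10] == [1] * 10   # general shape laid down from prompt 1
assert sched[10:] == [2] * 15   # details filled in from prompt 2
```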
      <sec id="sec-3-7">
        <p>It is crucial to choose the right iteration at which to switch the prompt. Unfortunately, this is an intrinsic challenge for each new image and does not depend only on the geometric distance between the p1* and p2* embeddings. From our experiments, we observed that the optimal iteration for this switch is directly related to the spatial similarity between the image generated by the model conditioned only on p1 and the one generated by p2. This technique was also implemented in the Stable Diffusion web UI developed by AUTOMATIC1111.3 Among its numerous functionalities, this implementation allows prompt editing during the mid-generation of an image.</p>
        <sec id="sec-3-7-1">
          <title>3.3. Alternating Prompts in the Iterative Diffusion Process (ALTERNATE)</title>
          <p>In general diffusion models, at each timestep defined by the scheduler of the diffusion process, the noise in the sample is estimated by the U-Net model. This estimation is performed by the model with knowledge of the timestep and the conditioning signal (i.e., the prompt). The Alternating Prompt technique conditions the U-Net with a different prompt at each timestep: the prompt p1 is shown to the U-Net at even timesteps, while p2 is shown at odd timesteps. By performing this alternating prompt technique, the diffusion pipeline can successfully generate an image that blends the two given prompts. Even though at different timesteps, the U-Net is conditioned by both prompts during the diffusion process. The blending ratio can be controlled by adjusting the number of iterations in which each prompt is shown to the U-Net. One can intuitively think of this approach as an alternating superposition of the generation process between p1 and p2. This method is also implemented in the Stable Diffusion web UI developed by AUTOMATIC1111.</p>
        </sec>
      </sec>
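<p>The alternating schedule, including an unequal-period variant that skews the blending ratio towards one prompt, can be sketched as follows (illustrative helper, not the pipeline code):</p>

```python
def alternate_schedule(num_steps, period1=1, period2=1):
    """Prompt index (1 or 2) conditioning each denoising step under ALTERNATE.

    With period1 = period2 = 1 the prompts strictly alternate (prompt 1 on
    even timesteps, prompt 2 on odd ones); unequal periods skew the
    blending ratio towards one of the prompts.
    """
    cycle = [1] * period1 + [2] * period2
    return [cycle[t % len(cycle)] for t in range(num_steps)]

assert alternate_schedule(6) == [1, 2, 1, 2, 1, 2]
assert alternate_schedule(6, period1=2) == [1, 1, 2, 1, 1, 2]
```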
      <sec id="sec-3-9">
        <p>2. Compel: https://github.com/damian0815/compel</p>
      </sec>
      <sec id="sec-3-10">
        <p>3. Stable Diffusion web UI: https://github.com/AUTOMATIC1111/stable-diffusion-webui</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Method</title>
      <p>We now propose a different blending paradigm to visually combine two textual prompts in the diffusion pipeline. In a standard diffusion architecture, given a single input prompt p, its corresponding embedding p* is injected via a cross-attention mechanism into the three main blocks of the U-Net: the encoder, the bottleneck, and the decoder. During the encoding and bottleneck steps, the p* embedding is used to guide the compression of the input sample into a latent representation that accurately maps the concept p that is being generated. Then, during the decoding phase, the p* embedding is used to guide the reconstruction of the sample towards the distribution of the concept being generated. Our idea arises from this compression and reconstruction operation and is described in the following subsection. To the best of our knowledge, this method has not been proposed before.</p>
      <sec id="sec-4-1">
        <title>Different Prompts in Encoder and Decoder Components of the U-Net (UNET)</title>
        <p>We implement our new method using text-based conditioning, but it can theoretically be extended to other conditioning domains. As described above, the U-Net architecture contains three main blocks: the encoder, the bottleneck, and the decoder. Each of these blocks receives the prompt embedding p* as input, together with the sample from which the noise has to be estimated.</p>
        <sec id="sec-4-1-1">
          <p>The key idea of our method involves guiding the compression of the sample into the bottleneck block with a first prompt embedding p1*. Then, we guide its reconstruction towards the distribution of the second prompt p2 by injecting the embedding p2* into the decoder block, as visualized in the figure. This allows the U-Net to construct a latent representation for the sample matching the concept described by p1 and then reconstruct the sample with the features of the second prompt p2.</p>
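<p>The block-wise conditioning can be summarised as a mapping from U-Net blocks to prompt embeddings. This is an illustrative sketch with stand-in values; in the implementation, the selected embedding is the one passed to each block's cross-attention layers.</p>

```python
def unet_conditioning(e1, e2):
    """UNET method: which prompt embedding conditions each U-Net block.

    e1, e2: embeddings of the first and second prompt (stand-ins here).
    The encoder and bottleneck are guided by the first prompt, while the
    decoder reconstructs towards the second one.
    """
    return {"encoder": e1, "bottleneck": e1, "decoder": e2}

cond = unet_conditioning("p1*", "p2*")
assert cond["encoder"] == "p1*" and cond["bottleneck"] == "p1*"
assert cond["decoder"] == "p2*"
```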
          <p>The expected result of this technique is an image that globally represents or recalls the concept described by p1 while simultaneously showing some of the features that typically describe the concept of the second prompt p2. From our findings, changing the prompt embedding in the bottleneck block does not significantly affect the final result. Consequently, we keep the prompt p1 in the encoder and bottleneck blocks while we inject the prompt p2 in the decoder block.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Validation and Results</title>
      <p>We now describe the experimental setting and the analysis performed to evaluate the four blending approaches presented in the previous sections, applied over two simple conceptual prompts. The outputs of these models can be visualized in Figures 2 and 3. The experiments aimed to assess the previously proposed blending methods across four distinct macro-categories, which are visually explained in Figure 2. The four categories are pairs of animals, object and animal, compound words, and real-life scenarios. These were selected to showcase different kinds of blending of concepts, which are expected to exhibit diverging properties. For pairs of animals, we expect that the shared characteristics between the concepts will aid the blending process; the use of object and animal concepts in the second category is expected to widen the semantic gap between the input prompts, leading to more “creative” artifacts. The third category considers objects representing compound words, offering a more conceptual blending challenge. Here, we observed how the methods responded to prompts comprising the compound’s constituent parts, which are not literal descriptions of the target object but are rather interpretable as a figure of speech or metaphor. We aimed to investigate whether the models would learn the necessary abstractness to perform a blending similar to the concept associated with the compound word, or reach a new visual blending that merges the characteristics of the two prompts. The last category draws inspiration from real-world visual blend examples, regardless of their underlying concepts, deriving prompts to condition the models, allowing us to investigate their adaptability and ability to reconstruct well-known blends.</p>
      <p>
To impartially evaluate the quality of the methods, we conducted a survey of 23 participants involving 24 images from the four categories described. The survey was constructed as follows. We first selected 24 concept pairs covering examples from the four macro-categories; each concept in the pair was described through a simple prompt. Then, the four different blending methods were used to generate the visual conceptual blend of each pair. All images were generated with the same size and quality, and presented to the users, with the instruction to rank them according to their blending effectiveness from best to worst. Our participant pool was carefully selected to ensure they had no prior experience with blending theory. While the two prompts used to generate the images were provided to the subjects, we deliberately withheld information regarding the model responsible for each image, eliminating potential bias. Additionally, to further mitigate bias, the order of images within each question was randomized for each participant. This approach aimed to discern whether a superior blending method existed among the four proposed and whether certain methods outperformed others within specific categories.</p>
      <sec id="sec-5-1">
        <p>For each question, the top four images proposed were selected by the authors from a pool generated using ten different seeds. Given that blending quality across all methods is not entirely independent of seed choice, we aimed to minimize this dependency by carefully selecting the best results. For a better understanding of the evaluation approach, Figure 2 shows some of the images that were presented to the subjects for ranking, along with the methods that produced them.</p>
        <p>Table 1 summarises the results of the survey, indicating the mean and mode (i.e., most frequent) rank given to each method for each prompt pair, and summarizing the results by category and globally. In both cases, a higher value means a lower quality blend as perceived by the subjects of the survey. The goal of this analysis is to understand which blending method performs better in general (for the global summary) and in a more fine-grained manner, by category and by prompt pair. We emphasise that the mean value should be handled with care, as a few low rankings (value 4) can greatly skew the mean of a method that is typically ranked high. Indeed, in the last row of the table we can observe that the average ranking of all methods throughout the whole experiment is quite similar, even though SWITCH is most frequently selected as the best method, and UNET as the worst. Worth noticing is also that the mode does not necessarily provide a full ranking between methods.</p>
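<p>The two summary statistics are the standard mean and mode. The following sketch, with hypothetical ranks, shows how a few rank-4 votes pull up the mean of a method whose mode is still 1:</p>

```python
from statistics import mean, mode

# Hypothetical ranks (1 = best, 4 = worst) given by survey subjects to one method
ranks = [1, 1, 1, 1, 2, 4, 4]

assert mode(ranks) == 1    # most frequently ranked best...
assert mean(ranks) == 2.0  # ...yet two rank-4 votes skew the mean towards 2
```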
        <p>In the next section we will discuss the merits of the presented blending methods and the results of the user survey; yet, for the moment we can already see that, at least from the perspective of the rankings given, there is no clear best blending approach: quality varies between images, and more broadly between categories. For instance, UNET was ranked fourth in three categories, but second for the category of real-life scenarios. Similarly, although UNET’s mode rank in compound words was 4, it was also the highest ranked in three of the prompt pairs in this category.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <p>Figure 3 shows the results of the four different blending methods with the prompts Frog-Lizard, Butter-Fly, Kung fu-Panda, Tortoise-Broccoli, and Tea-Pot. To better understand the behavior of each method, all images in each row were generated using the same seed, thus starting from the same random noise.</p>
      <sec id="sec-6-1">
        <title>Moreover, the blending ratio between the two prompts was kept constant at 0.5 across all methods.</title>
        <p>We measure the visual distance between two concepts by visually evaluating the spatial similarity of the
images generated when conditioning the pipeline on each of them. This is a key aspect to consider when evaluating
the quality of the blend since, with the exception of TEXTUAL, which instead focuses on the semantic
blend, it influences the performance of the blending methods.</p>
        <p>When it comes to logical blends, one often considers a main concept which is modified by a secondary
one. That is, the blended concept is primarily an instance of the main concept, but with some
characteristics that recall the secondary concept. With the exception of TEXTUAL, the blending methods
presented in this paper are not symmetric, meaning that the order of the prompts in the blend affects
the final image. This is particularly important when dealing with compound words like pitbull: although
this word commonly refers to a specific breed of dog, its intrinsic semantic and historical meaning
refers to a bull in a pit. When visually blending the two concepts pit and bull with the methods illustrated
in this paper, it is important to take into account which of the two concepts is the main one and which
is the modifier. By analyzing the results in Figure 3, it is evident that this primary-modifier relationship
is not coherent across all the analyzed methods. In TEXTUAL and ALTERNATE, the main concept of
the blend appears to be the second prompt, while its modifier is the first one. The opposite behavior
instead characterizes SWITCH and UNET, where the main concept of the blend is the first
prompt and the modifier is the second one. This behavior was not expected; to keep the experiments
straightforward, all blends were generated considering the first prompt as the main concept and the
second as its modifier. This is why, when blending the words that make up the compound word
Pitbull, the blend is generated as a Bull-Pit instead of a Pit-Bull.</p>
        <p>
          As expected from the work by Melzi et al. [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], performing the blending operation in the latent space
of the prompts, as in the case of TEXTUAL, does not always lead to an image that visually blends the
two concepts. This is particularly evident in the case of Kung fu-Panda, where the generated image is a
conceptual blend of the two prompts. From our findings, TEXTUAL usually produces inconsistent
results: although the conditioning embedding given to the pipeline always remains the same, the balance
between visual and semantic blending changes drastically from one seed to another. An instance of this
behavior can be observed in its Kung fu-Panda sample in Figure 2. In this case, the model generated
possibly the best visual blend out of the four methods; however, out of all the other seeds tested, no
other sample was able to achieve the same level of blending.</p>
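        <p>The TEXTUAL strategy discussed here amounts to a linear interpolation of the two prompt embeddings, which then conditions the entire diffusion run. The following sketch illustrates only this interpolation step on toy four-dimensional vectors; the embed dictionary is a stand-in for a real text encoder and not part of the actual pipeline.</p>

```python
# Toy stand-ins for text-encoder outputs; real embeddings would come from
# an encoder such as CLIP and have far more dimensions.
embed = {
    "Kung fu": [0.9, 0.1, 0.4, 0.0],
    "Panda":   [0.1, 0.9, 0.0, 0.6],
}

def blend_embeddings(e1, e2, ratio=0.5):
    """Interpolate element-wise; ratio 0.5 weighs both prompts equally."""
    return [(1.0 - ratio) * a + ratio * b for a, b in zip(e1, e2)]

# The blended vector (approximately [0.5, 0.5, 0.2, 0.3]) would be used as
# the single conditioning embedding for every denoising step.
cond = blend_embeddings(embed["Kung fu"], embed["Panda"])
print(cond)
```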
        <p>As mentioned already, results from SWITCH vary considerably depending on the timestep at which
the prompt is switched; finding the right timestep is crucial to achieving a good visual blend. This is
evident in the cases of Tea-Pot and Butter-Fly shown in Figure 3: the images generated from the prompts
Butter and Fly are visually distant even though both of them start from the same initial noise. When
the prompt is switched in the middle of the diffusion process, the model is unable to shift and correct
the existing distribution towards that of the new prompt, and only the first prompt is retained
in the blend. Another undesired behavior of SWITCH is the cartoonification of the produced blend.</p>
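        <p>To make the SWITCH schedule concrete, the following sketch mocks a denoising loop in which the conditioning prompt changes at a chosen timestep. The denoise_step function is a toy stand-in that only records which prompt conditioned each step; it is not an actual U-Net call, and the step counts are arbitrary.</p>

```python
TOTAL_STEPS = 10
SWITCH_AT = 5  # per-blend value found by trial and error, as noted above

def denoise_step(history, prompt, t):
    # Toy stand-in for one U-Net denoising step: record the conditioning.
    history.append((t, prompt))
    return history

def switch_blend(prompt_a, prompt_b, switch_at=SWITCH_AT):
    history = []
    for t in range(TOTAL_STEPS):
        # Condition on the second prompt once the switch timestep is reached.
        prompt = prompt_b if t >= switch_at else prompt_a
        history = denoise_step(history, prompt, t)
    return history

history = switch_blend("Tea", "Pot")
```

        <p>Choosing switch_at too late reproduces the failure described above: the distribution is already committed to the first prompt and the second cannot correct it.</p>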
      </sec>
      <sec id="sec-6-2">
        <p>The diffusion pipeline, when unable to shift the pixel distribution towards the new prompt, corrects
the existing noisy image latent by progressively removing its high-frequency details, resulting in a
cartoonish image. This behavior can be clearly observed in the Kung fu-Panda blend produced by
SWITCH in Figure 3. From our experimental results, this behavior does not affect the other methods.</p>
        <p>The ALTERNATE method, which alternates between the two prompts at each timestep, tends to
produce consistent results when the two blended concepts are visually very different. What is arguably
even more interesting is the type of visual blend that this technique produces when the two concepts
are both visually and semantically very different. This is the case for Tea-Pot and Butter-Fly, where
the model creates an image that literally and spatially contains both the first and the second prompt.
This is also evident in the Bull-Pit blend in Figure 2, where ALTERNATE generates what could be
described as a bull in a pit. TEXTUAL also seems to produce similar results but, once again, it is too
inconsistent across the seed space to state this as a general rule.</p>
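        <p>A minimal sketch of the ALTERNATE schedule, with the denoising loop again reduced to recording which prompt conditions each timestep (an illustration with an arbitrary step count, not the actual pipeline code):</p>

```python
def alternate_blend(prompt_a, prompt_b, steps=10):
    # Toy denoising loop: only the conditioning schedule is recorded.
    schedule = []
    for t in range(steps):
        # Even timesteps are conditioned on the first prompt,
        # odd timesteps on the second, so both concepts repeatedly
        # steer the same evolving latent.
        prompt = prompt_a if t % 2 == 0 else prompt_b
        schedule.append((t, prompt))
    return schedule

schedule = alternate_blend("Butter", "Fly")
```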
        <p>Compared to the other approaches, the UNET method, which encodes the image latent in the U-Net
conditioned on the first prompt and then decodes it conditioned on the second, produces more subtle blends
but generally consistent results. This might be the reason why this is the blending method that performs
worst in the survey, as the visual blend is not as evident as in the other methods. Interestingly, on the
Kung fu-Panda blend, UNET seems to slightly change the visual representation of the first prompt
while matching the colors of the second one. This subtle blending is also evident in the Bull-Pit blend
of Figure 2, where, surprisingly, the pipeline creates an image that somewhat resembles a pitbull.</p>
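        <p>A minimal sketch of the UNET strategy, which splits the conditioning within a single forward pass: the contracting (encoder) blocks see the first prompt and the expanding (decoder) blocks see the second. The block count and the conditioning assigned to the bottleneck are assumptions of this illustration, which only records the conditioning each block receives.</p>

```python
def unet_forward(cond_encoder, cond_decoder, n_blocks=3):
    # Toy U-Net forward pass: record which conditioning each block sees.
    trace = []
    for i in range(n_blocks):                # contracting (encoder) path
        trace.append(("down", i, cond_encoder))
    trace.append(("mid", 0, cond_encoder))   # bottleneck (an assumption here)
    for i in range(n_blocks):                # expanding (decoder) path
        trace.append(("up", i, cond_decoder))
    return trace

trace = unet_forward("Pit", "Bull")
```

        <p>Because only the decoder half sees the second prompt, the blend it contributes tends to be subtler than in the methods that switch or alternate the conditioning across timesteps.</p>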
        <p>The results of the survey summarized in Table 1 show that the most preferred method is SWITCH;
however, this comes with some caveats. In order to better represent each method, for the survey
we chose the best settings for each one; in the case of SWITCH, this translates into using the
optimal timestep at which to switch the prompt for each blend. Finding this value is a tedious process
of trial and error, with no clear, empirical way to determine it. Although UNET ranked
the lowest in the survey, when comparing its results with those of SWITCH with a fixed switch
timestep in the middle of the diffusion process (Figure 3), it is evident that the visual blends produced by
these two methods are generally similar, if not better in the case of UNET.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions</title>
      <p>
        Through this paper we tried to answer a novel research question: is it possible to produce visual
concept blends through diffusion models? We compared different possible solutions for forcing a diffusion
model (more specifically, Stable Diffusion [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]) to generate content that represents the blend of two
separate concepts. We collected three different alternatives from existing publications and from the
web. Additionally, we propose a completely new method, which we call UNET, that exploits the internal
architecture of the adopted diffusion model. We collected the outputs of the different methods on four
different categories of tests, namely: pairs of animals, animal and object, compound words, and real-life
scenarios. For each of these categories we produced several different pairs of concepts, and generated
all blends (in total, four blended images for each pair of prompts).
      </p>
      <sec id="sec-7-1">
        <p>The quality of a blend, as with any creative endeavor, has a subjective component to it. Thus, to evaluate
which approach is more adept at this task (in relation to human perception) we devised a user study
answered by 23 subjects. In it, participants were asked to rank the results of the blending methods.
It is worth noting that two participants did not rank all methods, but 21 full surveys were submitted.
We still used the partial surveys to compare those pairs where the ranks were available.</p>
      </sec>
      <sec id="sec-7-4">
        <p>From the user study, it results that there is no single best blending method; the perceived quality
varies from pair to pair and, more importantly, from category to category. And yet, from a positive
perspective, we can answer our research question in the affirmative: it is possible to produce visual
conceptual blends through diffusion models, and the results are often quite compelling (see the samples
in Figure 2). Indeed, the survey participants expressed surprise at some of them.</p>
      </sec>
      <sec id="sec-7-5">
        <p>An important point to make is that, for this work, we used the latent space of Stable Diffusion
directly, that is, without any kind of fine-tuning or added training. Thus, our results are less fragile
to model updates, and do not require significant effort to implement and execute. This is consistent
with our original stated goal of understanding how to manipulate the latent space as a representation
of concepts. This work only scratches the surface of this topic and we hope that it can inspire new
discussion and further analysis.</p>
      </sec>
      <sec id="sec-7-6">
        <p>For future work, note that our blends are based on very simple (mainly one-word) prompts. This
allows us to better understand the impact of the operations (in contrast to the subtleties of
prompt engineering) but has the disadvantage of working over very general concepts and, in
particular, of being prone to ambiguities and misinterpretations. It would thus be interesting to explore ways
to guarantee a more specific identification of the concepts selected for blending.</p>
      </sec>
      <sec id="sec-7-7">
        <p>Work funded by the European Union–Next Generation EU within the project NRPP M4C2, Investment 1.3,
DD. 341-15 March 2022–FAIR: Future Artificial Intelligence Research – Spoke 4-PE00000013
D53C22002380006. Part of this work was supported by the MUR for REGAINS, the Department of
Excellence DISCo at the University of Milano-Bicocca, the PRIN project PINPOINT Prot. 2020FNEB27,
CUP H45E21000210001, and by the NVIDIA Corporation with the RTX A5000 GPUs granted through the Academic Hardware Grant Program to the University of Milano-Bicocca for the project “Learned representations for implicit binary operations on real-world 2D-3D data.”</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Brachman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Levesque</surname>
          </string-name>
          ,
          <article-title>Knowledge representation and reasoning</article-title>
          , Morgan Kaufmann,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Gardenfors</surname>
          </string-name>
          ,
          <article-title>Conceptual Spaces: The Geometry of Thought</article-title>
          , A Bradford Book
          , MIT Press,
          <year>2004</year>
          . URL: https://books.google.it/books?id=FSLFjw1EcBwC.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>G.</given-names>
            <surname>Fauconnier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Turner</surname>
          </string-name>
          ,
          <article-title>The Way We Think: Conceptual Blending And The Mind's Hidden Complexities</article-title>
          , Basic Books,
          <year>2008</year>
          . URL: https://books.google.it/books?id=FdOLriVyzwkC.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Rombach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Blattmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lorenz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Esser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ommer</surname>
          </string-name>
          ,
          <article-title>High-resolution image synthesis with latent diffusion models</article-title>
          ,
          <source>in: Proc. IEEE/CVF conf. on comp. vision and pattern recog.</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>10684</fpage>
          -
          <lpage>10695</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Podell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>English</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lacey</surname>
          </string-name>
          , et al.,
          <article-title>SDXL: Improving latent diffusion models for high-resolution image synthesis</article-title>
          ,
          <source>arXiv preprint arXiv:2307.01952</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Sohl-Dickstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Maheswaranathan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ganguli</surname>
          </string-name>
          ,
          <article-title>Deep unsupervised learning using nonequilibrium thermodynamics</article-title>
          ,
          <source>in: Proc. ICML'15</source>
          , volume
          <volume>37</volume>
          , PMLR, Lille, France,
          <year>2015</year>
          , pp.
          <fpage>2256</fpage>
          -
          <lpage>2265</lpage>
          . URL: https://proceedings.mlr.press/v37/sohl-dickstein15.html.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Welling</surname>
          </string-name>
          ,
          <article-title>Auto-encoding variational bayes</article-title>
          ,
          <source>arXiv preprint arXiv:1312.6114</source>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>O.</given-names>
            <surname>Ronneberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fischer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Brox</surname>
          </string-name>
          , U-net:
          <article-title>Convolutional networks for biomedical image segmentation</article-title>
          ,
          <source>in: Proc. MICCAI</source>
          <year>2015</year>
          , Springer,
          <year>2015</year>
          , pp.
          <fpage>234</fpage>
          -
          <lpage>241</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Confalonieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pease</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schorlemmer</surname>
          </string-name>
          , et al.,
          <article-title>Concept Invention: Foundations, Implementation, Social Aspects and Applications</article-title>
          ,
          <source>Computational Synthesis and Creative Systems</source>
          , Springer,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>F. J.</given-names>
            <surname>Costello</surname>
          </string-name>
          , M. T. Keane,
          <article-title>Efficient creativity: Constraint-guided conceptual combination</article-title>
          ,
          <source>Cognitive Science 24</source>
          (
          <year>2000</year>
          )
          <fpage>299</fpage>
          -
          <lpage>349</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>G.</given-names>
            <surname>Fauconnier</surname>
          </string-name>
          ,
          <article-title>Mental spaces: Aspects of meaning construction in natural language</article-title>
          ,
          <source>CUP</source>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Joy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. F.</given-names>
            <surname>Sherry</surname>
          </string-name>
          Jr,
          <string-name>
            <given-names>J.</given-names>
            <surname>Deschenes</surname>
          </string-name>
          ,
          <article-title>Conceptual blending in advertising</article-title>
          ,
          <source>Journal of business research 62</source>
          (
          <year>2009</year>
          )
          <fpage>39</fpage>
          -
          <lpage>49</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>O.</given-names>
            <surname>Kutz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bateman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Neuhaus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mossakowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bhatt</surname>
          </string-name>
          , E pluribus unum: Formalisation,
          <article-title>use-cases, and computational support for conceptual blending</article-title>
          ,
          <source>in: Computational Creativity Research: Towards Creative Machines</source>
          , Springer,
          <year>2014</year>
          , pp.
          <fpage>167</fpage>
          -
          <lpage>196</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>B. J.</given-names>
            <surname>Phillips</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. F.</given-names>
            <surname>McQuarrie</surname>
          </string-name>
          ,
          <article-title>Beyond visual metaphor: A new typology of visual rhetoric in advertising</article-title>
          ,
          <source>Marketing theory 4</source>
          (
          <year>2004</year>
          )
          <fpage>113</fpage>
          -
          <lpage>136</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>P.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Linkola</surname>
          </string-name>
          , et al.,
          <article-title>Vismantic: Meaning-making with images</article-title>
          ., in: ICCC,
          <year>2015</year>
          , pp.
          <fpage>158</fpage>
          -
          <lpage>165</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>L. B.</given-names>
            <surname>Chilton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Ozmen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. H.</given-names>
            <surname>Ross</surname>
          </string-name>
          , V. Liu, Visifit:
          <article-title>Structuring iterative improvement for novice designers</article-title>
          ,
          <source>in: Proc. 2021 CHI Conf. on Human Factors in Computing Systems</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>14</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J.</given-names>
            <surname>Cunha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Martins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Machado</surname>
          </string-name>
          ,
          <article-title>Let's figure this out: A roadmap for visual conceptual blending</article-title>
          ,
          <source>in: Proc. of International Conference on Innovative Computing and Cloud Computing</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>S.</given-names>
            <surname>Melzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Peñaloza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Raganato</surname>
          </string-name>
          ,
          <article-title>Does stable diffusion dream of electric sheep?</article-title>
          ,
          <source>in: Proc. ISD7</source>
          , volume
          <volume>3511</volume>
          <source>of CEUR, CEUR-WS.org</source>
          ,
          <year>2023</year>
          . URL: https://ceur-ws.org/Vol-3511/paper_09.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <article-title>UniPC: A unified predictor-corrector framework for fast sampling of diffusion models</article-title>
          ,
          <year>2023</year>
          . arXiv:2302.04867.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hallacy</surname>
          </string-name>
          , et al.,
          <article-title>Learning transferable visual models from natural language supervision</article-title>
          ,
          <source>in: Proc. of International conference on machine learning, PMLR</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>8748</fpage>
          -
          <lpage>8763</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>