<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>DreamShot: Teaching Cinema Shots to Latent Diffusion Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tommaso Massaglia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bartolomeo Vacchetti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tania Cerquitelli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Polytechnic of Turin</institution>
          ,
          <addr-line>24 Corso Duca degli Abruzzi, Turin, 10129</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In recent years, several text-to-image synthesis models have been released that are increasingly capable of synthesizing realistic images that closely match the input text. Among the various state-of-the-art techniques and models, the introduction of the open-source latent diffusion model Stable Diffusion [1] has led to significant developments in text-to-image generation in recent months. By using techniques such as DreamBooth [2] and Textual Inversion [3], it is possible to further refine and control the generation process to produce even more specific output than text alone would allow. We test this approach for generating three specific cinematographic shot types: Close-up, Medium Shot, and Long Shot. By fine-tuning Stable Diffusion 1.5 on a small dataset of 600 labelled and captioned film frames, we achieve a noticeable increase in CLIP-T and DINO scores and an overall noticeable qualitative improvement (as indicated by our human-run evaluation survey) in image likability, compliance, and shot type correctness.</p>
      </abstract>
      <kwd-group>
        <kwd>Diffusion Models</kwd>
        <kwd>Shot Types</kwd>
        <kwd>text to image</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Image generation has seen a major rise in popularity since
the release of the Diffusion Model [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] architecture, with
improvements in generation quality that bring the
pictures ever closer to realistic art pieces and photos.
Being able to generate realistic pictures that follow a given
textual description through the use of models such as the
Latent Diffusion [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] based Stable Diffusion [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] opens up a
multitude of previously unattainable tasks, which are
further enhanced by the simple way of adding new subjects
provided by DreamBooth [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. By using these two
techniques it would be possible, for example, to automatically
generate an advertising campaign for a novel product or to
perform seamless photo editing through textual instructions.
Notably, cinema heavily relies on the utilization and
creation of reference images to enhance workflow efficiency.
With the capacity to generate realistic images, creating
expressive reference images that precisely convey the
intended shot becomes readily accessible to all, eliminating
the need for an extensive reference library or artistic
drawing skills. Such reference images and sketches are widely
employed in storyboarding, an essential film-making
technique that aids in visualizing the narrative and streamlining
the filming process. Within this context, the selection of the
desired shot type plays an important role, as it significantly
influences the audience’s focus and emotions [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>[Table 1: number of downloads by fine-tuning type (DreamBooth Checkpoint, LoRA DreamBooth, Textual Inversion).]</p>
      <p>
        To the best of our knowledge, the use of text-to-image
generation models and their fine-tuning in this context
remains largely unexplored. In this paper, we explore the
use of DreamBooth [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] (as it is the most widely used
fine-tuning approach for pre-trained Latent Diffusion models, as
shown in Table 1) to add the knowledge of three specific
shot types (close shot, medium shot, and long shot) to a
pre-trained version of stable-diffusion-v1-5 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Given a
textual input and a desired shot scale, our methodology is able
to generate synthetic scenes that are semantically close to
the input and to the scale selected. Using the same testing
setup that was proposed in the original DreamBooth [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
paper, we achieve an improvement over the baseline model
in both CLIP-T [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and DINO [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] scores. We complement
this testing with a survey conducted on 55 subjects, which
further shows the qualitative improvements achieved by
our approach.
      </p>
      <p>Our contributions are the following: the outline of a
methodological approach to fine-tuning an existing latent
diffusion model with state-of-the-art techniques
(DreamBooth) to teach it a new style; the steps necessary to build a
training set out of unlabeled movie shots in order to
fine-tune a pre-trained model; and a set of three fine-tuned models
catered towards the generation of three specific shot types:
close shot, medium shot, and long shot.</p>
      <p>The paper is organized as follows: Section 2 discusses the
related works exploited in the proposed methodology; Section 3
covers the methodology and describes the techniques on which
our approach relies; Section 4 outlines the testing
procedure, metrics used, and relevant results.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <sec id="sec-2-1">
        <title>2.1. Storyboarding</title>
        <p>
          In recent years a growing number of studies have focused on
the automation of video editing tasks. While these works,
such as [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] and [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], achieve impressive performance in
the generation of a video, either given as input a textual
prompt [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], or a combination of textual prompt and image
[
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], they focus on the generation of motion and do not take
into account the shot type used.
        </p>
        <p>
          By generating more scenographic shots, one of the many
applications that become available is text-to-image
storyboard creation. Existing storyboarding tools either extend
digital painting applications (e.g. [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]), allow the user to
place predetermined objects in a scene to compose the
desired frame (e.g. [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]), or provide a simple interface to create a
reference of the desired scene (e.g. [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]).
        </p>
        <p>
          For more deep learning-related approaches, StoryGAN
[
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] generates a sequence of images that describe a story
written in a multi-sequence paragraph. To do this, the
proposed framework uses a sequential Generative Adversarial
Network [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] that consists of a Story Encoder, an RNN-based
Context Encoder, an image generator conditioned on the
story context, and an image/story discriminator that ensures
consistency. Diffusion Models, by contrast, allow high-quality
generation across multiple domains without needing specific training,
and offer a better understanding of the conditional text input than
GANs. Conditioning on previous frames could
be a possible approach for increased temporal consistency
even in LDMs.
        </p>
        <p>
          Dynamic Storyboarding [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] approaches the
storyboarding task directly by automatically composing scenes out of
user inputs, simulating the scene in a virtual environment
and selecting the best proposal out of the available
ones. This approach generates rich and complex dynamic
(video) storyboards, but it lacks the customizability and
intuitiveness that Diffusion Models offer through textual
conditioning. Furthermore, by using ControlNet-trained
networks (Section 2.5) it is possible to add conditioning through more
inputs such as scribbles, which at the cost of a slightly higher
effort can lead to much better generations.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Text-to-Image Diffusion Model</title>
        <p>
          Diffusion models are a type of probabilistic generative
model that generates samples from a learned distribution by
reversing the "diffusion process", modeled as a Markov process
of gradual Gaussian noise addition. The generative process
is carried out by gradually removing noise from a random initial
sample. A text-to-image diffusion model $\hat{x}_\theta$, given a noise
map $z_t \sim \mathcal{N}(0, I)$ at timestep $t$ and a conditioning vector
$c = \Gamma(P)$ generated using a text encoder $\Gamma$ and a prompt $P$,
generates an image $\hat{x}_\theta(z_t, t, \Gamma(P))$. During training, the
sample generated using the conditioning $\Gamma(P)$ is compared
to its original counterpart $x$. The loss is computed as:
$$L = \mathbb{E}_{x,\, z_t \sim \mathcal{N}(0, I),\, t}\left[\, \lVert x - \hat{x}_\theta(z_t, t, \Gamma(P)) \rVert_2^2 \,\right], \quad (1)$$
where both $\theta$ and $\Gamma$ are jointly optimized during training.
        </p>
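        <p>
          As a concrete illustration, the following is a minimal sketch of one training step implementing the objective in Eq. (1); the denoising model, text encoder, and forward-noising helper are hypothetical stand-ins, not the exact training code used here.
        </p>
        <preformat>
import torch
import torch.nn.functional as F

def training_step(model, text_encoder, x, prompt_tokens, add_noise, num_timesteps=1000):
    # `model`, `text_encoder`, and `add_noise` are hypothetical stand-ins for
    # x_hat_theta, Gamma, and the schedule-dependent forward diffusion.
    t = torch.randint(0, num_timesteps, (x.shape[0],))  # random timestep per sample
    noise = torch.randn_like(x)                         # Gaussian noise
    z_t = add_noise(x, noise, t)                        # noised version of x at timestep t
    c = text_encoder(prompt_tokens)                     # conditioning vector c = Gamma(P)
    x_hat = model(z_t, t, c)                            # predicted reconstruction
    return F.mse_loss(x_hat, x)                         # squared L2 distance to the original x
        </preformat>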
      </sec>
      <sec id="sec-2-3">
        <title>2.3. CLIP</title>
        <p>
          CLIP [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], short for Contrastive Language Image Pretraining,
is a technique developed to approach the zero-shot
classification task by learning the contents of an image directly
from a raw text description of it rather than from labels (such
as the classes found in the ImageNet dataset). By learning
from natural language, the resulting model is much easier to
scale compared to standard crowd-sourced datasets, thanks
to the vast amount of text available on the internet. The
representation that is learned with CLIP is tightly connected
to language, which enables flexible zero-shot transfer. Given
a batch of $N$ (text, image) pairs, CLIP is trained to predict
which of the $N \times N$ possible pairings across a batch actually
occurred. To do this, CLIP learns a multi-modal embedding
space by jointly training an image encoder (based on a
vision transformer) and a text encoder to maximize the cosine
similarity of the image and text embeddings of the $N$ real
pairs, while minimizing the cosine similarity of the $N^2 - N$
incorrect pairings.
        </p>
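        <p>
          A minimal sketch of this contrastive objective, assuming pre-computed batches of image and text embeddings (the temperature value is illustrative):
        </p>
        <preformat>
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Normalize, then build the N x N cosine-similarity matrix of all pairings.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    # The N real pairs sit on the diagonal; the N^2 - N off-diagonal
    # pairings are pushed apart via symmetric cross-entropy.
    labels = torch.arange(logits.shape[0], device=logits.device)
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2
        </preformat>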
      </sec>
      <sec id="sec-2-3">
        <title>2.4. Latent Difusion</title>
        <p>
          Latent Diffusion Models are introduced in [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], which
proposes to move the diffusion process from the
computationally expensive pixel space to a less intensive latent space.
Given an image $x \in \mathbb{R}^{H \times W \times 3}$ in RGB space, the encoder
$\mathcal{E}$ encodes $x$ into a latent representation $z = \mathcal{E}(x)$, and the
decoder $\mathcal{D}$ reconstructs the image from the latent, giving
$\tilde{x} = \mathcal{D}(z) = \mathcal{D}(\mathcal{E}(x))$. Thanks to the latent representation
enabled by $\mathcal{E}$ and $\mathcal{D}$, likelihood-based modelling becomes
a more suitable task, as higher-complexity details are
abstracted away and the learning can focus on the important
semantic bits of the data. Rather than using an
autoregressive, attention-based approach, image-specific inductive
biases can be taken advantage of: the underlying U-Net is built
primarily from 2D convolutional layers. Different forms of
conditioning can be applied during generation, such as
image maps and text (which uses CLIP encodings to generate
the conditioning tokens); the text-to-image generation
process is carried out by feeding a random noise vector
and a textual prompt as input to the denoising U-Net of the model.
        </p>
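        <p>
          For illustration, a minimal sketch of the encode/decode roundtrip through the latent space, using the publicly released Stable Diffusion VAE via the diffusers library (the checkpoint name is assumed and error handling is omitted):
        </p>
        <preformat>
import torch
from diffusers import AutoencoderKL

# Load the VAE (encoder E and decoder D) from the public SD 1.5 weights.
vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")

x = torch.randn(1, 3, 512, 512)              # stand-in for an RGB image in [-1, 1]
with torch.no_grad():
    z = vae.encode(x).latent_dist.sample()   # z = E(x), a much smaller latent
    x_tilde = vae.decode(z).sample           # x~ = D(z) = D(E(x))
print(z.shape)                               # e.g. torch.Size([1, 4, 64, 64])
        </preformat>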
      </sec>
      <sec id="sec-2-4">
        <title>2.5. ControlNet</title>
        <p>
          Described in [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], ControlNet is a network structure
developed to support additional input conditions in existing
diffusion models; rather than controlling the synthesis of
images only through text or an input image, ControlNet
allows the use of inputs such as Canny edge maps, depth maps,
and poses for the denoising process, even
combining them in the same process, allowing for an increased
level of control over the output.
        </p>
        <p>
          ControlNet works by creating a trainable copy and a
locked copy of an existing large diffusion model; the locked
copy preserves the network capabilities learned from billions
of images, while the trainable copy is trained on task-specific
datasets to learn the conditional control. The two networks
are then connected using a new type of convolution layer
called zero convolution. Only the first half of the denoising
U-Net is trained, and the encoder blocks are connected to
their respective decoder blocks through zero convolutions.
Video ControlNet [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] proposes an approach that enhances
temporal consistency when converting an existing video
using Stable Diffusion.
        </p>
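        <p>
          A minimal sketch of the zero convolution idea: a 1x1 convolution initialized to zero, so the added branch contributes nothing at the start of training and the locked model's behaviour is preserved. This is a simplification of the layer described in [17]:
        </p>
        <preformat>
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    # 1x1 convolution whose weights and bias start at exactly zero.
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

# Joining the trainable copy's output to the locked network:
# locked_feature + zero_conv(c)(trainable_feature) initially equals
# locked_feature, so training starts from the original model's behaviour.
        </preformat>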
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <p>
        Modern diffusion models can increasingly produce
photorealistic images through conditional generation,
images that are almost indistinguishable from real ones
to the human eye. The most
common form of conditioning is through text (called a "prompt").
By encoding the text and using the resulting encodings as
conditioning in the cross-attention layers of the denoising U-Net,
it is possible to influence the generation process
toward a desired outcome. In most cases, however, the amount
of control we can exert over the output is limited, and
requires either specialized prompt engineering or fine-tuning
to teach the model how to better represent the desired
concept. Extensive fine-tuning can be prohibitively expensive
and requires multiple GPU hours on a cluster. To solve this
problem, techniques such as DreamBooth [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] have been developed to quickly add
new themes or styles to an existing large diffusion model.
      </p>
      <p>The intuition behind our approach is that learning a shot
type is similar in a way to learning a style (if a painter always
painted portraits, his "style" would always have the subject
close to the camera), and as such we could use DreamBooth's
capabilities to teach an existing Latent Diffusion Model what
different shot types are.</p>
      <p>
        Figure (1) outlines the basic steps we adopted to fine-tune
the model. The particular DreamBooth implementation we
used leverages Low-Rank Adaptation (LoRA) [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] to
significantly reduce training time and more easily create shareable
checkpoints. The entire process consists of creating a
well-constructed dataset (since the quality of the training images
and labels greatly affects the output model), selecting a base
model for fine-tuning, and creating a $\Delta W$. We refer to the
base model as $W$ and the fine-tuned model as $W'$, such that
$W' = W + \Delta W$. $\Delta W$ contains the learned weights that
can then be invoked during inference to be applied to the
selected base.
      </p>
      <sec id="sec-3-1">
        <title>3.1. Training set creation</title>
        <p>
          The training set used when fine-tuning a pre-trained
diffusion model is one of the most important contributors
to the output quality. As the model learns to reproduce the
contents of the training set, having high-quality samples
improves the generated image quality as well. Another
important aspect of the training set is the caption
associated with each image. DreamBooth adds
knowledge to a pre-trained model by learning the
concepts of the input images that the original model doesn't
already possess in its prior knowledge. In our case, the
caption associated with each shot should include a highly
accurate description of the shot, so that the model picks
up the concept of the shot scale and not other already
known ones. To reach this goal, the creation of
a task-specific training set, we define a five-step approach
that can be applied to any large dataset of movie shots (a
sketch of the cropping and captioning steps follows this
paragraph). (i)
Data Collection: the first step is to acquire a large enough
dataset to use as a base; movie shot datasets have a wide
range of image quality, so it is advisable to start from a
large enough one in order to guarantee
enough high-quality samples. (ii) Filtering: depending
on the metadata available for the chosen dataset, filtering
out the lower-quality images, even with arbitrary filters,
can largely improve the speed of the subsequent steps. (iii)
Cropping: the required aspect ratio for images when
fine-tuning Stable Diffusion is 1:1, with the most used sizes
being 768 × 768, 512 × 512 and 256 × 256. By using a
content-aware cropping method it is possible to obtain the
necessary image size quickly while keeping the most
important part of the shot. (iv) Labeling and shot
selection: as there is no sufficiently precise approach for automatic
shot labelling, and the shots require close supervision for
the quality of the image and the crop, labelling by hand
becomes a necessity. By sampling without repetition from
the available pool of images and assigning the correct label,
it is possible to quickly handpick and label the necessary
shots, which should range between 100 and 200 per style.
A good movie variety should be kept so as not to teach unwanted
subjects. (v) Captioning: once the required number of images per
shot scale is reached, a first basic caption can be generated
by using models such as BLIP-2 [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ], which also has the
advantage of generating captions that resemble the CLIP
description style. Once again, human supervision is highly
suggested for the generated captions.
        </p>
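        <p>
          A minimal sketch of steps (iii) and (v), assuming a plain center crop in place of the content-aware cropping actually used, and the public BLIP-2 checkpoint from the transformers library:
        </p>
        <preformat>
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Public BLIP-2 checkpoint (assumed here; see Section 4.1).
processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xl")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xl")

def square_crop(img: Image.Image, size: int = 512) -> Image.Image:
    # Naive center crop to a 1:1 ratio; a content-aware method is preferable.
    w, h = img.size
    side = min(w, h)
    left, top = (w - side) // 2, (h - side) // 2
    return img.crop((left, top, left + side, top + side)).resize((size, size))

def caption(img: Image.Image) -> str:
    # Generate a first CLIP-style caption, to be reviewed by hand afterwards.
    inputs = processor(images=img, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(out[0], skip_special_tokens=True).strip()
        </preformat>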
        <p>Once the dataset is correctly prepared, the training can
begin.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Model Training</title>
        <p>
          In order to fine-tune the LDM we used DreamBooth [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. The
idea behind DreamBooth is, given a few input images
(≈3-5), to bind the subject to a unique identifier such that
when it is used in the prompt along with the class it belongs
to (e.g. "A [V] dog"), the prior knowledge of the class is used
along with the new information to reconstruct the subject. A new
autogenous class-specific prior preservation loss is
introduced on top of the regular training objective to encourage
diversity and counter language drift. During training, the
model is supervised with its own generated samples in order
to retain the prior knowledge of the class and to use it along
with the knowledge of the subject instance to generate new
samples.
        </p>
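        <p>
          A minimal sketch of the prior-preservation idea: the training objective is augmented with a second reconstruction term computed on the model's own pre-generated class samples (names and weighting are illustrative, not the exact DreamBooth code):
        </p>
        <preformat>
import torch.nn.functional as F

def dreambooth_loss(model, subject_inputs, subject_target,
                    prior_inputs, prior_target, lambda_prior=1.0):
    # Reconstruction term on the subject images ("a [V] dog").
    subject_loss = F.mse_loss(model(*subject_inputs), subject_target)
    # Prior-preservation term on self-generated class images ("a dog"),
    # which counters language drift and keeps class diversity.
    prior_loss = F.mse_loss(model(*prior_inputs), prior_target)
    return subject_loss + lambda_prior * prior_loss
        </preformat>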
        <p>
          By itself, DreamBooth already manages to significantly
decrease the cost of adding a subject to an existing model.
But, as a further optimization, we used Low-Rank
Adaptation [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] applied to the DreamBooth process [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]. LoRA
allows efficient fine-tuning even on low-power devices while
keeping a high-quality end result. Instead of training the
entire model, LoRA works by fine-tuning the residual, i.e.
training $\Delta W$ instead of $W$:
        </p>
        <p>$$W' = W + \Delta W \quad (2)$$</p>
        <p>Through matrix decomposition it is possible to further
decrease the number of parameters to fine-tune, hence reducing
the size of the output model by an even larger degree:
$$\Delta W = B A, \quad (3)$$
where $B$ and $A$ are two low-rank matrices whose product has
the same shape as $\Delta W$.</p>
        <p>Tuning only the parameters of the cross-attention
layers in the denoising U-Net of Stable Diffusion is enough
to obtain the desired output.</p>
        <p>Given an existing diffusion model $W$, a LoRA of it is
applied on top in the form of $W' = W + \alpha \Delta W$: when $\alpha$ is
0 the model is the same as the original one; when $\alpha$ is 1 the
model is the same as the fully fine-tuned one. Applying this
form of optimization to DreamBooth makes it possible to
achieve two primary goals: faster and less complex training,
and a lightweight and more versatile output.</p>
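        <p>
          A minimal sketch of a LoRA-wrapped linear layer implementing $W' = W + \alpha \Delta W$ with $\Delta W = B A$ (rank and initialization values are illustrative):
        </p>
        <preformat>
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer W plus a trainable low-rank residual alpha * (B @ A)."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)             # W (and its bias) stay frozen
        # Delta W = B @ A, with far fewer parameters than W itself.
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # starts at zero
        self.alpha = alpha                      # blending weight

    def forward(self, x):
        # base(x) + alpha * x @ (B @ A)^T
        return self.base(x) + self.alpha * (x @ self.A.t() @ self.B.t())
        </preformat>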
        <p>Once the training phase is finished, an output file is
produced which contains the weights learned during training.
The model is then used alongside the original one that was
used as a base during the fine-tuning process (in this case
stable-diffusion-v1-5) to synthesize images.</p>
        <p>In our specific case, no unique identifier was specified
during training; by not binding the concept to a specific
token, the model always generates in the trained style (or
shot type, in our case) when the $\Delta W$ model is specified in
the prompt.</p>
        <p>The caption in figure (2) is the prompt that was used to
generate the picture. The token "$\Delta W$" is a placeholder
control sequence added to the prompt to apply the
weights and layers from the LoRA ($\Delta W$, closeshot in this
case) to the pretrained full model being used for the
generation, with weight $\alpha$.
Once the model is successfully trained, the generative
process can begin. Generation is performed by providing the
model with a series of parameters along with a textual
prompt describing the scene. The prompt can be placed either
in the positive field, where the generation is moved towards
the conditioning, or in the negative field, where the model
generates away from the concepts specified in the negative
field. Prompt engineering plays a big role in the generative
process, with certain prompts such as "high quality" and
"masterpiece" guiding the generated image towards more
aesthetically pleasing results. The most meaningful
generation parameters (see the sketch after this list) are:
• Sampler: at each step of the diffusion process a
certain amount of noise is predicted and subtracted
from the image. The sampler takes care of both
computing the predicted noise and scheduling the noise
level at each sampling step so that an equally noisy
image can be sampled. Many samplers are available,
each with different benefits.
• Steps: determines how much noise is subtracted from
the image at each step; the larger the number of
steps, the slower the generation process is, but finer
details might be developed this way.
• CFG Scale: short for Classifier-Free Guidance scale;
classifier-free guidance is a technique that moves
the generated samples away from random unlabeled
ones, essentially making the generated image adhere
more to the provided prompt.
• Seed: determines the initial noise map; different
seeds will result in different images.</p>
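        <p>
          A minimal sketch of this generation setup using the diffusers library; the LoRA checkpoint path and parameter values are illustrative, and the LoRA-loading call reflects diffusers conventions rather than the exact tooling used here:
        </p>
        <preformat>
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe.load_lora_weights("closeshot_lora")          # hypothetical Delta W checkpoint

image = pipe(
    prompt="a man at a bar, high quality, masterpiece",  # positive field
    negative_prompt="blurry, low quality",               # negative field
    num_inference_steps=30,                              # Steps
    guidance_scale=7.5,                                  # CFG Scale
    generator=torch.Generator().manual_seed(42),         # Seed -> initial noise map
    cross_attention_kwargs={"scale": 0.8},               # alpha: LoRA influence
).images[0]
image.save("closeshot_example.png")
        </preformat>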
        <p>Furthermore, the value $\alpha$, which determines how much the
$\Delta W$ model weights are applied, plays an important role in
the generative process. As there is no deterministically
perfect way to train a DreamBooth model, sometimes lowering
how much influence the fine-tune has can improve results.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Preliminary Experiments</title>
      <sec id="sec-4-1">
        <title>4.1. Training Set</title>
        <p>Among the many available movie repositories,
[FILMGRAB] (open source for research purposes) was chosen
as it provides high-quality, hand-picked movie frames.</p>
        <p>We began by collecting 127,000 shots from 2166 movies.
All the pictures with fewer than 3 color channels were pruned,
as well as the ones coming from movies released before
2013, to guarantee a certain degree of image quality and
resolution. The shots were then cropped using content-aware
image cropping to the size of 512 × 512 pixels because of
computational constraints. Out of the remaining 41,750,
only 600 (200 per shot type) were then selected. As the
number of required pictures is relatively small, shot-type
selection and labelling were performed by hand.
Randomization was achieved by sampling single shots from all the
available ones and assigning a label, adding a shot to the
training set if and only if the quality and crop were deemed
appropriate. As the training set is small, the training
is very sensitive to bad samples.</p>
        <p>
          The final step was adding textual captions. To aid in the
captioning process, the Vision-Language model
blip2-flan-t5-xl [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] was used to generate a first CLIP [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] style caption
with human supervision.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Testing Set</title>
        <p>
          The dataset used for testing is composed of 1800 shots
sampled from the filtered 41,750 shots, evenly distributed
between the three shot types (long shot, medium shot, close
shot), and their respective captions generated using BLIP-2
[
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] without supervision, to reflect a fully
automatic testing setup. The collected captions were
then randomly sampled and used to generate two pictures
from the same starting seed, one with and one
without training, for a total of 1500 pairs of "trained" and
"non-trained" images, evenly split between shot types, with
the generation parameters listed in Table 2.
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Metrics</title>
        <p>
          To get a quantitative result two metrics were adopted
following in the footsteps of the original DreamBooth [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]
implementation. The first one is CLIP-T [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], the average pairwise
cosine similarity between the CLIP embeddings of the
generated image and the prompt that generated it. The second
metric, DINO [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], measures the average pairwise cosine
similarity between the ViT-S/16 DINO embeddings of
generated and real images, essentially measuring how similar
the generated image is to its real counterpart. The results
shown in Table 3 show a slight (although significant for the
considered metrics) increase in both the CLIP-T and DINO
scores over the baseline model. The lower increase seen in
CLIP-T compared to the DINO metric is justified as the
model doesn't learn to represent more concepts (so from a
CLIP perspective the objects present in the picture are the
same) with our fine-tuning, but instead learns to represent
them closer to the training images, especially from a camera
distance perspective. From a qualitative analysis, it appears
that the fine-tuned model is more often able to generate
images that are semantically close to the prompt used to
generate them. Sometimes it even generates elements that
are present in the prompt that the baseline model ignored
(e.g., a person when two were specified, a car that is not
present). In addition, since there is no free lunch, we expect
the fine-tuned model to perform worse on other generative
tasks (although this has not been tested), and in the generated
examples we can see that it more often generates faces
similar to those shown during training.
        </p>
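        <p>
          A minimal sketch of both metrics, assuming the public Hugging Face checkpoints (which may differ from the exact ones used here) and single-image inputs:
        </p>
        <preformat>
import torch
from transformers import CLIPModel, CLIPProcessor, ViTModel, ViTImageProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
dino = ViTModel.from_pretrained("facebook/dino-vits16")
dino_proc = ViTImageProcessor.from_pretrained("facebook/dino-vits16")

def clip_t(image, prompt):
    # Cosine similarity between the CLIP image and text embeddings.
    inputs = clip_proc(text=[prompt], images=image, return_tensors="pt", padding=True)
    out = clip(**inputs)
    return torch.cosine_similarity(out.image_embeds, out.text_embeds).item()

def dino_score(generated, real):
    # Cosine similarity between ViT-S/16 DINO [CLS] embeddings.
    embs = []
    for image in (generated, real):
        inputs = dino_proc(images=image, return_tensors="pt")
        embs.append(dino(**inputs).last_hidden_state[:, 0])
    return torch.cosine_similarity(embs[0], embs[1]).item()
        </preformat>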
        <p>As a secondary and ablation study, 600 additional image
pairs were generated using the same setup as before, but
removing all information regarding the shot type
from the text conditioning. Looking at the results of the
DINO score in Table 4, it can be seen that the images
generated with the fine-tuned model still have a higher DINO
score than those generated with the baseline.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Qualitative Survey</title>
        <p>We additionally conducted a survey of human subjects. Each
subject was shown a total of 36 pairs of images A and B
generated with the same settings and prompt, one from the
baseline model and one from the fine-tuned one. Whether an
image was labelled A or B was randomized. The generated
pairs were lightly screened to ensure that
the images were safe for all audiences. Each image pair was shown
along with its associated shot type and generator prompt.
For each image pair, three questions were asked: (i) Which
image do you like best?; (ii) Which image corresponds more
to the associated shot type?; (iii) Which image corresponds
more to the associated prompt?</p>
        <p>The possible answers for each question were A, B, or
neither/same if the two images were considered equivalent
in some aspect. A total of 55 subjects responded to the
survey, and the results are reported in Table 5. It can be seen
that even with human evaluation, our approach generates
images that are more appealing and closer to the associated
shot type and prompt in almost or more than half of the
cases.</p>
        <p>Aside from image likability, the baseline model obtained
the lowest score of the three answers, indicating that in most
cases the fine-tuned generation is of at least equal quality to
the generation without fine-tuning. The results of the survey
are consistent with the CLIP-T and DINO metrics: the higher
likeability and shot-type closeness relate directly to DINO,
and both are noticeably higher than prompt closeness and
CLIP-T when compared to the baseline.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions and Future</title>
    </sec>
    <sec id="sec-6">
      <title>Developments</title>
      <p>
        We have presented an approach that uses novel techniques
such as DreamBooth and LoRA to fine-tune an existing
latent diffusion model to generate specific shot types.
Based on the intuition that learning a shot type is similar
to learning a style, which DreamBooth was shown to be
capable of, we achieve improvements in both compliance
and similarity to reference images by using only 200
images for each shot type, as shown by CLIP-T, DINO, and
human evaluation metrics. We test our approach on
a storyboarding task, showing the potential uses of
modern LDMs in video production, mainly when supported by
domain-specific training. Furthermore, novel techniques
such as ControlNet open the doors to even more specific
conditioning forms. Developments such as [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] show the
power that ControlNet offers, and applying the technique
for cinematic purposes could be an interesting development
point. Regarding our work, as DreamBooth training is far
from a solved task, more tests could yield even better results.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] Stability AI, Stable Diffusion public release blog post, https://stability.ai/blog/stable-diffusion-public-release, 2022 (accessed 23-May-2023).</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] N. Ruiz, Y. Li, V. Jampani, Y. Pritch, M. Rubinstein, K. Aberman, DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation, 2023. arXiv:2208.12242.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] R. Gal, Y. Alaluf, Y. Atzmon, O. Patashnik, A. H. Bermano, G. Chechik, D. Cohen-Or, An image is worth one word: Personalizing text-to-image generation using textual inversion, 2022. arXiv:2208.01618.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] J. Ho, A. Jain, P. Abbeel, Denoising diffusion probabilistic models, 2020. arXiv:2006.11239.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, B. Ommer, High-resolution image synthesis with latent diffusion models, 2022. arXiv:2112.10752.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] B. Rooney, K. E. Bálint, Watching more closely: Shot scale affects film viewers' theory of mind tendency but not ability, Frontiers in Psychology 8 (2018). doi:10.3389/fpsyg.2017.02349.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever, Learning transferable visual models from natural language supervision, 2021. arXiv:2103.00020.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, A. Joulin, Emerging properties in self-supervised vision transformers, 2021. arXiv:2104.14294.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] E. Molad, E. Horwitz, D. Valevski, A. R. Acha, Y. Matias, Y. Pritch, Y. Leviathan, Y. Hoshen, Dreamix: Video diffusion models are general video editors, arXiv preprint arXiv:2302.01329 (2023).</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] U. Singer, A. Polyak, T. Hayes, X. Yin, J. An, S. Zhang, Q. Hu, H. Yang, O. Ashual, O. Gafni, D. Parikh, S. Gupta, Y. Taigman, Make-a-video: Text-to-video generation without text-video data, ArXiv abs/2209.14792 (2022).</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] Storyboarder, https://wonderunit.com/storyboarder/.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] Storyboardthat, https://www.storyboardthat.com/.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] Studiobinder, https://www.studiobinder.com/storyboard-creator/.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] Y. Li, Z. Gan, Y. Shen, J. Liu, Y. Cheng, Y. Wu, L. Carin, D. Carlson, J. Gao, StoryGAN: A sequential conditional GAN for story visualization, 2019. arXiv:1812.02784.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial networks, 2014. arXiv:1406.2661.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] A. Rao, X. Jiang, Y. Guo, L. Xu, L. Yang, L. Jin, D. Lin, B. Dai, Dynamic storyboard generation in an engine-based virtual environment for video production, ArXiv abs/2301.12688 (2023).</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] L. Zhang, M. Agrawala, Adding conditional control to text-to-image diffusion models, 2023. arXiv:2302.05543.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] E. Chu, S.-Y. Lin, J.-C. Chen, Video ControlNet: Towards temporally consistent synthetic-to-real video translation using conditional image diffusion models, 2023. arXiv:2305.19193.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, LoRA: Low-rank adaptation of large language models, 2021. arXiv:2106.09685.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] J. Li, D. Li, S. Savarese, S. Hoi, BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023. arXiv:2301.12597.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] S. R. (aka cloneofsimo), lora, https://github.com/cloneofsimo/lora, 2023.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>