                         DreamShot: Teaching Cinema Shots to Latent Diffusion Models
                         Tommaso Massaglia1,* , Bartolomeo Vacchetti1,† and Tania Cerquitelli1
                         1
                             Polytechnic of Turin, 24 Corso Duca degli Abruzzi, Turin, 10129, Italy


Abstract
In recent years, several text-to-image synthesis models have been released that are increasingly capable of synthesizing realistic images close to the input text. Among the various state-of-the-art techniques and models, the introduction of the open-source latent diffusion model Stable Diffusion [1] has led to significant developments in text-to-image generation in recent months. By using techniques such as DreamBooth [2] and Textual Inversion [3], it is possible to further refine and control the generation process to produce even more specific output than text alone would allow. We test this approach for generating three specific cinematographic shot types: Close-up, Medium Shot, and Long Shot. By fine-tuning Stable Diffusion 1.5 on a small dataset of 600 labelled and captioned film frames, we achieve a noticeable increase in CLIP-T and DINO scores and an overall noticeable qualitative improvement (as indicated by our human-run evaluation survey) in image likability, compliance, and shot type correctness.

                                              Keywords
Diffusion Models, Shot Types, Text-to-Image



Published in the Proceedings of the Workshops of the EDBT/ICDT 2024 Joint Conference (March 25-28, 2024), Paestum, Italy.
* Main author.
† Corresponding author.
tommaso.massaglia@studenti.polito.it (T. Massaglia); bartolomeo.vacchetti@polito.it (B. Vacchetti); tania.cerquitelli@polito.it (T. Cerquitelli)
ORCID: 0000-0001-5583-4692 (B. Vacchetti); 0000-0002-9039-6226 (T. Cerquitelli)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


1. Introduction

Image generation has seen a major rise in popularity since the release of the Diffusion Model [4] architecture, with improvements in the quality of the generations that bring the pictures ever closer to realistic art pieces and photographs. Being able to generate realistic pictures that follow a given textual description, through models such as the Latent Diffusion [5] based Stable Diffusion [1], opens up a multitude of previously unattainable tasks, which are further improved by the ability to add new subjects in a simple way provided by DreamBooth [2]. By using these two techniques it would be possible, for example, to automatically generate an advertising campaign for a novel product or to perform seamless photo editing through textual instructions. Notably, cinema heavily relies on the utilization and creation of reference images to enhance workflow efficiency. With the capacity to generate realistic images, producing expressive reference images that precisely convey the intended shot becomes readily accessible to all, eliminating the need for an extensive reference library or artistic drawing skills. These reference images and sketches are widely employed in storyboarding, an essential film-making technique that aids in visualizing the narrative and streamlining the filming process. Within this context, the selection of the desired shot type plays an important role, as it significantly influences the audience's focus and emotions [6].

Table 1
Total number and respective downloads of the top 100 models hosted on Civitai.

  type                    number   downloads
  DreamBooth Checkpoint       70   5,575,099
  LoRa DreamBooth             26   1,670,288
  Textual Inversion            4     348,187

To the best of our knowledge, the use of text-to-image generation models and their fine-tuning in this context remains widely unexplored. In this paper, we explore the use of DreamBooth [2] (as it is the most widely used fine-tuning approach for pre-trained Latent Diffusion models, as shown in Table 1) to add the knowledge of three specific shot types, close shot, medium shot, and long shot, to a pre-trained version of stable-diffusion-v1-5 [1]. Given a textual input and a desired shot scale, our methodology is able to generate synthetic scenes that are semantically close to the input and to the selected scale. Using the same testing setup that was proposed in the original DreamBooth [2] paper, we achieve an improvement over the baseline model in both CLIP-T [7] and DINO [8] scores. We complement this testing with a survey conducted on 55 subjects which further shows the qualitative improvements achieved by our approach.

Our contributions are the following: the outlining of a methodological approach to fine-tuning an existing latent diffusion model with state-of-the-art techniques (DreamBooth) to teach a new style; the steps necessary to build a training set out of unlabeled movie shots in order to fine-tune a pre-trained model; and a set of three fine-tuned models catered towards the generation of three specific shot types: close shot, medium shot, and long shot.

The paper is organized as follows: Section 2 discusses the methods exploited in the proposed methodology; Section 3 covers the methodology and describes the techniques on which our approach relies; Section 4 outlines the testing procedure, metrics used, and relevant results.


2. Related Works

2.1. Storyboarding

In recent years a growing number of studies have focused on the automation of video editing tasks. While these works, such as [9] and [10], achieve impressive performance in the generation of a video, given as input either a textual prompt [10] or a combination of textual prompt and image [9], they focus on the generation of motion and do not take into account the shot type used.

By generating more scenographic shots, one of the many applications that become available is text-to-image storyboard creation. Existing storyboarding tools either extend digital painting applications (e.g. [11]), allow the user to place predetermined objects in a scene to compose the desired frame (e.g. [12]), or provide a simple interface to create a reference of the desired scene (e.g. [13]).



For more deep learning-related approaches, StoryGAN [14] generates a sequence of images that describe a story written as a multi-sentence paragraph. To do this, the proposed framework uses a sequential Generative Adversarial Network [15] that consists of a Story Encoder, an RNN-based Context Encoder, an image generator conditioned on the story context, and an image/story discriminator that ensures consistency. Diffusion Models allow for high-quality generation on multiple domains without needing specific training, and a better understanding of the conditional text input than GANs. Conditioning on previous frames could be a possible approach for increased temporal consistency even in LDMs.

Dynamic Storyboarding [16] approaches the storyboarding task directly by automatically composing scenes out of user inputs, simulating the scene in a virtual environment and discriminating the best proposal out of the available ones. This approach generates rich and complex dynamic (video) storyboards, but it lacks the customizability and intuitiveness that Diffusion Models offer through textual conditioning. Furthermore, by using ControlNet-trained networks (Section 2.5) it is possible to add conditioning through additional inputs such as scribbles, which at the cost of a slightly higher effort can lead to much better generations.
2.2. Text-to-Image Diffusion Model

Diffusion models are a type of probabilistic generative model that generates samples from a learned distribution by reversing the "diffusion process", modeled as a Markov process of gradual Gaussian noise addition. The generative process is carried out by gradually removing noise from a random initial sample. A text-to-image diffusion model $\epsilon_\theta$, given a noise map $z_t \sim \mathcal{N}(0, 1)$ at timestep $t$ and a conditioning vector $c = c_\theta(y)$ generated using a text encoder $c_\theta$ and prompt $y$, generates an image $\epsilon_\theta(z_t, t, c_\theta(y))$. During training, the noise predicted using the conditioning $c_\theta(y)$ is compared to its original counterpart $\epsilon$. The loss is computed as:

    $L_{DM} = \mathbb{E}_{x, \epsilon \sim \mathcal{N}(0,1), t}\big[\, \lVert \epsilon - \epsilon_\theta(z_t, t, c_\theta(y)) \rVert_2^2 \,\big]$,    (1)

where both $c_\theta$ and $\epsilon_\theta$ are jointly optimized during training.
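To make Eq. (1) concrete, the snippet below sketches one training step in PyTorch: sample a timestep, noise the input, and regress the injected noise. The linear beta schedule and the `eps_model`/`text_encoder` callables are illustrative stand-ins, not the exact Stable Diffusion implementation.

```python
import torch
import torch.nn.functional as F

def diffusion_loss(eps_model, text_encoder, x0_latents, prompt_tokens, num_timesteps=1000):
    """One illustrative training step of Eq. (1): predict the added noise with MSE.

    `eps_model` (epsilon_theta) and `text_encoder` (c_theta) are placeholders for
    any modules with these call signatures.
    """
    b = x0_latents.shape[0]
    t = torch.randint(0, num_timesteps, (b,), device=x0_latents.device)   # random timestep per sample
    eps = torch.randn_like(x0_latents)                                     # epsilon ~ N(0, 1)
    # A simple linear-beta schedule, assumed here purely for illustration;
    # alpha_bar_t gives the cumulative signal/noise mix at step t.
    betas = torch.linspace(1e-4, 0.02, num_timesteps, device=x0_latents.device)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t].view(b, 1, 1, 1)
    z_t = alpha_bar.sqrt() * x0_latents + (1 - alpha_bar).sqrt() * eps     # noisy sample z_t
    cond = text_encoder(prompt_tokens)                                     # c = c_theta(y)
    eps_pred = eps_model(z_t, t, cond)                                     # epsilon_theta(z_t, t, c)
    return F.mse_loss(eps_pred, eps)                                       # ||eps - eps_pred||_2^2
```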
2.3. CLIP

CLIP [7], short for Contrastive Language-Image Pretraining, is a technique developed to approach the zero-shot classification task by learning the contents of an image directly from a raw text description of it rather than from labels (such as the classes found in the ImageNet dataset). By learning from natural language, the resulting model is much easier to scale compared to standard crowd-sourced datasets, thanks to the vast amount of text available on the internet. The representation that is learned with CLIP is tightly connected to language, which enables flexible zero-shot transfer. Given a batch of $N$ (text, image) pairs, CLIP is trained to predict which of the $N \times N$ possible pairings across the batch actually occurred. To do this, CLIP learns a multi-modal embedding space by jointly training an image encoder (based on a vision transformer) and a text encoder to maximize the cosine similarity of the image and text embeddings of the $N$ real pairs, while minimizing the cosine similarity of the $N^2 - N$ incorrect pairings.
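As a rough illustration of this objective, the sketch below computes a symmetric contrastive loss over a batch of paired embeddings; the temperature value and tensor shapes are assumptions made only for the example, not CLIP's exact training configuration.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of N (image, text) pairs.

    `image_emb` and `text_emb` are (N, d) embeddings from the image and text
    encoders; the i-th image is assumed to match the i-th text.
    """
    image_emb = F.normalize(image_emb, dim=-1)        # unit norm: dot product = cosine similarity
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # N x N similarity matrix
    targets = torch.arange(logits.shape[0], device=logits.device)
    loss_i = F.cross_entropy(logits, targets)         # image -> matching text
    loss_t = F.cross_entropy(logits.t(), targets)     # text -> matching image
    return (loss_i + loss_t) / 2
```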
2.4. Latent Diffusion

Latent Diffusion Models are introduced in [5], which proposes to move the diffusion process from the computationally expensive pixel space to a less intensive latent space. Given an image $x \in \mathbb{R}^{H \times W \times 3}$ in RGB space, the encoder $\mathcal{E}$ encodes $x$ into a latent representation $z = \mathcal{E}(x)$, and the decoder $\mathcal{D}$ reconstructs the image from the latent, giving $\tilde{x} = \mathcal{D}(z) = \mathcal{D}(\mathcal{E}(x))$. Thanks to the latent representation enabled by $\mathcal{E}$ and $\mathcal{D}$, likelihood-based modelling becomes a more suitable task, as higher-complexity details are abstracted away and the learning can focus on the important semantic bits of the data. Rather than using an autoregressive, attention-based approach, image-specific inductive biases can be exploited: the underlying U-Net is built primarily from 2D convolutional layers. Different forms of conditioning can be applied during generation, such as image maps and text (which uses CLIP encodings to generate the conditioning tokens); the text-to-image generation process is carried out by feeding a random noise vector and a textual prompt to the denoising U-Net of the model.
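A minimal sketch of the $\mathcal{E}$/$\mathcal{D}$ round trip, using the VAE shipped with Stable Diffusion v1.5 through the diffusers library, is shown below; the 0.18215 scaling factor is the one used by SD v1.x, the model id may need to point to a local copy, and the preprocessing is deliberately simplified.

```python
import numpy as np
import torch
from PIL import Image
from diffusers import AutoencoderKL

# The VAE plays the role of the encoder E and decoder D described above.
vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")

def to_latent(img: Image.Image) -> torch.Tensor:
    """z = E(x): map a 512x512 RGB image to a 4x64x64 latent."""
    x = torch.from_numpy(np.array(img.convert("RGB").resize((512, 512)))).float()
    x = (x / 127.5 - 1.0).permute(2, 0, 1).unsqueeze(0)       # scale to [-1, 1], NCHW
    with torch.no_grad():
        return vae.encode(x).latent_dist.sample() * 0.18215   # SD v1.x latent scaling factor

def from_latent(z: torch.Tensor) -> Image.Image:
    """x~ = D(z): decode a latent back to an RGB image."""
    with torch.no_grad():
        x = vae.decode(z / 0.18215).sample
    x = ((x.clamp(-1, 1) + 1) / 2 * 255).byte()[0].permute(1, 2, 0).numpy()
    return Image.fromarray(x)
```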
2.5. ControlNet

Described in [17], ControlNet is a network structure developed to support additional input conditions in existing diffusion models; rather than controlling the synthesis of images only through text or an input image, ControlNet allows the use of inputs such as Canny edge maps, depth maps, and poses as conditions for the denoising process, even combining them in the same process, allowing for an increased level of control over the output.

ControlNet works by creating a trainable copy and a locked copy of an existing large diffusion model; the locked copy preserves the network capabilities learned from billions of images, while the trainable copy is trained on task-specific datasets to learn the conditional control. The two networks are then connected using a new type of convolution layer called zero convolution. Only the first half of the denoising U-Net is trained, and its encoder blocks are connected to their respective decoder blocks through zero convolutions. Video ControlNet [18] proposes an approach that enhances temporal consistency when converting an existing video using Stable Diffusion.


3. Method
Modern diffusion models can increasingly produce photo-realistic images through conditional generation, images that the human eye can hardly tell apart from real ones. The most common form of conditioning is through text (the 'prompt'). By encoding text and using the resulting encodings in the cross-attention layers of the denoising U-Net as conditioning, it is possible to influence the generation process toward a desired outcome. In most cases, however, the amount of control we can exert over the output is limited and requires either specialized prompt engineering or fine-tuning to teach the model how to better represent the desired concept. Extensive fine-tuning can be prohibitively expensive and requires multiple GPU hours on a cluster. To solve this problem, techniques such as DreamBooth [2] have been developed to quickly add new themes or styles to an existing large diffusion model.

The intuition behind our approach is that learning a shot type is similar in a way to learning a style (if a painter always painted portraits, his "style" would always have the subject close to the camera), and as such we can use DreamBooth to teach an existing Latent Diffusion Model what different shot types are.

Figure 1 outlines the basic steps we adopted to fine-tune the model. The particular DreamBooth implementation we used leverages Low Rank Adaptation (LoRa) [19] to significantly reduce training time and more easily create shareable checkpoints. The entire process consists of creating a well-constructed dataset, since the quality of the training images and labels greatly affects the output model, selecting a base model for fine-tuning, and creating a $\Delta W$. We refer to the base model as $W$ and the fine-tuned model as $W'$, such that $W' = W + \Delta W$. $\Delta W$ contains the learned weights that can then be invoked during inference to be applied to the selected base.

Figure 1: A visualization of the finetuning process using LoRa DreamBooth. To create basic captioning that required minimal human work, Blip2 was used. Labels for shot types were added by hand due to the small number of pictures necessary.

3.1. Training set creation

The training set that is used when finetuning a pre-trained diffusion model is one of the most important contributors to the output quality. As the model learns to reproduce the contents of the training set, having high-quality samples improves the generated image quality as well. Another important aspect of the training set is the caption associated with each image. The way DreamBooth adds knowledge to a pre-trained model is by learning the concepts of the input image that the original model doesn't already possess in its prior knowledge. In our case, the caption associated with each shot should include a highly accurate description of the shot, so that the model picks up the concept of the shot scale and not other already known ones. To reach this goal, which is the creation of a task-specific training set, we define a 5-step approach that can be applied to any large dataset of movie shots. (i) Data Collection: the first step is to acquire a large enough dataset to use as a base; movie shot datasets have a wide range of image quality, so it is suggested to start from a large enough one in order to have a guarantee of having enough high-quality samples. (ii) Filtering: depending on the metadata available for the chosen dataset, filtering out the lower-quality images, even with arbitrary filters, can largely improve the speed of the subsequent steps. (iii) Cropping: the required aspect ratio for images when fine-tuning Stable Diffusion is 1:1, with the most used sizes being 768 × 768, 512 × 512 and 256 × 256. By using a content-aware cropping method it is possible to obtain the necessary image size quickly while keeping the most important part of the shot. (iv) Labeling and shot selection: as there is no precise enough approach for automatic shot labelling, and the shots require close supervision for the quality of the image and the crop, labelling by hand becomes a necessity. By sampling without repetition from the available pool of images and assigning the correct label, it is possible to quickly handpick and label the necessary shots, which should range between 100 and 200 images per style. A good variety of movies should be kept so as not to teach unwanted subjects. (v) Captioning: once the required number of images per shot scale is reached, a first basic caption can be generated using models such as blip-2 [20], which also have the advantage of generating captions that resemble the CLIP description style. Once again, human supervision is highly suggested for the generated captions.

Once the dataset is correctly prepared, the training can begin.
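The sketch below illustrates steps (ii) and (iii) of the pipeline; a plain center crop stands in for the content-aware cropping we actually used, and the paths and filtering rule are illustrative.

```python
from pathlib import Path
from PIL import Image

SIZE = 512  # target square resolution used for fine-tuning

def center_square_crop(img: Image.Image, size: int = SIZE) -> Image.Image:
    """Plain 1:1 center crop; a content-aware cropper could be substituted here."""
    w, h = img.size
    side = min(w, h)
    left, top = (w - side) // 2, (h - side) // 2
    return img.crop((left, top, left + side, top + side)).resize((size, size))

def prepare_shots(src_dir: str, dst_dir: str) -> None:
    """Steps (ii)-(iii): drop non-RGB frames, crop the rest to 512x512."""
    out = Path(dst_dir)
    out.mkdir(parents=True, exist_ok=True)
    for path in Path(src_dir).glob("*.jpg"):
        img = Image.open(path)
        if img.mode != "RGB":          # filtering step: e.g. grayscale frames are skipped
            continue
        center_square_crop(img).save(out / path.name)
```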
3.2. Model Training

In order to finetune the LDM we used DreamBooth [2]. The idea behind DreamBooth is, given a few input images (≈ 3-5), to bind the subject to a unique identifier such that, when it is used in the prompt along with the class it belongs to (e.g. "A [V] dog"), the prior knowledge of the class is used along with the new information to reconstruct the subject. A new autogenous class-specific prior preservation loss is introduced on top of the regular training objective to encourage diversity and counter language drift. During training, the model is supervised with its own generated samples in order to retain the prior knowledge of the class and to use it along with the knowledge of the subject instance to generate new samples.
By itself, DreamBooth already manages to significantly decrease the cost of adding a subject to an existing model. But, as a further optimization, we used Low Rank Adaptation [19] applied to the DreamBooth process [21]. LoRa allows efficient finetuning even on low-power devices while keeping a high-quality end result. Instead of training the entire model, LoRa works by finetuning the residual, i.e. training $\Delta W$ instead of $W$:

    $W' = W + \Delta W$    (2)

Through matrix decomposition it is possible to further decrease the number of parameters to finetune, hence reducing the size of the output model by an even larger degree:

    $\Delta W = A B^{T}$    (3)

Tuning only the parameters of the cross-attention layers in the denoising U-Net of Stable Diffusion is enough to obtain the desired output.

Given an existing diffusion model $W$, a LoRa is applied on top of it in the form $W' = W + \alpha \Delta W$: when $\alpha$ is 0 the model is the same as the original one, and when $\alpha$ is 1 the model is the same as the fully finetuned one. Applying this form of optimization to DreamBooth makes it possible to achieve two primary goals: faster and less complex training, and a lightweight and more versatile output.
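As an illustration of Eq. (2) and (3), the following sketch wraps a single frozen linear layer with a trainable low-rank residual; the actual LoRa DreamBooth implementation [21] applies the same idea to the cross-attention projections of the denoising U-Net, so the class below is only a minimal stand-in.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer W with a trainable low-rank residual
    DeltaW = A @ B^T (Eq. 3), applied as W' = W + alpha * DeltaW (Eq. 2)."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():        # the original weights stay frozen
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(base.out_features, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.in_features, rank))  # zero init: DeltaW starts at 0
        self.alpha = alpha

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        delta_w = self.A @ self.B.t()           # low-rank DeltaW, shape (out_features, in_features)
        return self.base(x) + self.alpha * nn.functional.linear(x, delta_w)
```

Only A and B are trained, so the shareable checkpoint contains just the residual, which is what keeps the output file small and makes the $\alpha$ blending at inference time possible.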
Once the training phase is finished, an output file is produced which contains the weights learned during training. The model is then used alongside the original one that was used as a base during the finetuning process (in this case stable-diffusion-v1-5) to synthesize images.

Figure 2: prompt: "a high-quality close_shot picture of a woman holding a cup of coffee in front of a brick building $\alpha \Delta W$"

In our specific case, no unique identifier was specified during training; by not binding the concept to a specific token, the model always generates in the trained style (or shot type, in our case) when the $\Delta W$ model is specified in the prompt.

The caption of Figure 2 is the prompt that was used to generate the picture. The token "$\alpha \Delta W$" is a placeholder control sequence added to the prompt to apply the weights and layers from the LoRa ($\Delta W$, closeshot in this case) to the pretrained full model being used for the generation, with weight $\alpha$.

3.3. Generation

Once the model is successfully trained, the generative process can begin. Generation is performed by providing the model with a series of parameters along with a textual prompt describing the scene. The prompt can be placed either in the positive field, where the generation is moved towards the conditioning, or in the negative field, where the model generates away from the concepts specified there. Prompt engineering plays a big role in the generative process, with certain prompt terms such as "high quality" and "masterpiece" guiding the generated image towards more aesthetically pleasing results. The most meaningful generation parameters are:

  • Sampler: at each step of the diffusion process a certain amount of noise is predicted and subtracted from the image. The sampler takes care of both computing the predicted noise and scheduling the noise level at each sampling step so that an equally noisy image can be sampled. Many samplers are available, each with different benefits.
  • Steps: changes how much noise is subtracted from the image at each step; the larger the number of steps, the slower the generation process is, but finer details might be developed this way.
  • CFG Scale: short for Classifier-Free Guidance scale; classifier-free guidance is a technique that moves the generated samples away from random unlabeled ones, essentially making the generated image adhere more to the provided prompt.
  • Seed: determines the initial noise map; different seeds will result in different images.

Furthermore, the value $\alpha$ that determines how strongly the $\Delta W$ model weights are applied plays an important role in the generative process. As there is no deterministically perfect way to train a DreamBooth model, sometimes lowering how much influence the finetune has can improve results.
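Putting these parameters together (using the values later listed in Table 2), a generation call with the diffusers library might look as follows. The LoRA checkpoint path, the negative prompt, the $\alpha$ value, and the scheduler configuration (chosen here as an approximation of the DPM++ SDE Karras sampler) are assumptions for the sake of the example.

```python
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

# Load the frozen base model and attach one shot-type LoRA (DeltaW) on top of it.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config, algorithm_type="sde-dpmsolver++", use_karras_sigmas=True
)  # roughly the "DPM++ SDE Karras" sampler
pipe.load_lora_weights("./closeshot_lora")  # hypothetical path; needs a recent diffusers release

image = pipe(
    prompt="a high-quality close_shot picture of a woman holding a cup of coffee",
    negative_prompt="blurry, low quality",              # concepts to steer away from
    num_inference_steps=16,                             # steps
    guidance_scale=6.0,                                 # CFG scale
    height=512, width=512,
    generator=torch.Generator("cpu").manual_seed(42),   # seed -> initial noise map
    cross_attention_kwargs={"scale": 0.8},              # alpha: how strongly DeltaW is applied
).images[0]
image.save("close_shot.png")
```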


4. Preliminary Experiments

4.1. Training Set

Among the many available movie repositories, FILMGRAB¹ was chosen as it provides high-quality, hand-picked movie frames.

¹ Open source for research purposes.

We began by collecting 127,000 shots from 2,166 movies. All the pictures with fewer than 3 color channels were pruned, as well as the ones coming from movies released before 2013, to guarantee a certain degree of image quality and resolution. The shots were then cropped using content-aware image cropping to the size of 512 × 512 pixels because of computational constraints. Out of the remaining 41,750, only 600 (200 per shot type) were to be selected. As the number of required pictures is relatively small, shot-type selection and labelling were performed by hand. Randomization was achieved by sampling single shots from all the available ones and assigning a label, adding a shot to the training set if and only if its quality and crop were deemed appropriate. As the training set is small, the training is very sensitive to bad samples.
The final step was adding textual captions. To aid in the captioning process, the Vision-Language model blip2-flan-t5-xl [20] was used to generate a first CLIP [7] style caption with human supervision.
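A captioning pass with blip2-flan-t5-xl through the transformers library could be sketched as follows; the prompt template mirrors the one used for generation, while the shot-type argument and the generation settings are illustrative, and the draft caption is still reviewed by a human as described above.

```python
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xl")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xl")

def draft_caption(path: str, shot_type: str) -> str:
    """Produce a first caption for one labelled frame, to be checked by hand."""
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=30)
    text = processor.batch_decode(out, skip_special_tokens=True)[0].strip()
    return f"a high-quality {shot_type} picture of {text}"
```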
4.2. Testing Set

The dataset used for testing is composed of 1800 shots sampled from the filtered 41,750 shots, evenly distributed between the three shot types (long shot, medium shot, close shot), and their respective captions generated using BLIP2 [20]; the generated captions were deliberately left unsupervised for testing purposes. The collected captions were then randomly sampled and used to generate two pictures from the same starting seed N times, one with and one without training, for a total of 1500 pairs of "trained" and "non-trained" images, evenly split between shot types, with the generation parameters listed in Table 2.

Table 2
The parameters used for generation during testing.

  sampler      DPM++ SDE Karras
  steps        16
  seed         random
  cfg_scale    6
  prompt       a high-quality [shot_type] picture of [caption]
  size         512 × 512

4.3. Metrics

To get a quantitative result, two metrics were adopted following in the footsteps of the original DreamBooth [2] implementation. The first one is CLIP-T [7], the average pairwise cosine similarity between the CLIP embeddings of the generated image and the prompt that generated it. The second metric, DINO [8], measures the average pairwise cosine similarity between the ViT-S/16 DINO embeddings of generated and real images, essentially measuring how similar the generated image is to its real counterpart.

Table 3
Results for the CLIP-T and DINO metrics on the 1500-pair test.

             CLIP-T    DINO
  baseline   0.3221    0.4163
  ours       0.3269    0.4989
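One possible way to compute the two scores for a single image pair, using publicly available CLIP and DINO ViT-S/16 checkpoints (the specific CLIP variant is assumed here, not prescribed by the paper), is sketched below; the per-pair values would then be averaged over the test set.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor, ViTImageProcessor, ViTModel

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
dino = ViTModel.from_pretrained("facebook/dino-vits16")
dino_proc = ViTImageProcessor.from_pretrained("facebook/dino-vits16")

def clip_t(generated: Image.Image, prompt: str) -> float:
    """Cosine similarity between CLIP embeddings of the generated image and its prompt."""
    with torch.no_grad():
        img = clip.get_image_features(**clip_proc(images=generated, return_tensors="pt"))
        txt = clip.get_text_features(**clip_proc(text=[prompt], return_tensors="pt", padding=True))
    return torch.nn.functional.cosine_similarity(img, txt).item()

def dino_score(generated: Image.Image, real: Image.Image) -> float:
    """Cosine similarity between ViT-S/16 DINO (CLS) embeddings of generated and real frames."""
    with torch.no_grad():
        a = dino(**dino_proc(images=generated, return_tensors="pt")).last_hidden_state[:, 0]
        b = dino(**dino_proc(images=real, return_tensors="pt")).last_hidden_state[:, 0]
    return torch.nn.functional.cosine_similarity(a, b).item()
```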
The results shown in Table 3 show a slight (although significant for the considered metrics) increase for both the CLIP-T and DINO scores over the baseline model. The lower increase seen in CLIP-T compared to the DINO metric is justified: with our finetuning the model does not learn to represent more concepts (so from a CLIP perspective the objects present in the picture are the same), but instead learns to represent them closer to the training images, especially from a camera-distance perspective. From a qualitative analysis, it appears that the fine-tuned model is more often able to generate images that are semantically close to the prompt used to generate them. Sometimes it even generates elements present in the prompt that the baseline model ignored (e.g., a second person when two were specified, a car that is otherwise not present). In addition, since there is no free lunch, although it has not been tested on other tasks, we expect the fine-tuned model to perform worse on other generative tasks, and in the generated examples we can see that it more often generates faces similar to those shown during training.

As a secondary, ablation study, 600 additional image pairs were generated using the same setup as before, but removing all information regarding the shot type from the text conditioning. Looking at the results of the DINO score in Table 4, it can be seen that the images generated with the fine-tuned model still have a higher DINO score than the baseline, indicating that the model generates images at the specific fine-tuning scale even without guidance.

Table 4
Results for the CLIP-T and DINO metrics on the ablation test.

             CLIP-T    DINO
  baseline   0.3214    0.4014
  ours       0.3234    0.4803

4.4. Qualitative Survey

In addition, we conducted a survey of human subjects. Each subject was shown a total of 36 pairs of images A and B generated with the same settings and prompt, one from the baseline model and one from the finetuned one. Whether an image was labelled A or B was randomized. The generated pairs were lightly monitored to ensure that the images were safe for all viewers. Each image pair was shown along with its associated shot type and generation prompt. For each image pair, three questions were asked: (i) Which image do you like best?; (ii) Which image corresponds more to the associated shot type?; (iii) Which image corresponds more to the associated prompt?

The possible answers for each question were A, B, or neither/same if the two images were considered equivalent in that respect. A total of 55 subjects responded to the survey, and the results are reported in Table 5. It can be seen that even with human evaluation, our approach generates images that are more appealing and closer to the associated shot type and prompt in about half or more of the cases.

Table 5
Results collected from the survey. The scores are expressed as percentages over the total number of answers.

  question                                               baseline    ours    same / neither
  Which picture do you like most?                           26.18   57.43             16.40
  Which picture is closer to the associated shot type?      20.46   56.84             22.70
  Which picture is closer to the associated prompt?         20.35   49.31             30.34

Aside from image likability, the baseline model obtained the lowest score of the three options, indicating that when our model is not preferred, the generation is most often judged of equal quality to the one without fine-tuning. The results are consistent when comparing the survey to the CLIP-T and DINO metrics: the higher likeability and shot-type closeness are directly related to DINO and are noticeably higher, compared to the baseline, than prompt closeness and CLIP-T.
Figure 3: Some examples of the generation of the same subject with the three different trainings (close, medium, and long shot) with different levels of $\alpha$.

5. Conclusions and Future Developments

We have presented an approach that uses novel techniques such as DreamBooth and LoRa to finetune an existing latent diffusion model to generate specific shot types. Based on the intuition that learning a shot type is similar to learning a style, which DreamBooth was shown to be capable of, we achieve improvements in both compliance and similarity of reference images by using only 200 images for each shot type, as shown by CLIP-T, DINO, and even human evaluation metrics. We test our approach on a storyboarding task, showing the potential uses of modern LDMs in video production, mainly when supported by domain-specific training. Furthermore, novel techniques such as ControlNet open the door to even more specific conditioning forms. Developments such as [18] show the power that ControlNet offers, and applying the technique for cinematic purposes could be an interesting development point. Regarding our work, as DreamBooth training is far from a solved task, more tests could yield even better results.

References

 [1] Stability AI, Stable Diffusion release blog post, https://stability.ai/blog/stable-diffusion-public-release, 2022 (accessed 23-May-2023).
 [2] N. Ruiz, Y. Li, V. Jampani, Y. Pritch, M. Rubinstein, K. Aberman, DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation, 2023. arXiv:2208.12242.
 [3] R. Gal, Y. Alaluf, Y. Atzmon, O. Patashnik, A. H. Bermano, G. Chechik, D. Cohen-Or, An image is worth one word: Personalizing text-to-image generation using textual inversion, 2022. arXiv:2208.01618.
 [4] J. Ho, A. Jain, P. Abbeel, Denoising diffusion probabilistic models, 2020. arXiv:2006.11239.
 [5] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, B. Ommer, High-resolution image synthesis with latent diffusion models, 2022. arXiv:2112.10752.
 [6] B. Rooney, K. E. Bálint, Watching more closely: Shot scale affects film viewers' theory of mind tendency but not ability, Frontiers in Psychology 8 (2018). doi:10.3389/fpsyg.2017.02349.
 [7] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever, Learning transferable visual models from natural language supervision, 2021. arXiv:2103.00020.
 [8] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, A. Joulin, Emerging properties in self-supervised vision transformers, 2021. arXiv:2104.14294.
 [9] E. Molad, E. Horwitz, D. Valevski, A. R. Acha, Y. Matias, Y. Pritch, Y. Leviathan, Y. Hoshen, Dreamix: Video diffusion models are general video editors, arXiv preprint arXiv:2302.01329 (2023).
[10] U. Singer, A. Polyak, T. Hayes, X. Yin, J. An, S. Zhang, Q. Hu, H. Yang, O. Ashual, O. Gafni, D. Parikh, S. Gupta, Y. Taigman, Make-a-video: Text-to-video generation without text-video data, ArXiv abs/2209.14792 (2022).
[11] Storyboarder, https://wonderunit.com/storyboarder/.
[12] Storyboardthat, https://www.storyboardthat.com/.
[13] Studiobinder, https://www.studiobinder.com/storyboard-creator/.
[14] Y. Li, Z. Gan, Y. Shen, J. Liu, Y. Cheng, Y. Wu, L. Carin, D. Carlson, J. Gao, StoryGAN: A sequential conditional GAN for story visualization, 2019. arXiv:1812.02784.
[15] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial networks, 2014. arXiv:1406.2661.
[16] A. Rao, X. Jiang, Y. Guo, L. Xu, L. Yang, L. Jin, D. Lin, B. Dai, Dynamic storyboard generation in an engine-based virtual environment for video production, ArXiv abs/2301.12688 (2023).
[17] L. Zhang, M. Agrawala, Adding conditional control to text-to-image diffusion models, 2023. arXiv:2302.05543.
[18] E. Chu, S.-Y. Lin, J.-C. Chen, Video ControlNet: Towards temporally consistent synthetic-to-real video translation using conditional image diffusion models, 2023. arXiv:2305.19193.
[19] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, LoRA: Low-rank adaptation of large language models, 2021. arXiv:2106.09685.
[20] J. Li, D. Li, S. Savarese, S. Hoi, BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023. arXiv:2301.12597.
[21] S. R. (cloneofsimo), lora, https://github.com/cloneofsimo/lora, 2023.