<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">DreamShot: Teaching Cinema Shots to Latent Diffusion Models</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Tommaso</forename><surname>Massaglia</surname></persName>
							<email>tommaso.massaglia@studenti.polito.it</email>
							<affiliation key="aff0">
								<orgName type="institution">Polytechnic of Turin</orgName>
								<address>
									<addrLine>24 Corso Duca degli Abruzzi</addrLine>
									<postCode>10129</postCode>
									<settlement>Turin</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Bartolomeo</forename><surname>Vacchetti</surname></persName>
							<email>bartolomeo.vacchetti@polito.it</email>
							<affiliation key="aff0">
								<orgName type="institution">Polytechnic of Turin</orgName>
								<address>
									<addrLine>24 Corso Duca degli Abruzzi</addrLine>
									<postCode>10129</postCode>
									<settlement>Turin</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Tania</forename><surname>Cerquitelli</surname></persName>
							<email>tania.cerquitelli@polito.it</email>
							<affiliation key="aff0">
								<orgName type="institution">Polytechnic of Turin</orgName>
								<address>
									<addrLine>24 Corso Duca degli Abruzzi</addrLine>
									<postCode>10129</postCode>
									<settlement>Turin</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">DreamShot: Teaching Cinema Shots to Latent Diffusion Models</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">449F92F2B02A8D09F5B854A9EAA1A3F8</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:18+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Diffusion Models</term>
					<term>Shot Types</term>
					<term>text to image</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In recent years, several text-to-image synthesis models have been released that are increasingly capable of synthesizing realistic images that closely follow the input text. Among the various state-of-the-art techniques and models, the introduction of the open-source latent diffusion model Stable Diffusion [1] has led to significant developments in text-to-image generation in recent months. Techniques such as DreamBooth [2] and Textual Inversion [3] make it possible to further refine and control the generation process to produce even more specific output than text alone would allow. We test this approach for generating three specific cinematographic shot types: Close-up, Medium Shot, and Long Shot. By fine-tuning Stable Diffusion 1.5 on a small dataset of 600 labelled and captioned film frames, we achieve a noticeable increase in CLIP-T and DINO scores and an overall noticeable qualitative improvement (as indicated by our human-run evaluation survey) in image likability, prompt compliance, and shot type correctness.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Image generation has seen a major rise in popularity since the release of the Diffusion Model <ref type="bibr" target="#b3">[4]</ref> architecture, with improvements in generation quality that bring the pictures ever closer to realistic art pieces and photos. Being able to generate realistic pictures that follow a given textual description with models such as the Latent Diffusion <ref type="bibr" target="#b4">[5]</ref> based Stable Diffusion <ref type="bibr" target="#b0">[1]</ref> opens up a multitude of previously unattainable tasks, further improved by the ability to add new subjects in a simple way provided by DreamBooth <ref type="bibr" target="#b1">[2]</ref>. By combining these two techniques it would be possible to, for example, automatically generate an advertising campaign for a novel product or perform seamless photo editing through textual instructions. Notably, cinema heavily relies on the use and creation of reference images to enhance workflow efficiency. With the capacity to generate realistic images, producing expressive reference images that precisely convey the intended shot becomes readily accessible to all, eliminating the need for an extensive reference library or artistic drawing skills. These reference images and sketches are widely employed in storyboarding, an essential film-making technique that aids in visualizing the narrative and streamlining the filming process. Within this context, the selection of the desired shot type plays an important role, as it significantly influences the audience's focus and emotions <ref type="bibr" target="#b5">[6]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 1</head><p>Total number of the top 100 models hosted on Civitai and their respective download counts. To the best of our knowledge, the use of text-to-image generation models and their fine-tuning in this context remains widely unexplored. In this paper, we explore the use of DreamBooth <ref type="bibr" target="#b1">[2]</ref> (the most widely used fine-tuning approach for pre-trained Latent Diffusion models, as shown in Table <ref type="table">1</ref>) to add the knowledge of three specific shot types, close shot, medium shot, and long shot, to a pre-trained version of stable-diffusion-v-1-5 <ref type="bibr" target="#b0">[1]</ref>. Given a textual input and a desired shot scale, our methodology is able to generate synthetic scenes that are semantically close to both the input and the selected scale. Using the same testing setup proposed in the original DreamBooth <ref type="bibr" target="#b1">[2]</ref> paper, we achieve an improvement over the baseline model in both CLIP-T <ref type="bibr" target="#b6">[7]</ref> and DINO <ref type="bibr" target="#b7">[8]</ref> scores. We complement this testing with a survey conducted on 55 subjects, which further shows the qualitative improvements achieved by our approach. 
Our contributions are the following: the outline of a methodological approach to fine-tuning an existing latent diffusion model with state-of-the-art techniques (DreamBooth) to teach it a new style; the steps necessary to build a training set out of unlabeled movie shots in order to fine-tune a pre-trained model; and a set of three fine-tuned models catered towards the generation of three specific shot types: close shot, medium shot, and long shot.</p><p>The paper is organized as follows: Section 2 discusses the methods exploited in the proposed methodology; Section 3 covers the methodology and describes the techniques on which our approach relies; Section 4 outlines the testing procedure, metrics used, and relevant results.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Works</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Storyboarding</head><p>In recent years, a growing number of studies have focused on the automation of video editing tasks. While these works, such as <ref type="bibr" target="#b8">[9]</ref> and <ref type="bibr" target="#b9">[10]</ref>, achieve impressive performance in the generation of a video, given as input either a textual prompt <ref type="bibr" target="#b9">[10]</ref> or a combination of textual prompt and image <ref type="bibr" target="#b8">[9]</ref>, they focus on the generation of motion and do not take into account the shot type used.</p><p>By generating more scenographic shots, one of the many applications that become available is text-to-image storyboard creation. Existing storyboarding tools either extend digital painting applications (e.g. <ref type="bibr" target="#b10">[11]</ref>), allow the user to place predetermined objects in a scene to compose the desired frame (e.g. <ref type="bibr" target="#b11">[12]</ref>), or provide a simple interface to create a reference of the desired scene (e.g. <ref type="bibr" target="#b12">[13]</ref>).</p><p>Among deep learning-based approaches, StoryGAN <ref type="bibr" target="#b13">[14]</ref> generates a sequence of images that describe a story written as a multi-sentence paragraph. To do this, the proposed framework uses a sequential Generative Adversarial Network <ref type="bibr" target="#b14">[15]</ref> that consists of a Story Encoder, an RNN-based Context Encoder, an image generator conditioned on the story context, and an image/story discriminator that ensures consistency. Diffusion Models allow for high-quality generation across multiple domains without needing domain-specific training, and offer a better understanding of the conditional text input than GANs. 
The conditioning based on previous frames could be a possible approach for increased temporal consistency in LDMs as well.</p><p>Dynamic Storyboarding <ref type="bibr" target="#b15">[16]</ref> approaches the storyboarding task directly by automatically composing scenes out of user inputs, simulating the scene in a virtual environment and selecting the best proposal out of the available ones. This approach generates rich and complex dynamic (video) storyboards, but it lacks the customizability and intuitiveness that Diffusion Models offer through textual conditioning. Furthermore, by using ControlNet-trained networks it's possible to add conditioning through more inputs such as scribbles, which, at the cost of a slightly higher effort, can lead to much better generations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Text-to-Image Diffusion Model</head><p>Diffusion models are a class of probabilistic generative models that generate samples from a learned distribution by reversing the "diffusion process", modeled as a Markov process of gradual Gaussian noise addition. The generative process is carried out by gradually removing noise from a random initial sample. A text-to-image diffusion model 𝜖 𝜃 , given a noise map 𝑧𝑡 ∼ 𝒩 (0, 1) at timestep 𝑡 and a conditioning vector 𝑐 = 𝜏 𝜃 (𝑦) generated using text encoder 𝜏 𝜃 and prompt 𝑦, generates an image 𝜖 𝜃 (𝑧𝑡, 𝑡, 𝜏 𝜃 (𝑦)). During training, the noise predicted using the conditioning 𝜏 𝜃 (𝑦) is compared to the ground-truth noise 𝜖. The loss is computed as:</p><formula xml:id="formula_0">𝐿𝐷𝑀 = E 𝑥,𝜖∼𝒩 (0,1),𝑡 [||𝜖 − 𝜖 𝜃 (𝑧𝑡, 𝑡, 𝜏 𝜃 (𝑦))|| 2 2 ],<label>(1)</label></formula><p>where both 𝜏 𝜃 and 𝜖 𝜃 are jointly optimized during training.</p></div>
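As a concrete illustration, the objective of Eq. (1) is a mean squared error between the true and predicted noise. The sketch below uses NumPy with a toy zero-predicting stand-in for the conditioned U-Net 𝜖 𝜃 (an assumption for illustration only, not the real network):

```python
import numpy as np

rng = np.random.default_rng(0)

def diffusion_loss(eps_true, eps_pred):
    """Mean squared error of Eq. (1): ||eps - eps_theta(z_t, t, tau(y))||^2."""
    return float(np.mean((eps_true - eps_pred) ** 2))

def toy_denoiser(z_t, t, cond):
    """Toy stand-in for the conditioned U-Net eps_theta: predicts zero noise."""
    return np.zeros_like(z_t)

eps = rng.standard_normal((4, 4))    # ground-truth noise added at timestep t
z_t = eps                            # toy noised latent (pure noise here)
loss = diffusion_loss(eps, toy_denoiser(z_t, t=10, cond=None))
```

A perfect denoiser would drive this loss to zero; the real 𝜖 𝜃 is trained to approach that by gradient descent.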
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">CLIP</head><p>CLIP <ref type="bibr" target="#b6">[7]</ref>, short for Contrastive Language Image Pretraining, is a technique developed to approach the zero-shot classification task by learning the contents of an image directly from raw text descriptions of it rather than from labels (such as the classes found in the ImageNet dataset). By learning from natural language, the resulting model is much easier to scale compared to standard crowd-sourced datasets, thanks to the vast amount of text available on the internet. The representation learned with CLIP is tightly connected to language, which enables flexible zero-shot transfer. Given a batch of 𝑁 (text, image) pairs, CLIP is trained to predict which of the 𝑁 ×𝑁 possible pairings across the batch actually occurred. To do this, CLIP learns a multi-modal embedding space by jointly training an image encoder (based on a vision transformer) and a text encoder to maximize the cosine similarity of the image and text embeddings of the 𝑁 real pairs, while minimizing the cosine similarity of the 𝑁 2 − 𝑁 incorrect pairings.</p></div>
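The contrastive objective described above can be sketched as a symmetric cross-entropy over the 𝑁 × 𝑁 cosine-similarity matrix, with the matching pairs on the diagonal. The following is a minimal NumPy illustration, not the actual CLIP implementation (the temperature value is an illustrative assumption):

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric cross-entropy over the N x N cosine-similarity matrix:
    the i-th image should match the i-th caption (diagonal targets)."""
    logits = l2_normalize(img_emb) @ l2_normalize(txt_emb).T / temperature
    n = logits.shape[0]

    def xent(l):
        # cross-entropy with the correct class on the diagonal
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[np.arange(n), np.arange(n)].mean()

    # average over the image->text and text->image directions
    return (xent(logits) + xent(logits.T)) / 2
```

Well-aligned embeddings (matched image/text pairs close, mismatched pairs far) yield a small loss, which is the property the joint training optimizes for.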
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4.">Latent Diffusion</head><p>Latent Diffusion Models are introduced in <ref type="bibr" target="#b4">[5]</ref>, which proposes to move the diffusion process from the computationally expensive pixel space to a less intensive latent space. Given an image 𝑥 ∈ R 𝐻×𝑊 ×3 in RGB space, the encoder ℰ encodes 𝑥 into a latent representation 𝑧 = ℰ(𝑥), and the decoder 𝒟 reconstructs the image from the latent, giving 𝑥̃ = 𝒟(𝑧) = 𝒟(ℰ(𝑥)). Thanks to the latent representation enabled by ℰ and 𝒟, likelihood-based modelling becomes a more suitable task, as higher-complexity details are abstracted away and the learning can focus on the important semantic bits of the data. Rather than using an autoregressive, attention-based approach, image-specific inductive biases can be taken advantage of: the underlying U-Net is built primarily from 2D convolutional layers. Different forms of conditioning can be applied during generation, such as image maps and text (which uses CLIP encodings to generate the conditioning tokens); the text-to-image generation process is carried out by feeding a random noise vector and a textual prompt to the denoising U-Net of the model.</p></div>
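A minimal sketch of the encode/decode roundtrip may help fix the shapes involved. The pooling "encoder" and nearest-neighbour "decoder" below are toy stand-ins for the learned VAE ℰ and 𝒟; the 8× downsampling factor matches Stable Diffusion's VAE, but everything else is an illustrative assumption:

```python
import numpy as np

def encode(x, f=8):
    """Toy E: average-pool each f x f patch, R^{H x W x 3} -> R^{H/f x W/f x 3}."""
    h, w, c = x.shape
    return x.reshape(h // f, f, w // f, f, c).mean(axis=(1, 3))

def decode(z, f=8):
    """Toy D: nearest-neighbour upsample back to pixel space."""
    return z.repeat(f, axis=0).repeat(f, axis=1)

x = np.random.default_rng(0).random((512, 512, 3))  # image in pixel space
z = encode(x)        # latent: 64 x 64 -- 64x fewer spatial positions
x_rec = decode(z)    # x_tilde = D(E(x)), back at full resolution
```

The point of the construction is visible in the shapes: the diffusion process operates on the small `z`, not the full-resolution `x`, which is what makes training and sampling tractable.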
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.5.">ControlNet</head><p>Described in <ref type="bibr" target="#b16">[17]</ref>, ControlNet is a network structure developed to support additional input conditions in existing diffusion models; rather than controlling the synthesis of images only through text or an input image, ControlNet allows the use of inputs such as Canny edge maps, depth maps, and poses for the denoising process, even combining them in the same process, allowing for an increased level of control over the output.</p><p>ControlNet works by creating a trainable copy and a locked copy of an existing large diffusion model; the locked copy preserves the network capabilities learned from billions of images, while the trainable copy is trained on task-specific datasets to learn the conditional control. The two networks are then connected using a new type of convolution layer called zero convolution. Only the first half of the denoising U-Net is trained, and the encoder blocks are connected to their respective decoder blocks through zero convolutions.</p><p>Video ControlNet <ref type="bibr" target="#b17">[18]</ref> proposes an approach that enhances temporal consistency when converting an existing video using Stable Diffusion.</p></div>
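The key property of a zero convolution is that its weights and bias start at zero, so at initialization the trainable branch contributes nothing and the locked model's behaviour is preserved exactly. A small NumPy sketch (a 1 × 1 convolution expressed as a matrix product; shapes and channel counts are illustrative assumptions):

```python
import numpy as np

def zero_conv_init(c_in, c_out):
    """A 'zero convolution' is a 1x1 conv whose weights and bias are
    initialized to zero, so the trainable branch starts as a no-op."""
    return np.zeros((c_out, c_in)), np.zeros(c_out)

def apply_zero_conv(feat, W, b):
    """Apply a 1x1 convolution: (H, W, c_in) -> (H, W, c_out)."""
    return feat @ W.T + b

locked = np.random.default_rng(0).random((8, 8, 16))  # locked-branch feature map
W, b = zero_conv_init(16, 16)
ctrl = apply_zero_conv(locked, W, b)                  # trainable-branch output
combined = locked + ctrl                              # identity at initialization
```

As training proceeds, gradients move `W` and `b` away from zero, gradually blending the learned conditional control into the frozen model.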
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Method</head><p>Modern diffusion models can increasingly produce photorealistic images through conditional generation that are almost indistinguishable from real ones to the human eye. The most common form of conditioning is through text (called a 'prompt').</p><p>By encoding text and using the resulting encodings as conditioning in the cross-attention layers of the denoising U-Net, it is possible to influence the generation process toward a desired outcome. In most cases, however, the amount of control we can exert over the output is limited and requires either specialized prompt engineering or fine-tuning to teach the model how to better represent the desired concept. Extensive fine-tuning can be prohibitively expensive and requires multiple GPU hours on a cluster. To solve this problem, techniques such as DreamBooth <ref type="bibr" target="#b1">[2]</ref> have been developed to quickly add new themes or styles to an existing large diffusion model.</p><p>The intuition behind our approach is that learning a shot type is similar in a way to learning a style (if a painter always painted portraits, his "style" would always have the subject close to the camera), and as such we can use DreamBooth to teach an existing Latent Diffusion Model what different shot types are.</p><p>Figure <ref type="bibr" target="#b0">(1)</ref> outlines the basic steps we adopted to fine-tune the model. The particular DreamBooth implementation we used leverages Low Rank Adaptation (LoRa) <ref type="bibr" target="#b18">[19]</ref> to significantly reduce training time and more easily create shareable checkpoints. The entire process consists of creating a well-constructed dataset, since the quality of the training images and labels greatly affects the output model, selecting a base model for fine-tuning, and creating a ∆𝑊 . We refer to the base model as 𝑊 and the fine-tuned model as 𝑊 ′ , such that 𝑊 ′ = 𝑊 + ∆𝑊 . 
∆𝑊 contains the learned weights that can then be invoked during inference to be applied to the selected base.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Training set creation</head><p>The training set used when finetuning a pre-trained diffusion model is one of the most important contributors to output quality. As the model learns to reproduce the contents of the training set, high-quality samples improve the generated image quality as well. Another important aspect of the training set is the caption associated with each image. DreamBooth adds knowledge to a pre-trained model by learning the concepts of the input image that the original model doesn't already possess in its prior knowledge. In our case, the caption associated with each shot should include a highly accurate description of it, so that the model picks up the concept of the shot scale and not other, already known concepts. To reach this goal, the creation of a task-specific training set, we define a five-step approach that can be applied to any large dataset of movie shots. (i) Data Collection: the first step is to acquire a large enough dataset to use as a base; movie shot datasets have a wide range of image quality, so it's suggested to start from a large one in order to guarantee enough high-quality samples. (ii) Filtering: depending on the metadata available for the chosen dataset, filtering out the lower-quality images, even with arbitrary filters, can largely improve the speed of the subsequent steps. (iii) Cropping: the required aspect ratio for images when finetuning Stable Diffusion is 1:1, with the most used sizes being 768 × 768, 512 × 512 and 256 × 256. By using a content-aware cropping method it's possible to obtain the necessary image size quickly while keeping the most important part of the shot. 
(iv) Labeling and shot selection: as there is no sufficiently precise approach for automatic shot labelling, and the shots require close supervision for the quality of the image and the crop, labelling by hand becomes a necessity. By sampling without repetition from the available pool of images and assigning the correct label, it's possible to quickly handpick and label the necessary shots, which should range between 100 and 200 per shot type. A good variety of movies should be kept so as not to teach unwanted subjects. (v) Captioning: once the required number of images per shot scale is reached, a first basic caption can be generated using models such as blip-2 <ref type="bibr" target="#b19">[20]</ref>, which also have the advantage of generating captions that resemble the CLIP description style. Once again, human supervision is highly suggested for the generated captions.</p><p>Once the dataset is correctly prepared, the training can begin.</p></div>
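The cropping and captioning steps above can be sketched roughly as follows. The center-crop-plus-stride resize is a simplistic stand-in for the content-aware cropper, and the caption template mirrors the prompt format used later during generation; all names here are illustrative, not the paper's code:

```python
import numpy as np

def center_crop_square(img, size=512):
    """Crop the largest centered square, then resize by nearest-neighbour
    striding (a stand-in for the content-aware cropper described above)."""
    h, w = img.shape[:2]
    s = min(h, w)
    top, left = (h - s) // 2, (w - s) // 2
    sq = img[top:top + s, left:left + s]
    idx = (np.arange(size) * s) // size   # nearest-neighbour index map
    return sq[idx][:, idx]

def build_record(img, shot_type, caption):
    """One training sample: a 1:1 image plus a shot-type-prefixed caption."""
    assert shot_type in {"close_shot", "medium_shot", "long_shot"}
    return {"image": center_crop_square(img),
            "caption": f"a high-quality {shot_type} picture of {caption}"}

frame = np.zeros((720, 1280, 3))   # a widescreen movie frame
rec = build_record(frame, "close_shot", "a woman holding a cup of coffee")
```

The caption prefix makes the shot-type token appear consistently across all samples of a class, which is what lets the fine-tuning bind it to the shot scale rather than to subjects.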
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Model Training</head><p>In order to finetune the LDM we used DreamBooth <ref type="bibr" target="#b1">[2]</ref>. The idea behind DreamBooth is, given a few input images (≈ 3 − 5), to bind the subject to a unique identifier such that when it is used in the prompt along with the class it belongs to (e.g. "A [V] dog"), the prior knowledge of the class is used alongside the new information to reconstruct the subject. A new autogenous class-specific prior preservation loss is introduced on top of the regular training objective to encourage diversity and counter language drift. During training, the model is supervised with its own generated samples in order to retain the prior knowledge of the class and to use it along with the knowledge of the subject instance to generate new samples.</p><p>By itself, DreamBooth already manages to significantly decrease the cost of adding a subject to an existing model. As a further optimization, we used Low Rank Adaptation <ref type="bibr" target="#b18">[19]</ref> applied to the DreamBooth process <ref type="bibr" target="#b20">[21]</ref>. LoRa allows efficient finetuning even on low-power devices while keeping a high-quality end result. Instead of training the entire model, LoRa works by finetuning the residual: i.e. training ∆𝑊 instead of 𝑊 .</p><formula xml:id="formula_1">𝑊 ′ = 𝑊 + ∆𝑊<label>(2)</label></formula><p>Through matrix decomposition it's possible to further decrease the number of parameters to finetune, reducing the size of the output model by an even larger degree.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><formula>∆𝑊 = 𝐴𝐵 𝑇<label>(3)</label></formula><p>Tuning only the parameters of the cross-attention layers in the denoising U-Net of Stable Diffusion is enough to obtain the desired output.</p><p>Given an existing diffusion model 𝑊 , a LoRa of it is applied on top in the form of 𝑊 ′ = 𝑊 + 𝛼∆𝑊 : when 𝛼 is 0 the model is the same as the original one; when 𝛼 is 1 the model is the same as the fully finetuned one. Applying this form of optimization to DreamBooth achieves two primary goals: faster and less complex training, and a lightweight and more versatile output.</p><p>Once the training phase is finished, an output file is produced which contains the weights learned during training. The model is then used alongside the original one that served as a base during the finetuning process (in this case stable-diffusion-v1-5) to synthesize images. The caption in figure (<ref type="formula" target="#formula_1">2</ref>) is the prompt that was used to generate the picture. The token "𝛼∆𝑊 " is a placeholder control sequence added to the prompt to apply the weights and layers from the LoRa (∆𝑊 , closeshot in this case) to the pretrained full model being used for generation, with weight 𝛼.</p></div>
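Equations (2) and (3), together with the blending weight 𝛼, can be sketched in a few lines of NumPy. The dimensions, rank, and initialization scales below are illustrative assumptions, not the values used in training:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 768, 4    # layer width and LoRa rank (r << d); both illustrative

W = rng.standard_normal((d, d))          # frozen pretrained weight matrix
A = rng.standard_normal((d, r)) * 0.01   # trainable low-rank factors
B = rng.standard_normal((d, r)) * 0.01   # (in practice one factor starts at zero)

def adapted_weight(alpha):
    """W' = W + alpha * delta_W, with delta_W = A @ B.T (Eqs. 2-3)."""
    return W + alpha * (A @ B.T)
```

Only `A` and `B` are trained and shipped in the output file (2·d·r values instead of d²), which is why a LoRa checkpoint is so much smaller than the full model it adapts.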
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Generation</head><p>Once the model is successfully trained, the generative process can begin. Generation is performed by providing the model with a series of parameters along with a textual prompt describing the scene. The prompt can be placed either in the positive field, where generation is moved towards the conditioning, or in the negative field, where the model generates away from the concepts specified there. Prompt engineering plays a big role in the generative process, with terms such as "high quality" and "masterpiece" guiding the generated image towards more aesthetically pleasing results. The most meaningful generation parameters are:</p><p>• Sampler: at each step of the diffusion process a certain amount of noise is predicted and subtracted from the image. The sampler takes care of both computing the predicted noise and scheduling the noise level at each sampling step so that an equally noisy image can be sampled. Many samplers are available, with different benefits. • Steps: the number of denoising iterations; the larger the number of steps, the slower the generation process, but finer details may be developed this way. • CFG Scale: short for Classifier Free Guidance scale; classifier-free guidance is a technique that moves the generated samples away from random unconditioned ones, essentially making the generated image adhere more to the provided prompt. • Seed: determines the initial noise map; different seeds will result in different images.</p><p>Furthermore, the value 𝛼, which determines how strongly the ∆𝑊 model weights are applied, plays an important role in the generative process. As there is no deterministically perfect way to train a DreamBooth model, sometimes lowering the influence of the finetune can improve results.</p></div>
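Of the parameters above, the CFG scale has the simplest closed form: the final noise prediction extrapolates from the unconditional prediction towards the text-conditioned one. A minimal sketch (illustrative of the guidance rule only, not the exact sampler code):

```python
import numpy as np

def cfg_noise(eps_uncond, eps_cond, cfg_scale):
    """Classifier-free guidance: push the noise prediction away from the
    unconditional one and towards the prompt-conditioned one."""
    return eps_uncond + cfg_scale * (eps_cond - eps_uncond)
```

A scale of 0 ignores the prompt entirely, 1 uses the conditioned prediction as-is, and larger values (such as the 6 used in our tests) over-weight the prompt direction for stronger adherence.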
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Preliminary Experiments</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Training Set</head><p>Among the many available movie repositories, [FILM-GRAB]<ref type="foot" target="#foot_0">1</ref> was chosen as it provides high-quality, hand-picked movie frames.</p><p>We began by collecting 127,000 shots from 2166 movies. All pictures with fewer than 3 color channels were pruned, as well as those coming from movies released before 2013, to guarantee a certain degree of image quality and resolution. The shots were then cropped to 512 × 512 pixels using content-aware image cropping, because of computational constraints. Out of the remaining 41,750, only 600 (200 per shot type) were selected. As the number of required pictures is relatively small, shot-type selection and labelling were performed by hand. Randomization was achieved by sampling single shots from all the available ones and assigning a label, adding a shot to the training set if and only if the quality and crop were deemed appropriate. As the training set is small, the training is very sensitive to bad samples.</p><p>The final step was adding textual captions. To aid in the captioning process, the Vision-Language model blip2-flan-t5-xl <ref type="bibr" target="#b19">[20]</ref> was used to generate a first CLIP <ref type="bibr" target="#b6">[7]</ref> style caption with human supervision.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Testing Set</head><p>The dataset used for testing is composed of 1800 shots sampled from the filtered 41,750 shots, evenly distributed between the three shot types (long shot, medium shot, close shot), with their respective captions generated using BLIP2 <ref type="bibr" target="#b19">[20]</ref> without human supervision, for testing purposes. The collected captions were then randomly sampled and used to generate two pictures from the same starting seed 𝑁 times, one with and one without training, for a total of 1500 pairs of "trained" and "non-trained" images, evenly split between shot types, with the generation parameters reported in Table 2. </p></div>
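The paired setup relies on the fact that a fixed seed fully determines the initial latent noise map, so the baseline and fine-tuned models start from an identical point and differ only in their weights. A minimal sketch (the latent size assumes Stable Diffusion's 8× VAE downsampling and 4 latent channels; the seed value is arbitrary):

```python
import numpy as np

def initial_noise(seed, size=(512 // 8, 512 // 8, 4)):
    """The seed determines the initial latent noise map, so two models
    sampled from the same seed start from an identical point."""
    return np.random.default_rng(seed).standard_normal(size)

z_baseline = initial_noise(1234)    # fed to the baseline model
z_finetuned = initial_noise(1234)   # fed to the fine-tuned model
```

Any difference between the two generated images in a pair can then be attributed to the fine-tuning rather than to sampling randomness.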
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Metrics</head><p>To obtain a quantitative result, two metrics were adopted following the original DreamBooth <ref type="bibr" target="#b1">[2]</ref> evaluation. The first one is CLIP-T <ref type="bibr" target="#b6">[7]</ref>, the average pairwise cosine similarity between the CLIP embeddings of the generated image and the prompt that generated it. The second metric, DINO <ref type="bibr" target="#b7">[8]</ref>, measures the average pairwise cosine similarity between the ViT-S/16 DINO embeddings of generated and real images, essentially measuring how similar the generated image is to its real counterpart. The results shown in Table 3 show a slight (although significant for the considered metrics) increase for both the CLIP-T and DINO scores over the baseline model. The lower increase seen in CLIP-T compared to the DINO metric is justified as the model doesn't learn to represent more concepts with our finetuning (so from a CLIP perspective the objects present in the picture are the same), but instead learns to represent them closer to the training images, especially from a camera-distance perspective. From a qualitative analysis, it appears that the fine-tuned model is more often able to generate images that are semantically close to the prompt used to generate them. Sometimes it even generates elements present in the prompt that the baseline model ignored (e.g., a person when two were specified, a car that is not present). 
In addition, since there is no free lunch, although we have not tested it on other tasks, we expect the finetuned model to perform worse on other generative tasks; in the generated examples we can see that it more often generates faces similar to those shown during training.</p><p>As a secondary and ablation study, 600 additional image pairs were generated using the same setup as before, but removing all information regarding the shot type from the text conditioning. Looking at the results of the DINO score in </p></div>
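Both metrics used above reduce to an average pairwise cosine similarity between row-aligned embeddings (CLIP image/text embeddings for CLIP-T, ViT-S/16 DINO image embeddings for DINO). A minimal NumPy sketch of that shared computation:

```python
import numpy as np

def avg_pairwise_cosine(emb_a, emb_b):
    """Mean cosine similarity between row-aligned embedding pairs,
    the computation underlying both the CLIP-T and DINO scores."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    return float((a * b).sum(axis=1).mean())
```

For CLIP-T, `emb_a` holds generated-image embeddings and `emb_b` the prompt embeddings; for DINO, the two sides are generated-image and real-image embeddings.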
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4.">Qualitative Survey</head><p>In addition, we conducted a survey of human subjects. Each subject was shown a total of 36 pairs of images 𝐴 and 𝐵 generated with the same settings and prompt, one from the baseline model and one from the finetuned one. Whether an image was labelled 𝐴 or 𝐵 was randomized. The generated images were lightly screened to ensure they were safe for all viewers. Each image pair was shown along with its associated shot type and generator prompt.</p><p>For each image pair, three questions were asked: (i) Which image do you like best?; (ii) Which image corresponds more to the associated shot type?; (iii) Which image corresponds more to the associated prompt?</p><p>The possible answers for each question were 𝐴, 𝐵, or neither/same if the two images were considered equivalent in some aspect. A total of 55 subjects responded to the survey, and the results are reported in Table <ref type="table" target="#tab_4">5</ref>. It can be seen that even with human evaluation, our approach generates images that are more appealing and closer to the associated shot type and prompt in almost or more than half of the cases. In all aspects except image likability, the baseline model obtained the lowest score of the three possible answers, indicating that in most cases the fine-tuned generation is at least of equal quality to the generation without fine-tuning. The results are consistent when comparing the survey to the CLIP-T and DINO metrics: the higher likability and shot-type closeness are directly related to DINO and are noticeably higher than prompt closeness and CLIP-T compared to the baseline. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusions and Future Developments</head><p>We have presented an approach that uses novel techniques such as DreamBooth and LoRa to finetune an existing latent diffusion model to generate specific shot types.</p><p>Based on the intuition that learning a shot type is similar to learning a style, which DreamBooth was shown to be capable of, we achieve improvements in both prompt compliance and similarity to reference images using only 200 images for each shot type, as shown by CLIP-T, DINO, and even human evaluation metrics. We test our approach on a storyboarding task, showing the potential uses of modern LDMs in video production, mainly when supported by domain-specific training. Furthermore, novel techniques such as ControlNet open the door to even more specific conditioning forms. Developments such as <ref type="bibr" target="#b17">[18]</ref> show the power that ControlNet offers, and applying the technique for cinematic purposes could be an interesting development point. Regarding our work, as DreamBooth training is far from a solved task, more tests could yield even better results.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: A visualization of the finetuning process using LoRa DreamBooth. To create basic captioning that required minimal human work, Blip2 was used. Labels for shot types were added by hand due to the small number of pictures necessary.</figDesc><graphic coords="3,320.20,65.61,203.08,182.04" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Images generated with the prompt "a high-quality close_shot picture of a woman holding a cup of coffee in front of a brick building" at varying 𝛼Δ𝑊</figDesc><graphic coords="4,88.59,427.93,180.51,180.51" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Examples of generating the same subject with the three different fine-tunings (close, medium, and long shot) at different levels of 𝛼</figDesc><graphic coords="6,77.31,65.61,203.06,281.20" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 2</head><label>2</label><figDesc>The parameters used for generation during testing</figDesc><table><row><cell>sampler</cell><cell>DPM++ SDE Karras</cell></row><row><cell>steps</cell><cell>16</cell></row><row><cell>seed</cell><cell>random</cell></row><row><cell>cfg_scale</cell><cell>6</cell></row><row><cell>prompt</cell><cell>a high-quality [shot_type] picture of [caption]</cell></row><row><cell>size</cell><cell>512 x 512</cell></row></table></figure>
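The prompt template in Table 2 can be instantiated mechanically. In the sketch below, the settings dictionary's key names are illustrative (they are not tied to any particular generation API), while the values come from Table 2:

```python
# Generation settings from Table 2 (key names are illustrative)
GEN_SETTINGS = {
    "sampler": "DPM++ SDE Karras",
    "steps": 16,
    "seed": "random",
    "cfg_scale": 6,
    "size": (512, 512),
}

def build_prompt(shot_type: str, caption: str) -> str:
    """Fill the Table 2 template: 'a high-quality [shot_type] picture of [caption]'."""
    return f"a high-quality {shot_type} picture of {caption}"

print(build_prompt("close_shot", "a woman holding a cup of coffee in front of a brick building"))
# → a high-quality close_shot picture of a woman holding a cup of coffee in front of a brick building
```

The shot-type token (e.g. close_shot) is the same identifier the model was fine-tuned on, so the template doubles as the guidance mechanism at inference time.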
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 4</head><label>4</label><figDesc>Results for the CLIP-T and DINO metrics on the ablation test. The images generated with the fine-tuned model still have a higher DINO score than the baseline, indicating that the model generates images at the specific fine-tuning scale even without guidance.</figDesc><table><row><cell></cell><cell>CLIP-T</cell><cell>DINO</cell></row><row><cell>baseline</cell><cell>0.3221</cell><cell>0.4163</cell></row><row><cell>ours</cell><cell>0.3269</cell><cell>0.4989</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3</head><label>3</label><figDesc>Results for the CLIP-T and DINO metrics on the 1500 pairs test.</figDesc><table><row><cell></cell><cell>CLIP-T</cell><cell>DINO</cell></row><row><cell>baseline</cell><cell>0.3214</cell><cell>0.4014</cell></row><row><cell>ours</cell><cell>0.3234</cell><cell>0.4803</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 5</head><label>5</label><figDesc>Results collected from a survey of 52 subjects. The scores are expressed as percentages of the total number of answers.</figDesc><table><row><cell>question</cell><cell>baseline</cell><cell>ours</cell><cell>same /</cell></row><row><cell></cell><cell></cell><cell></cell><cell>neither</cell></row><row><cell>Which picture do you like</cell><cell>26.18</cell><cell>57.43</cell><cell>16.4</cell></row><row><cell>most?</cell><cell></cell><cell></cell><cell></cell></row><row><cell>Which picture is closer to the</cell><cell>20.46</cell><cell>56.84</cell><cell>22.7</cell></row><row><cell>associated shot type?</cell><cell></cell><cell></cell><cell></cell></row><row><cell>Which picture is closer to the</cell><cell>20.35</cell><cell>49.31</cell><cell>30.34</cell></row><row><cell>associated prompt?</cell><cell></cell><cell></cell><cell></cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">Open source for research purposes.</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<ptr target="https://stability.ai/blog/stable-diffusion-public-release" />
		<title level="m">Stability AI, Stable diffusion release blog post</title>
				<imprint>
			<date type="published" when="2022">2022 (accessed 23-May-2023)</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">N</forename><surname>Ruiz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Jampani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Pritch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Rubinstein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Aberman</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2208.12242</idno>
		<title level="m">Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><forename type="first">R</forename><surname>Gal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Alaluf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Atzmon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Patashnik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">H</forename><surname>Bermano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Chechik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Cohen-Or</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2208.01618</idno>
		<title level="m">An image is worth one word: Personalizing text-to-image generation using textual inversion</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Ho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Jain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Abbeel</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2006.11239</idno>
		<title level="m">Denoising diffusion probabilistic models</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">R</forename><surname>Rombach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Blattmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Lorenz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Esser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Ommer</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2112.10752</idno>
		<title level="m">High-resolution image synthesis with latent diffusion models</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Watching more closely: Shot scale affects film viewers&apos; theory of mind tendency but not ability</title>
		<author>
			<persName><forename type="first">B</forename><surname>Rooney</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">E</forename><surname>Bálint</surname></persName>
		</author>
		<idno type="DOI">10.3389/fpsyg.2017.02349</idno>
	</analytic>
	<monogr>
		<title level="j">Frontiers in Psychology</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hallacy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ramesh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Goh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sastry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Askell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mishkin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Krueger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2103.00020</idno>
		<title level="m">Learning transferable visual models from natural language supervision</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Caron</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Touvron</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Misra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Jégou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Mairal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Bojanowski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Joulin</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2104.14294</idno>
		<title level="m">Emerging properties in self-supervised vision transformers</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<author>
			<persName><forename type="first">E</forename><surname>Molad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Horwitz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Valevski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">R</forename><surname>Acha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Matias</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Pritch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Leviathan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Hoshen</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2302.01329</idno>
		<title level="m">Dreamix: Video diffusion models are general video editors</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<author>
			<persName><forename type="first">U</forename><surname>Singer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Polyak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Hayes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Yin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>An</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Ashual</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Gafni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Parikh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gupta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Taigman</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2209.14792</idno>
		<title level="m">Make-a-video: Text-to-video generation without text-video data</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m">Storyboarder</title>
		<ptr target="https://wonderunit.com/storyboarder/" />
		<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<ptr target="https://www.storyboardthat.com/" />
		<title level="m">Storyboardthat</title>
				<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<title level="m">Studiobinder</title>
		<ptr target="https://www.studiobinder.com/storyboard-creator/" />
		<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Gan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Cheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Carin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Carlson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gao</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1812.02784</idno>
		<title level="m">Storygan: A sequential conditional gan for story visualization</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<author>
			<persName><forename type="first">I</forename><forename type="middle">J</forename><surname>Goodfellow</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Pouget-Abadie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mirza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Warde-Farley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ozair</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Courville</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bengio</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1406.2661</idno>
		<title level="m">Generative adversarial networks</title>
				<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<title level="m" type="main">Dynamic storyboard generation in an engine-based virtual environment for video production</title>
		<author>
			<persName><forename type="first">A</forename><surname>Rao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Jin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Dai</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2301.12688</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<author>
			<persName><forename type="first">L</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Agrawala</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2302.05543</idno>
		<title level="m">Adding conditional control to text-to-image diffusion models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<author>
			<persName><forename type="first">E</forename><surname>Chu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S.-Y</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-C</forename><surname>Chen</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2305.19193</idno>
		<title level="m">Video controlnet: Towards temporally consistent synthetic-to-real video translation using conditional image diffusion models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">J</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Wallis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Allen-Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Chen</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2106.09685</idno>
		<title level="m">Lora: Low-rank adaptation of large language models</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Savarese</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hoi</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2301.12597</idno>
		<title level="m">Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">R</forename><surname>Cloneofsimo</surname></persName>
		</author>
		<ptr target="https://github.com/cloneofsimo/lora" />
		<title level="m">lora</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
