
A First-Order Theory of Film Scores for Generation from Lightweight Specifications

Halley Young, Department of Computer Science, University of Pennsylvania, Pennsylvania, PA 19104, USA

This paper proposes a formal theory of the way film scores operate for the purpose of enabling semiautomatic generation. Among the contributions are a formalization of the entire generation process as a bit-vector-array satisfiability problem, an approach to music generation not taken in many previous papers. The paper also formalizes the idea of "thematic" and "stylistic" time-dependent variables and their inherited constraints in specification-driven generation. In order to make the result more coherent, the paper formalizes a regular-expression-like grammar of melodic contour. Synthesizing all of these contributions, the result is a program which can take a lightweight specification of the relevant information in each scene of a film, and produce a coherent and appropriate score to accompany it.

Introduction

The film industry turns out over $200 billion of films every year, and a substantial portion of that is spent creating appealing film scores ((?)). According to studies by Stuart Fischoff, himself both a film writer and scholar of media psychology, music scores to films account for much of our understanding of the emotional impact of film scenes as well as characterizations of different characters, locations, and events ((?)). However, there has not been substantial research on formalizing the way that film scores produce these effects. While there has been some research on producing film scores semi-automatically, these approaches are either statistical (and suffer the same limitations as most deep-learning-based music, such as lack of memorable material or global structure), or don't include a background theory of film composition, and thus require extensive manual specification by the composer ((?) (?)). This study proposes a formal theory of film music, written in a decidable fragment of first-order logic. The theory allows for the generation of appropriate film music from lightweight annotations.

Algorithm Input and Output

We propose an algorithm for generating film scores from lightweight annotations. The input to this algorithm is a specification. A specification contains an arbitrary number of lines, each of which contains a list of variables, a specified duration in seconds, and, optionally, a description of the scene (used only for documentation and not by the algorithm). Variables can either be stylistic variables (defined in the theory of film, as per the appendix), or thematic variables (only defined in the universe of the specific film). For instance, in the following specification of a short manufactured example, "Jose" (a character) is a thematic variable, while "flamenco" (a well-defined style of music found in Spain) is a stylistic variable, as are "suspense" and "happy":

Jose grew up in Spain. {Jose, Flamenco} 8
Then he moved to New Orleans. {Jose, Zydeco} 10
It was there that he met Sally. {Sally, Zydeco} 4
Sally was the most beautiful person he'd ever met. {Sally, romantic} 6
They got married and moved back to Spain, where they had a child. {Jose, Sally, child, Flamenco} 6
But then an alien invasion came, and infected the child. {child, horror, alien, suspense} 8
In the end, Jose and Sally ended up having to go to space to beg the alien king to save their child. {Jose, Sally, child, alien, suspense} 8
The alien king was touched by their plea, and their child was saved. {suspense, happy, alien, child} 6
They all lived happily ever after. {sally, jose, child, happy} 6

The output of the algorithm is music which conforms to this specification, in that, for a sequence of variable lists $S_1, \dots, S_n$ and durations $a_1, \dots, a_n$, the variables in $S_i$ are present in the music from timestamp $\sum_{j=0}^{i-1} a_j$ to timestamp $\sum_{j=0}^{i} a_j$ (taking $a_0 = 0$). (A variable being "present" is defined below.) Furthermore, the generated music is "musically coherent" (also defined below).
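The mapping from a specification to time spans can be illustrated with a minimal Python sketch. It assumes one scene per line in the form "description {variables} seconds"; the regular expression and the names `parse_spec` and `time_spans` are illustrative assumptions, not part of the system described here.

```python
import re
from itertools import accumulate

# Assumed line format: "<optional description> {var, var, ...} <seconds>"
LINE = re.compile(r"^(?P<desc>.*?)\s*\{(?P<vars>[^}]*)\}\s*(?P<secs>\d+)\s*$")

def parse_spec(text):
    """Return one (variables, duration) pair per scene; descriptions are ignored."""
    scenes = []
    for line in text.strip().splitlines():
        m = LINE.match(line.strip())
        if m is None:
            raise ValueError(f"malformed specification line: {line!r}")
        names = frozenset(v.strip() for v in m.group("vars").split(",") if v.strip())
        scenes.append((names, int(m.group("secs"))))
    return scenes

def time_spans(scenes):
    """Scene i starts at the sum of all earlier durations and ends after its own."""
    ends = list(accumulate(duration for _, duration in scenes))
    starts = [0] + ends[:-1]
    return list(zip(starts, ends))

SPEC = """
Jose grew up in Spain. {Jose, Flamenco} 8
Then he moved to New Orleans. {Jose, Zydeco} 10
It was there that he met Sally. {Sally, Zydeco} 4
"""
scenes = parse_spec(SPEC)
spans = time_spans(scenes)
print(spans)  # [(0, 8), (8, 18), (18, 22)]
```

Each span is the interval during which the variables of the corresponding scene must be present in the music.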

Building on SMT solvers - the Set-Theoretic Universe of Film Score Theory

Generating the basic building blocks of a film score involves determining a set of values of "mid-level" musical variables at every moment in time. Some of these properties are completely independent (rhythmic density and harmonic progression), while others have mutual constraints (the number of rhythmic values must be the same as the number of pitch values). Some properties depend on the duration of time in which they are being used (for example, it's not realistic to have a full Andalusian progression in 2 seconds or less).

Musical Types and Values

The mid-level universe consists of a set $T = \{t_0, \dots, t_n\}$ of types of mid-level variable, such as "harmonic rhythm" (the rate at which the chords change), "rhythmic density" (the average duration of a single note in the melody), or "has unpitched percussion". These types are associated with their range of values, which can be boolean (e.g. whether or not there is an ostinato), a bounded integer (e.g. the degree of tension), a finite set (e.g. the list of possible chord progressions), or a list of bounded integers of bounded size (e.g. the melodic contour). In principle this can be extended to bounded floating point numbers, but for simplicity bit-vectors were used. Boolean variables are modeled simply as Boolean SMT variables, bounded integers are modeled as bit-vectors, sets are modeled as 1-hot bit-vector variables, and lists of integers are modeled as a tuple of a fixed-size array of bit-vectors of some maximal size $k_{max}$ and a value $k_{actual} < k_{max}$ such that all entries with indices $> k_{actual}$ are ignored.
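The encodings described above can be mimicked in plain Python to make the conventions concrete. This is only a sketch: Python integers and lists stand in for SMT bit-vectors and arrays, and the helper names and the example domain `PROGRESSIONS` are hypothetical.

```python
def one_hot(value, domain):
    """Encode a member of a finite set as a 1-hot bit-vector (held in an int)."""
    return 1 << domain.index(value)

def is_one_hot(bits, width):
    """True when exactly one of the low `width` bits is set."""
    return 0 < bits < (1 << width) and bits & (bits - 1) == 0

def encode_list(values, k_max, bit_width):
    """Pad a bounded list to a fixed-size array and record k_actual.

    k_actual is the last meaningful index, so k_actual < k_max and every
    entry with an index greater than k_actual is ignored.
    """
    if not values or len(values) > k_max:
        raise ValueError("list must have between 1 and k_max entries")
    if any(not 0 <= v < (1 << bit_width) for v in values):
        raise ValueError("entry does not fit in the bit width")
    k_actual = len(values) - 1
    return list(values) + [0] * (k_max - len(values)), k_actual

def decode_list(array, k_actual):
    """Drop the ignored padding entries."""
    return array[: k_actual + 1]

PROGRESSIONS = ["I-IV-V", "Andalusian", "chromatic"]  # hypothetical finite domain
print(one_hot("Andalusian", PROGRESSIONS))            # 2, i.e. 0b010
print(encode_list([0, 2, 1], k_max=8, bit_width=4))   # ([0, 2, 1, 0, 0, 0, 0, 0], 2)
```

In an actual SMT encoding the 1-hot property would be asserted as a constraint on the bit-vector rather than checked after the fact.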

Time-Span Sets and Thematic Sets

Consider a specification with m scenes and n total thematic variables in all of the scenes (n can easily be obtained by counting the number of unique elements of each $S_i$ that are not designated as "stylistic" by the theory). Under these assumptions, there will be a set $M_0, \dots, M_m$ of sets of mid-level properties corresponding to each scene. These sets will be total in the sense that for every musical type t in T, one possible value of t will be in $M_i$. There will also be a set $N_0, \dots, N_n$ of mid-level properties corresponding to each theme. However, these sets will not be total: some sets may, for instance, include one element of the set of possible chord progressions but no possible value of rhythmic density, while another may contain a possible valuation of rhythmic density but not of chord progressions. This is because, as film music scholar Andrew Powell acknowledges, a leitmotif (the musical elements that together make up a "theme", or a memorable gestalt which can appear in various versions but still be recognizable) can be one of a variety of musical markers which "serve[s] to distinguish a character, idea, or symbol", rather than one or several necessary and sufficient musical characteristics ((?)).

Axioms of the Formal Theory

Axioms of the formal theory relate the specification of a film score to constraints on its attributes. These constraints include formal definitions of the high-level stylistic elements that are described in annotations, rules regarding the existence of a well-defined leitmotif when appropriate, some basic rules which establish musical coherence, and ontological claims which relate to the logically necessary relationship between various variables.

Stylistic

Stylistic axioms tie the stylistic definitions assumed by annotators to mid-level variables. Unfortunately, there is a shortage of academic papers on the specific musical attributes associated with broader words encompassing ideas such as genre or emotion. Where possible, I took definitions from academic sources, including ((?) (?)). However, in some instances it was necessary to simply survey the non-academic resources available ((?)).

Below are a few of the definitions used:

1. "Flamenco" style implies at least three of: the use of a flamenco percussive pattern, the use of castanets and clapping as percussive instruments, the use of guitar, the use of phrygian mode, and the use of an Andalusian chord progression.

2. "Horror" implies at least three of: existence of a repeating ostinato, use of dissonance, use of a chromatic chord progression or chord transformation, use of low register.

3. "Jazz" implies: the use of a stereotypically jazz percussive pattern, use of dominant seventh chords, and use of a high level of syncopation.

4. "Happy" implies: the use of a major scale, and either the use of a fast tempo or the use of a high register.
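The four definitions above can be read as predicates over a scene's mid-level assignment. The following Python sketch illustrates that reading; the dictionary keys (such as `percussion_pattern` or `register`) and the function names are assumptions chosen for illustration and do not reflect the variable names of the actual theory.

```python
def at_least(k, conditions):
    """True when at least k of the given conditions hold."""
    return sum(bool(c) for c in conditions) >= k

def flamenco(m):
    return at_least(3, [
        m.get("percussion_pattern") == "flamenco",
        m.get("castanets_and_clapping", False),
        m.get("guitar", False),
        m.get("mode") == "phrygian",
        m.get("progression") == "andalusian",
    ])

def horror(m):
    return at_least(3, [
        m.get("ostinato", False),
        m.get("dissonance", False),
        m.get("progression") == "chromatic",
        m.get("register") == "low",
    ])

def jazz(m):
    return bool(m.get("percussion_pattern") == "jazz"
                and m.get("dominant_sevenths", False)
                and m.get("syncopation") == "high")

def happy(m):
    return m.get("scale") == "major" and (
        m.get("tempo") == "fast" or m.get("register") == "high")

STYLE_AXIOMS = {"Flamenco": flamenco, "Horror": horror, "Jazz": jazz, "Happy": happy}

def violated_styles(requested, m):
    """Styles requested for a scene whose implied constraints the assignment m breaks."""
    return [s for s in requested if not STYLE_AXIOMS[s](m)]

scene = {"guitar": True, "mode": "phrygian", "progression": "andalusian", "register": "low"}
print(violated_styles(["Flamenco", "Horror"], scene))  # ['Horror']
```

In the SMT setting the same conditions would be asserted as constraints for the solver to satisfy, rather than evaluated on a finished assignment.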

Note that in practice, these definitions do not ensure the desired feeling, and indeed in the examples it can be difficult to discern exactly what style is being evoked. However, they can be thought of as probably necessary conditions, such that $P(\mathrm{style}(m) = x \mid m \models \phi)$, where $x$ is a style and $\phi$ are its related constraints, is much higher than $P(\mathrm{style}(m) = x \mid m \not\models \phi)$. More research is necessary to determine other variables which would increase $P(\mathrm{style}(m) = x)$ for various styles.

Ontological

Ontological constraints are constraints which are inherent to the meaning of the different mid-level variables. For instance, in order for a scene to have "a violin playing the accompaniment", it is necessary both for the scene to be accompanied and for the scene to contain a violin; in order for a scene to have an "Andalusian cadence" (a specific pattern defined under the rules of 12-tone tuning and undefined for tunings where $n \neq 12$), it is necessary for the tuning to be 12-tone. To be precise, ontological constraints occur when there exist two variable assignments, $\hat{\phi}$ and $\hat{\theta}$, such that there is no possible music satisfying the condition $\hat{\phi} \wedge \neg\hat{\theta}$. In an end-to-end system, where the entire music generation could be described as a single SMT instance, if $\hat{\phi}$ were constrained to be true then the system would necessarily return a result such that $\hat{\theta}$; however, due to tractability issues discussed elsewhere, it was necessary to decouple the generation of "mid-level" and "low-level" variables. Therefore, the system has no a-priori knowledge that the variable describing "Andalusian cadence" is dependent on the variable describing "12-tone", and such dependencies must be stated explicitly as axioms.

Leitmotivic/Thematic

The second type of constraint concerns making sure that leitmotifs are associated with the correct theme. In a major simplification, we assume that the specification can be cleanly separated into stylistic and thematic variables, so, while "the alien" might only occur in sci-fi scenes, it is not itself assumed a-priori to have different defining properties than "the cowboy" (although the cowboy may only occur in scenes which are designated as "Western", and thus will also in effect be associated with this genre). The thematic variables impose additional constraints on the mid-level variables associated with each time span, as a musical element can only represent a theme if it is present in every scene where the theme is included in the specification, and not present in any scene where the theme is not included in the specification. In addition, themes must be unique, and must be noticeable. In a simplification, this is expressed as the following constraint on the thematic sets $N_0, \dots, N_n$: the sets must be completely disjoint, and each must contain at least two elements. Thus, for any two given themes a and b, either the value of type t associated with a is different from the one associated with b, or a includes a value of type t while b does not.
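As an illustration of the thematic constraints just described, the sketch below checks them on already-chosen values. It assumes each mid-level element is represented as a (type, value) pair; the function names and the example themes are hypothetical.

```python
from itertools import combinations

def themes_well_formed(theme_sets):
    """Every thematic set has at least two elements and the sets are pairwise disjoint."""
    big_enough = all(len(s) >= 2 for s in theme_sets.values())
    disjoint = all(a.isdisjoint(b) for a, b in combinations(theme_sets.values(), 2))
    return big_enough and disjoint

def themes_respected(theme_sets, scene_themes, scene_values):
    """A theme's elements occur in a scene exactly when the theme is listed for it."""
    for listed, values in zip(scene_themes, scene_values):
        for name, elements in theme_sets.items():
            if name in listed and not elements <= values:
                return False  # theme requested but some marker is missing
            if name not in listed and elements & values:
                return False  # marker leaks into a scene without the theme
    return True

jose = {("contour", (0, 2, 1)), ("lead_instrument", "guitar")}
sally = {("progression", "I-IV-V"), ("ostinato", True)}
themes = {"Jose": jose, "Sally": sally}

scene_themes = [{"Jose"}, {"Sally"}, {"Jose", "Sally"}]
scene_values = [
    jose | {("progression", "andalusian"), ("ostinato", False)},
    sally | {("contour", (0, 1)), ("lead_instrument", "accordion")},
    jose | sally,
]
print(themes_well_formed(themes), themes_respected(themes, scene_themes, scene_values))
# True True
```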

Axioms of Contour - Creating Coherent Motivic Material

The axioms regarding melodic contour require their own section, as they are slightly more complex. As discussed above, rhythmic and melodic contours are lists of bounded size of bounded integers. A tuple of a rhythmic contour and a melodic contour forms a "motivic contour," and this tuple is one of the variables which can be assigned to a theme. Furthermore, an ordered list of motivic contours is assigned to each scene, with the number of elements roughly correlating to the duration of each scene. Thus, a scene can contain individual motivic contours corresponding to multiple themes if the scene duration is large enough.

A rhythmic or pitch contour is a list of numbers $C = c_0, \dots, c_k$ such that, if $x \in C$, then $\forall\, 0 \le j < x$, $j \in C$. The rhythmic (or pitch) contour is associated with the following constraint on the rhythmic values $r_0, \dots, r_k$: $\forall i \le k$, $\forall j \le k$, if $c_i < c_j$ then $r_i < r_j$, if $c_i > c_j$ then $r_i > r_j$, and if $c_i = c_j$ then $r_i = r_j$. Thus, the contour restricts the relative size of the different durations without restricting absolute sizes or even size ratios. This is a standard definition among modern music theorists (?).
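Both halves of this definition can be checked directly. The following Python sketch, with hypothetical function names, tests whether a list is a well-formed contour and whether a list of concrete values (durations or pitches) respects it.

```python
def is_valid_contour(contour):
    """Every level below the maximum must occur, so the values are exactly 0..max."""
    return bool(contour) and set(contour) == set(range(max(contour) + 1))

def respects_contour(contour, values):
    """The pairwise ordering of `values` must mirror that of `contour`."""
    if len(contour) != len(values):
        return False

    def sign(a, b):
        return (a > b) - (a < b)

    n = len(contour)
    return all(
        sign(contour[i], contour[j]) == sign(values[i], values[j])
        for i in range(n)
        for j in range(n)
    )

print(is_valid_contour([1, 0, 1, 2]))                          # True
print(is_valid_contour([0, 2]))                                # False (level 1 is missing)
print(respects_contour([1, 0, 1, 2], [0.5, 0.25, 0.5, 1.0]))   # True
print(respects_contour([1, 0, 1, 2], [0.5, 0.25, 0.75, 1.0]))  # False (equal levels, unequal values)
```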

Why Motivic Contours?

Most accounts of melodic ideas involve specific pitches and rhythms. Arnold Schoenberg was perhaps the first to explicitly promote the motivic contour as a core component of a musical idea; however, uses of constant contoural structures over changing pitches have been an element of Western music at least since Bach ((?) (?)). For this algorithmic approach, it is useful to cleanly divide between contoural structure and the specific pitch and rhythmic elements so that both can be constrained and assigned values independently. For instance, a $[0, 2, 1]$ pitch contour can be associated with any type of scale or chord, thus increasing the size of the possibility space by an order of magnitude. It could be fulfilled by a pitch sequence [C4, G4, E4] (a major triad, or one particular kind of harmony), or [C4, G4, D4] (a sus4 triad, or a harmony with a different emotional valence).
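To make the example concrete, the sketch below computes the contour of a pitch sequence by ranking its distinct values. Representing pitches as MIDI note numbers is an assumption made for illustration, as is the function name.

```python
def contour_of(values):
    """Replace each value by the rank of that value among the distinct values."""
    ranks = {v: r for r, v in enumerate(sorted(set(values)))}
    return [ranks[v] for v in values]

C4, D4, E4, G4 = 60, 62, 64, 67  # MIDI note numbers

print(contour_of([C4, G4, E4]))  # [0, 2, 1]
print(contour_of([C4, G4, D4]))  # [0, 2, 1]
```

Both pitch sequences share one contour, which is why fixing the contour still leaves the harmony free to vary.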

Grammatical Contours

I introduce a language for describing valid grammatical contours. This language is based on prior work by cognitive and computational musicologists. It can be used to enumerate a sequence of contoural values which is more likely to sound "musical" than a contoural sequence generated by randomly choosing numbers on a given interval in $\mathbb{Z}$. Readers can subjectively compare tunes generated by the two methods by going to https://www.seas.upenn.edu/~halleyy/random-and-verified-melodies.

Note that this grammar assumes that the user wants the theme to be coherent and uphold the sort of contoural constraints seen in pre-20th century music. This is not always the case for avant-garde music, nor for film music in general. An extension of this work would eliminate the contoural grammar in very specific scenarios where doing so would create the appropriate effect.

Musicological Antecedents of the Contoural Grammar

The development of the contoural grammar draws on work by authors including Larson, Narmour, Ockelford, and Meredith, all of whom developed melodic theories ((?) (?) (?) (?)). Central to all of their work (whether under the label of "inertia" in Larson's physics-based theory or as "compressibility" in Meredith's computational theory) is the idea of a necessary degree of repetition and controlled variation, including several stereotyped methods of variation. Larson also introduces other operators such as "gravity", or the idea that after a leap a pitch should tend to fall down, which will be incorporated into the contoural grammar.

The Rhythmic and Melodic Contoural Languages as Interpreted Subsets of Regular Grammars

The contoural languages take the following form: a list of values (with an optional repetition exponent), references, and transformations on references, followed by a list of reference valuations. The reference valuations are lists of values, with an optional repetition exponent.

Examples of Famous Works in Regex form

The main theme of Mozart's 25th piano concerto is one of the most beloved melodies in classical music. Below is the pitch contour of the theme:

$[0, 0, 0, 1, 1, 2, 2, 3, 1, 3, 5, 4, 4, 3, 3, 2, 2, 2, 2, 3, 3, 4, 4, 5, 1, 3, 5, 3, 3, 2, 2, 1]$

This can be read as the interpretation of the following regex:

(012(ui1)(m0)(uu1)2(i1)) ((0, 0, 0), (1, 1, 2, 2, 3), (1, 3, 5))

In English, this can be interpreted as "the first pattern (0,0,0) followed by the second pattern (1,1,2,2,3) followed by the third pattern (1,3,5) followed by the second pattern transposed to start at the current value and inverted (4,4,3,3,2) followed by the first pattern transposed up two levels (2,2,2) followed by the second pattern transposed up two levels (3,3,4,4,5) followed by the third pattern followed by the inverted second pattern (3,3,2,2,1)".
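The English reading can be replayed mechanically. The Python sketch below rebuilds the 32-value contour from the three reference patterns using two hypothetical helpers, `transpose_to` and `invert_from`. These are illustrative stand-ins rather than the operators of the contoural language itself, and the starting level of each transformed pattern is supplied by hand, read off the contour above.

```python
def transpose_to(pattern, start):
    """Shift the whole pattern so that it begins on `start`."""
    shift = start - pattern[0]
    return [v + shift for v in pattern]

def invert_from(pattern, start):
    """Mirror every interval of the pattern, beginning on `start`."""
    return [start - (v - pattern[0]) for v in pattern]

P0, P1, P2 = [0, 0, 0], [1, 1, 2, 2, 3], [1, 3, 5]

theme = (
    P0                      # 0 0 0
    + P1                    # 1 1 2 2 3
    + P2                    # 1 3 5
    + invert_from(P1, 4)    # 4 4 3 3 2
    + transpose_to(P0, 2)   # 2 2 2
    + transpose_to(P1, 3)   # 3 3 4 4 5
    + P2                    # 1 3 5
    + invert_from(P1, 3)    # 3 3 2 2 1
)
print(len(theme))  # 32
```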

Similarly, the rhythmic contour of the main theme of Smetana's Moldau has the following pattern: This can be read as the interpretation of the following regex: In English, this can be interpreted as "the first pattern (itself a tri-fold repetition of a simple pattern), followed by a tri-fold repetition of the value 2, followed by the second pattern, followed by a retrograde of the first pattern, followed by the second pattern with the third value augmented."

Necessary Constraints on Melodic Contours - Axioms of "Coherence"

According to several authorities cited above, coherent music must involve a substantial (but not excessive) amount of repetition, and specifically varied repetition. It is thus necessary to constrain the valuations of the regex. The following constraints were imposed:

1. If the melody is of sufficient length (> 6 seconds), each of the patterns has to either be used in its original form at least once and then used in some other form twice, or used twice in its original form.

2. If the melody is very short (< 4 seconds), only one pattern can be used, and if it is somewhat short (< 6 seconds), only two patterns can be used.
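One possible reading of these two rules is sketched below in Python. The function names, the `(original, varied)` usage counts, and the default cap `k_max` for longer melodies are assumptions; the text does not state a pattern limit for melodies of 6 seconds or more, nor how a melody of exactly 6 seconds is treated.

```python
def max_patterns(duration_seconds, k_max=4):
    """Upper bound on the number of distinct reference patterns in a melody."""
    if duration_seconds < 4:
        return 1
    if duration_seconds < 6:
        return 2
    return k_max  # assumed cap; no explicit limit is given for longer melodies

def repetition_ok(uses):
    """uses maps a pattern index to (times in original form, times in varied form)."""
    return all(
        (original >= 1 and varied >= 2) or original >= 2
        for original, varied in uses.values()
    )

def coherent(duration_seconds, uses):
    """Apply the pattern-count limit, plus the repetition rule for long melodies."""
    if len(uses) > max_patterns(duration_seconds):
        return False
    if duration_seconds > 6:
        return repetition_ok(uses)
    return True

print(coherent(8, {0: (1, 2), 1: (2, 0)}))  # True
print(coherent(8, {0: (1, 1)}))             # False: pattern 0 is not repeated enough
print(coherent(3, {0: (1, 0), 1: (1, 0)}))  # False: two patterns in a very short melody
```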

In addition, as per Larson's description of the consequences of "gravity" and "inertia", there was a constraint on what can follow a leap (contoural values $x_i$ and $x_{i+1}$ such that $|x_{i+1} - x_i| > 4$), namely that a leap has to be followed by a "step", $|x_{i+2} - x_{i+1}| < 2$, in the opposite direction.
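The leap rule can be checked with a short scan over the contour. In the sketch below the function name is hypothetical, and a leap on the final interval is assumed to be acceptable because nothing follows it.

```python
def leaps_resolved(contour, leap=4, step=2):
    """Every leap (|difference| > leap) must be followed by a step
    (|difference| < step) that moves in the opposite direction."""
    for i in range(len(contour) - 1):
        jump = contour[i + 1] - contour[i]
        if abs(jump) <= leap:
            continue
        if i + 2 >= len(contour):
            continue  # assumption: a leap at the very end is unconstrained
        following = contour[i + 2] - contour[i + 1]
        opposite = following * jump < 0
        if not (opposite and abs(following) < step):
            return False
    return True

print(leaps_resolved([2, 1, 0, 5, 4, 3]))     # True: the leap 0 -> 5 is followed by 5 -> 4
print(leaps_resolved([4, 3, 2, 1, 0, 5, 5]))  # False: the leap is followed by a repeated value
```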

Note that the only tested melodies were at most 15 seconds long (about 8 bars, which is typical of an antecedent-consequent style Classical theme), which significantly reduced the possible complexity of the melodies. More research is necessary in order to achieve coherence across larger time spans.

Mid-level materials to notes

To generate notes from the values (including contoural values and mid-level variables) output by the SMT solver in the first pass, further satisfiability and constrained optimization problems were constructed. First, generating a rhythm from the rhythmic contour was framed as constrained optimization: the contoural constraints had to hold for all i and j (length($x_i$) > length($x_j$) if and only if the ith contoural value was greater than the jth), the total length of the rhythmic pattern was constrained to be the specified length of the scene, and an optimization was sought that maximized the sense of meter. After the rhythm was generated, melody was generated with constraints maintaining contoural values and definitions of chord progressions and scales (each pitch modulo 12 has to be either in the respective chord, in the respective scale and in between two notes less than 3 semitones apart, or in between two notes less than two semitones apart). Finally, the accompaniments were chosen to match the instrumentation, dissonance level, thickness, spacing, etc. of the other mid-level variables.
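The rhythm step can be illustrated with a small exhaustive search in place of a solver. This sketch makes several assumptions not taken from the text: durations are counted in sixteenth notes, they are drawn from a fixed menu `allowed`, and the "sense of meter" is approximated by counting onsets that fall on a beat. It swaps brute-force enumeration in for the constrained optimization described above.

```python
from itertools import combinations

def rhythm_from_contour(contour, total, allowed=(1, 2, 3, 4, 6, 8, 12, 16), beat=4):
    """Pick one duration per contour level so that higher levels get strictly
    longer durations, the durations sum to `total`, and as many onsets as
    possible land on a beat. Returns None when no assignment fits."""
    levels = max(contour) + 1
    best, best_score = None, -1
    # combinations() of a sorted menu yields strictly increasing tuples,
    # so durations[level] preserves the ordering required by the contour.
    for durations in combinations(sorted(allowed), levels):
        rhythm = [durations[c] for c in contour]
        if sum(rhythm) != total:
            continue
        onsets = [sum(rhythm[:i]) for i in range(len(rhythm))]
        score = sum(1 for t in onsets if t % beat == 0)
        if score > best_score:
            best, best_score = rhythm, score
    return best

print(rhythm_from_contour([0, 0, 1], total=8))  # [2, 2, 4]
```

With the contour [0, 0, 1] and a total of 8 sixteenths, the search prefers two eighth notes and a quarter note over [1, 1, 6] because more of its onsets fall on a beat.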

Restriction sequences and tractability

The method of determining first the mid-level variables, then rhythmic values, then pitch values, and finally accompaniment and timbral features in a series of disjoint SMT instances suggests an interesting avenue of research. In the first implementation, features were to be generated all at once in a single constrained optimization instance. However, the search space was apparently far too large for the optimization to terminate. Even the difference between separating duration and pitch generation vs. determining them together was decisive in determining feasibility (separate runs proved tractable while joint generation was not). One could understand the ordering of variables to be synthesized as a sequence of operations, each of which further restricts the search space over all possible melodies.

Empirical Results

Generation from random scores

A suite of 30 random film specifications was generated by assigning fixed probability distributions over seeing a given style over each scene's timespan, joint probabilities over theme variables, and a fixed distribution of number of scenes. According to this analysis, 20.0% of the randomly generated film specifications were satisfiable. In contrast, all three handcrafted synthetic film specifications and all three handcrafted specifications for existing films were satisfiable. This discrepancy suggests that the distribution of styles and thematic material in real films is non-uniform.

Generation from hand-crafted film and specification

Three stories of lengths 27-62 seconds were handcrafted for the sake of this research. They were intended to be realistic, but also erred on the side of having a large amount of thematic and stylistic variety. Generation took an average of 189.6 seconds. Each of these film specifications had a satisfying generation. The reader is free to evaluate the results at https://www.seas.upenn.edu/~halleyy/synthetic-film-score-generation. In particular, it is worth noting the drastic difference in quality between the example where a composer manually wrote the piece but constrained herself to use the generated mid-level variables (example 1), and the examples where the end-to-end system was used. This suggests that the mid-level generation may be more robust than the middle-to-low-level system.

Generation from lightweight annotation of existing film

Three short film clips of length 28-62 seconds were chosen, and specifications were written for each. In practice, it took less than five minutes to create each specification, suggesting that specification creation itself is not a limiting factor. Each of these film specifications had a satisfying generation. The reader is free to evaluate these results at https://www.seas.upenn.edu/~halleyy/real-film-score-generation.

Future Work

One of the appealing features of this approach is the lightweight nature of the specifications - it took less than 5 minutes for the author to write up the annotations for a real-world 46-second scene, which according to several internet forums would be viewed by a professional composer as a task deserving of $375-750 ((?)). However, the opportunity to expand the specification could decrease the gap in expressivity between music generated automatically and music written by professional composers. For instance, in this approach there is a clear and simple distinction between stylistic and thematic variables. However, in a more expressive language, it would be possible to explicitly associate certain properties with characters as well as how those properties change over the course of the film. In addition, several film theorists have suggested that leitmotifs can be changed in a very deliberate manner through the span of the movie so as to suggest character development, a very important expressive possibility that is completely absent in this work.

Long-term structure is notoriously hard in film music, and in music in general. This approach incorporates long-term structure in that there are recurring leitmotifs and in that each individual scene is scored using a principled musicological approach, but the sense of continuity between scenes is still significantly less than one would typically find in most music. An improvement on this approach would be to include constraints on the distances between subsequent scenes, although the nature of these constraints is not obvious.

Due to the nature of SMT solvers, the valuations of each mid-level variable are not independent across executions of this algorithm (even on completely different scripts). This is a deficit because, as discussed above, composers ideally would like their material to sound relatively unique. Furthermore, the algorithm is not stochastic, as the output is determined by the heuristics used by the SMT solver. Thus, it is difficult to obtain a diverse list of possible outputs from a single specification. The most obvious solution to this is to use a UniformSAT module to maximize the independence between executions. However, at this time the number of SAT clauses is too large to apply UniformSAT. Optimizations which either reduce or modularize the number of SAT clauses could make this approach feasible, which would drastically increase the appeal of this algorithm.

Areas for Collaboration

As mentioned above, there is not sufficient academic literature on what makes something sound "underwater" or "eerie." Collaborating with music experts could prove useful in developing more precise and accurate definitions, as could partnering with HCI experts who work on learning conditional user preferences.

In particular, collaborating with film composers could provide much needed feedback on the approach as well as benchmarks to compare to and specific advice on areas for improvement within the algorithm.

Collaborations with experts in SMT solving could prove as rewarding as the interactions with film composers. This is because the approach is fundamentally limited by what is tractable to compute, and significant sacrifices were made in the name of efficiency (namely deciding rhythmic contour/harmonic progression, rhythm, and pitch as three separate steps). If it were tractable to produce end-to-end systems, we would avoid issues such as pitches and rhythms being unable to fit a given harmonic progression well.

Conclusion

In conclusion, this paper proposes a logical theory of film scoring, as well as a theory for creating and verifying coherent melodic contours. Empirical studies suggest that film scores do have significant structure and that this method may be promising. User studies in the future could enhance the impact of this research.

List of Stylistic Terms

1. Suspense (defined by a high tension level, resulting from some combination of having an ostinato, tremolos, dynamic contrast, chromaticism, rising contour, and dissonance)

2. Relaxed (defined by having a low tension level)

3. Zydeco (defined by having three of the following: flamenco percussion, guitar, Andalusian cadence, and phrygian mode)

4. Americana (defined by using 3 of 4 of major scale, harmonica, I-IV-V progression, and washboard drum pattern, as well as lack of synth-based sounds)

5. Sci-fi (defined by having two of three of synth-based sounds, ostinatos, and modes of limited transposition)

6. Jazz (defined by having dominant seventh chords, a jazz-kit-based rhythm, and electric guitar or other stereotypically jazz instruments)

7. Romance (defined by 3 of 4 of major key, moderate rhythm, high pitch, string instruments)

8. Happy (defined by 3 of 4 of major key, fast rhythm, high pitch, consonance)

9. Sad (defined by 3 of 4 of minor key, slow rhythm, low pitch, dissonance)

10. Underwater (defined by use of marimba or whole-tone scale)