<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Compositional Steering of Music Transformers</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Halley Young</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vincent Dumoulin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pablo S. Castro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jesse Engel</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cheng-Zhi Anna Huang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Google Brain</institution>
          ,
          <country country="CA">Canada</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Google Brain</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Pennsylvania</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <abstract>
<p>Musical composition is a combinatorial art where composers extend sequences by choosing from a vast set of possible feature combinations that give their compositions their distinctive qualities. Increasingly, composers are using generative models, such as music transformers, for crafting their pieces. Unfortunately, for composers to "steer" these models to satisfy their qualitative features typically requires retraining (which can be prohibitively expensive); further, existing models are unable to deal with arbitrary combinations of features at scale. In this paper we build on lightweight fine-tuning methods, such as prefix tuning and bias tuning, to propose a novel contrastive loss that enables us to steer music transformers over arbitrary combinations of logical features, with a relatively small number of extra parameters. We provide both quantitative and qualitative evaluations of our method which demonstrate its efficacy with respect to existing methods, as well as a case study where our method was used to compose long-form musical pieces. Musical examples are available for listening online.</p>
      </abstract>
      <kwd-group>
        <kwd>creativity</kwd>
        <kwd>co-creativity</kwd>
        <kwd>human-AI co-creation</kwd>
        <kwd>music generation</kwd>
        <kwd>controllable generative models</kwd>
        <kwd>compositionality</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>Prime</title>
      </sec>
      <sec id="sec-1-2">
        <title>Original continuation</title>
        <p>Only one
block chord
“Uses block</p>
        <p>chords”
Steerable
Transformer</p>
        <p>Many block chords!
part of the accept/reject step). While significantly less form structure, by chaining together chunks steered in
labor-intensive, this solution could potentially require diferent directions (using diferent combinations of
feasampling enormous amounts of continuations if, for in- tures), while maintaining long-term coherence (by
leverstance, the requested features significantly difer from aging the transformers’ full self-attention receptive field).
those of the priming sequence (such that it and the de- Listen to examples on this page 1 for longer steered
exsired continuation form an unlikely sequence according amples, and compositions semi-automatically generated
to the generative model). In other words, the pre-trained using our feature tuning approach in a musician-directed
model could be a poor proposal distribution for some way.
applications.</p>
      <p>On the ML side of this work, we are interested in developing better proposal distributions through an adaptation approach which can steer the generative model towards continuations which i) are significantly more likely to exhibit the requested feature, and ii) exhibit a satisfactory musical quality. The approach should be able to accommodate a large number of features without adding significant memory or computation overhead. We achieve this by making features composable, making it possible to steer features independently and also multiple features at once. Figure 1 shows that when using a pre-trained transformer model augmented with a relatively small number of additional parameters, we are able to steer towards arbitrary logical music features and achieve realistic music generation simply by sampling directly from the model. In contrast, the same unconditional transformer model fails to produce any examples satisfying those features even when generating a hundred samples.</p>
      <p>[Figure 1: given the same prime, the original continuation from the unconditional transformer contains only one block chord, while the steerable transformer steered towards "uses block chords" produces many block chords.]</p>
      <p>Our approach can be used to support human-AI co-creation, where musicians can compose on the level of musical "features" as opposed to notes. Musicians can prototype the high-level "shape" of the music by specifying how various features change throughout the piece, and in turn steer and curate music transformers to creatively fill in the details. With our method, a composer can control both the short-chunk features and the long-form structure, by chaining together chunks steered in different directions (using different combinations of features) while maintaining long-term coherence (by leveraging the transformer's full self-attention receptive field). Listen to examples on this page<sup>1</sup> for longer steered examples, and for compositions semi-automatically generated using our feature tuning approach in a musician-directed way.</p>
      <p><sup>1</sup>Listen to musical examples at https://storage.googleapis.com/composing-features/index.html.</p>
    </sec>
    <sec id="sec-1b">
      <title>2. Problem formulation</title>
      <p>Music Transformer is an autoregressive language model which decomposes the joint probability of a sequence of tokens x₁, …, x_T (where each x_t ∈ V, and V is a set of categorical tokens) into
$$p(\mathbf{x} = x_1, \ldots, x_T) = p(x_1) \prod_{t=2}^{T} p(x_t \mid x_1, \ldots, x_{t-1}). \quad (1)$$
It leverages a common modeling approach which represents the conditional probabilities p(x_t | x₁, …, x_{t−1}) using a neural network [<xref ref-type="bibr" rid="ref3">3</xref>]. As its name implies, Music Transformer uses a Transformer network architecture [<xref ref-type="bibr" rid="ref35">48</xref>]. Each token x_t is first mapped to a real-valued embedding e_t (for instance using a lookup table), then the network maps each sequence e₁, …, e_{t−1} to a probability distribution for the value of x_t over the elements of V.</p>
      <p>Sequence continuation in an autoregressive language model works by repeatedly sampling from its distribution over the next token given the previous tokens. Starting from some priming sequence x = (x₀, …, x_N), we first sample x_{N+1} ∼ p(· | x₀, …, x_N), then x_{N+2} ∼ p(· | x₀, …, x_N, x_{N+1}), and so on, until the end of the continued sequence x′ = (x_{N+1}, …, x_T).<sup>2</sup> Many downstream tasks can be cast as sequence continuation problems, including the steerable music generation problem investigated in this work.</p>
      <p><sup>2</sup>To simplify the discussion, we assume a fixed sequence length T, but the explanation applies to sequences of varying lengths as well.</p>
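      <p>As a concrete illustration of this sampling loop, the following minimal Python sketch draws a continuation token by token. The next_token_probs wrapper around the transformer and the vocab list are stand-ins assumed for illustration, not Music Transformer's actual interface.</p>
      <preformat>
# Minimal sketch of autoregressive sequence continuation.
# `next_token_probs(tokens)` is an assumed wrapper around the trained
# transformer returning a probability vector over the vocabulary;
# `vocab` is the list of categorical tokens V.
import random

def continue_sequence(prime, total_length, next_token_probs, vocab):
    """Sample x_{N+1}, ..., x_T, each conditioned on the prime and on
    all previously sampled tokens."""
    tokens = list(prime)
    while len(tokens) &lt; total_length:
        probs = next_token_probs(tokens)        # p( . | x_0, ..., x_t)
        tokens.append(random.choices(vocab, weights=probs, k=1)[0])
    return tokens[len(prime):]                  # the continuation x'
      </preformat>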
      <p>We are given a set of n features Φ = {φ₁, …, φₙ}, where each φᵢ maps a prime–continuation pair to {0, 1}. Each φᵢ takes the value 1 if a prime–continuation pair exhibits that feature (which we note as (x, x′) ⊨ φᵢ), and 0 otherwise. Note that the features must take both prime and continuation sequences as input, since some continuation features may be relative to the priming sequence (e.g. "significantly higher pitch").</p>
      <p>Our true objective with respect to feature φ is to steer the model towards a distribution which maximizes
$$\mathbb{E}_{\mathbf{x}' \mid \mathbf{x}} \left[ (\mathbf{x}, \mathbf{x}') \models \varphi \right] \quad (2)$$
while maintaining musicality. This objective is non-differentiable because (x, x′) ⊨ φ is a non-differentiable satisfiability criterion.</p>
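      <p>To make the feature formalism concrete, the following illustrative Python sketch implements one relative Boolean feature. Representing sequences as lists of MIDI pitches and using a 3-semitone threshold are assumptions of this sketch, not the paper's definitions; the implemented features are listed in Appendix A.</p>
      <preformat>
# Illustrative relative feature phi(x, x') -> {0, 1}:
# "significantly higher average pitch than prime".
def significantly_higher_pitch(prime_pitches, continuation_pitches,
                               threshold=3.0):
    """Return 1 if the continuation's average MIDI pitch exceeds the
    prime's by at least `threshold` semitones (assumed margin), else 0."""
    def avg(pitches):
        return sum(pitches) / len(pitches)
    return int(avg(continuation_pitches) >= avg(prime_pitches) + threshold)
      </preformat>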
        <sec id="sec-1-2-1">
          <title>2. Negatively: we can also take advantage of other</title>
          <p>features  for which (x, x) ̸|= , by
maximizing × (we use the symbol ×
that  is computed using the incorrect parameters</p>
          <p>to denote the fact
 ).</p>
          <p>The positive case corresponds to maximum-likelihood
training. Additionally, we can exploit the intuition that
the adapted parameters   should “explain” the prime–
(2)
continuation pair (x, x) |</p>
          <p>=  better than   (for
some feature  for which (x, x) ̸|= ) or  (the
nonadapted model parameters, with a corresponding loss
while maintaining musicality.</p>
          <p>This objective is
sider the problem of composed features Φ ˆ , i.e.</p>
          <p>In addition to the single-feature problem, we also con- loss of the form
non-diferentiable because
diferentiable satisfiability criterion.</p>
          <p>(x, x) |=  is a non- ∅). In other words, we can maximize the probability of
choosing   over   and  by minimizing a contrastive
(x, x) |= Φ ˆ
≡
(x, x) |=  ,</p>
          <p>(3)
⋀︁
 ∈Φ^ ⊆ Φ
to account for scenarios where a user is interested in
steering the model towards multiple features (such as
in the “stays in key” and “uses block chords” scenario
discussed in the introduction).</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Proposed approach</title>
      <p>We start by describing our proposed approach in the single-feature case, and later on explain how we adapt it to the compositional case.</p>
      <sec id="sec-2-1b">
        <title>3.1. Likelihood-based training</title>
        <p>While approaches using reinforcement learning, such as KL-regularized deep Q-learning [<xref ref-type="bibr" rid="ref21">21</xref>], could be used to overcome the non-differentiability problem, in this work we consider a proxy loss in the form of the negative log-likelihood
$$\mathcal{L} = -\log p_{\theta}(\mathbf{x}' \mid \mathbf{x}), \quad (4)$$
which we use in two ways:</p>
        <p>1. Positively: given a prime–continuation pair (x, x′) ⊨ φ, we find an adaptation θ_φ of the model's parameters θ that minimizes L✓ (we use the symbol ✓ to denote the fact that L is computed using the correct parameters θ_φ). By using prime–continuation examples that sound musical, we ensure that the steered model stays musically grounded.</p>
        <p>2. Negatively: we can also take advantage of other features ψ for which (x, x′) ⊭ ψ, by maximizing L× (we use the symbol × to denote the fact that L is computed using the incorrect parameters θ_ψ).</p>
        <p>The positive case corresponds to maximum-likelihood training. Additionally, we can exploit the intuition that the adapted parameters θ_φ should "explain" the prime–continuation pair (x, x′) ⊨ φ better than θ_ψ (for some feature ψ for which (x, x′) ⊭ ψ) or θ_∅ (the non-adapted model parameters, with a corresponding loss L_∅). In other words, we can maximize the probability of choosing θ_φ over θ_ψ and θ_∅ by minimizing a contrastive loss of the form
$$-\log \frac{p_{\theta_\varphi}(\mathbf{x}' \mid \mathbf{x})}{p_{\theta_\varphi}(\mathbf{x}' \mid \mathbf{x}) + p_{\theta_\psi}(\mathbf{x}' \mid \mathbf{x}) + p_{\theta_\varnothing}(\mathbf{x}' \mid \mathbf{x})} = -\log \frac{e^{-\mathcal{L}_\checkmark}}{e^{-\mathcal{L}_\checkmark} + e^{-\mathcal{L}_\times} + e^{-\mathcal{L}_\varnothing}}. \quad (5)$$</p>
        <p>We propose a loss that interpolates between Equations 4 and 5 using an α coefficient (which is treated as a hyperparameter):
$$\mathcal{L} = \alpha \cdot \mathcal{L}_\checkmark - (1 - \alpha) \cdot \log \frac{e^{-\mathcal{L}_\checkmark}}{e^{-\mathcal{L}_\checkmark} + e^{-\mathcal{L}_\times} + e^{-\mathcal{L}_\varnothing}}. \quad (6)$$
Intuitively, the maximum-likelihood setting should suffice to achieve our adaptation goals, but in practice we find that the approach benefits from the inclusion of negative cases through a contrastive loss term. We tried different α values and found α = 0.8 to work well in practice. See Figure 2 for an illustration of the training setup.</p>
        <p>[Figure 2: illustration of the training setup. Prime–continuation pairs from the Maestro dataset are used to adapt the steerable transformer, contrasted against the unconditional transformer.]</p>
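        <p>For concreteness, the following Python sketch evaluates the interpolated loss of Equation 6 from the three negative log-likelihoods, using a numerically stable softmax over the negated losses. It assumes the three loss values have already been computed; in training they would be differentiable tensors rather than floats.</p>
        <preformat>
# Sketch of the interpolated loss in Equation 6.
# nll_correct   = L with the adapted parameters theta_phi   (L "check")
# nll_wrong     = L with the mismatched parameters theta_psi (L "cross")
# nll_unadapted = L with the non-adapted parameters          (L "empty")
import math

def interpolated_loss(nll_correct, nll_wrong, nll_unadapted, alpha=0.8):
    # Contrastive term (Equation 5): probability of choosing theta_phi
    # over theta_psi and the unadapted model, i.e. a softmax over
    # negated losses, shifted by the minimum for numerical stability.
    m = min(nll_correct, nll_wrong, nll_unadapted)
    denom = sum(math.exp(-(nll - m))
                for nll in (nll_correct, nll_wrong, nll_unadapted))
    log_p_correct = -(nll_correct - m) - math.log(denom)
    # Equation 6: interpolate maximum-likelihood and contrastive terms.
    return alpha * nll_correct - (1.0 - alpha) * log_p_correct
        </preformat>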
      </sec>
      <sec id="sec-2-2b">
        <title>3.2. Feature-conditional adaptation</title>
        <p>Fine-tuning all model parameters can be prohibitive if the number n of features is large (let alone combinatorially large in the compositional case); however, recent work provides effective and lightweight alternatives:</p>
        <p>1. Prefix tuning [28] works by prepending learnable task embeddings e_{−K}, …, e_{−1} to the priming sequence embeddings e₁, …, e_N. The loss is then minimized with respect to these prefix embeddings only, leaving the pre-trained model parameters frozen.</p>
        <p>2. Bias tuning, following BitFit [<xref ref-type="bibr" rid="ref2">2</xref>], adapts only additive bias parameters: a learnable, feature-specific bias vector is added to the output of inner layers, while the remaining pre-trained parameters stay frozen.</p>
      <sec id="sec-2-1">
        <title>In practice, while prefix tuning showed promise in</title>
        <p>the single-feature setting, we were unable to make it
work in the compositional setting. We therefore focus
our investigation on bias-tuning for the compositional
domain.</p>
        <p>In the compositional setting, a naive approach requires
learning 2|Φ| − 1 model adaptations. Instead, we propose
to express the adaptation for a composed feature Φ ˆ as the
combination of the   of its underlying features  ∈ Φ ˆ .
More specifically, for bias-tuning we average the adapted
biases as
  =
1
ˆ
|Φ |  ∈Φ^
∑︁  , .</p>
        <p>(7)</p>
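        <p>As a sketch of Equation 7, the following hypothetical Python snippet averages the single-feature bias vectors layer by layer; representing an adaptation as a dictionary mapping layer names to bias vectors is an assumption made here for illustration.</p>
        <preformat>
# Sketch of Equation 7: the bias adaptation for a composed feature set
# is the per-layer average of its single-feature bias adaptations.
def compose_biases(per_feature_biases):
    """per_feature_biases: non-empty list of {layer: [float, ...]}
    dictionaries, one per feature in the composed set. Returns the
    averaged bias vector for each layer."""
    n = len(per_feature_biases)
    first = per_feature_biases[0]
    return {
        layer: [sum(b[layer][i] for b in per_feature_biases) / n
                for i in range(len(first[layer]))]
        for layer in first
    }
        </preformat>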
        <p>Note that we do not simply use the above heuristic to compose feature adaptations post-hoc; we train the model in a compositional setting (by sampling prime–continuation pairs (x, x′) ⊨ Φ̂ for various composed features) so that the single-feature adaptations can learn to work well in conjunction with each other. See Figure 3 for an illustration.</p>
      </sec>
    </sec>
    <sec id="sec-2b">
      <title>4. Experimental setup</title>
      <p>In this section we describe our experimental setup: the musical features we want to enable users to control, the procedure for setting up the training data for adapting the feature-conditional parameters, and the details of the prefix tuning and bias tuning setups.</p>
      <sec id="sec-2b-1">
        <title>4.1. Musical features</title>
        <p>To support users in controlling different aspects of music, such as harmony, texture, speed, and dynamics, we implemented eighteen features (see Appendix A for a complete list).</p>
        <p>We included both absolute and relative features. Absolute features apply only to the continuation, while relative features describe the relationship between the prime and the continuation, such as the average pitch in the continuation being significantly higher than in the prime, or the rhythmic density in the continuation being significantly lower than in the prime. Note that this type of model should work with any feature function which takes in sequences and returns a Boolean. In this paper, we chose examples that have clear musical meanings and are easy to implement.</p>
        <p>Dataset features: Before adapting model parameters to specific binary musical features, we may first need to "fine-tune" a model to better fit the general musical style of a given dataset (in order to support the scenario where a user brings their own dataset). We refer to these as dataset-level features.</p>
        <p>[Figure 3: the feature-conditional adaptation architectures. In prefix tuning, feature prefixes are concatenated or averaged and passed through a shared MLP before the input; in bias tuning, per-feature bias vectors produced by MLPs are averaged and added to the output of each inner layer.]</p>
      </sec>
      <sec id="sec-2b-2">
        <title>4.2. Training data and setup</title>
        <p>The unconditioned model we adopt as a base for adaptation is a pretrained music transformer<sup>3</sup> trained on transcribed YouTube piano performances (where the music was typically more melodic). To mimic a user bringing their own dataset with a distinctive style, we use the Maestro dataset [<xref ref-type="bibr" rid="ref14">14</xref>], an open-source collection of virtuoso performances of classical piano music.</p>
        <p><sup>3</sup>See the blog post "Generating Piano Music with Transformer", https://magenta.tensorflow.org/piano-transformer.</p>
        <p>To prepare the prime–continuation pairs needed for likelihood-based training, we serialize the Maestro piano performances into an event-based encoding [33] (the same representation that the pretrained music transformer was trained on) and then take random crops of 200 tokens (which last approximately 4 to 20 seconds). We then split each crop in half, resulting in a prime x and a continuation x′ that are each 100 tokens long.</p>
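        <p>A minimal sketch of this preparation step follows; performances is assumed to be a list of event-token sequences in the encoding of [33], and the number of crops taken per piece is a parameter introduced here for illustration.</p>
        <preformat>
# Sketch of prime-continuation pair preparation: random 200-token crops,
# each split into a 100-token prime and a 100-token continuation.
import random

def make_pairs(performances, crop_len=200, crops_per_piece=10):
    pairs = []
    for tokens in performances:
        if len(tokens) &lt; crop_len:
            continue
        for _ in range(crops_per_piece):
            start = random.randrange(len(tokens) - crop_len + 1)
            crop = tokens[start:start + crop_len]
            half = crop_len // 2
            pairs.append((crop[:half], crop[half:]))  # (prime, continuation)
    return pairs
        </preformat>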
        <p>To prepare training sets for feature-specific adaptations, for each feature φ ∈ Φ we take the subset (with replacement) of all the pairs that exhibit the feature (i.e. (x, x′) ⊨ φ). Note that since all prime–continuation pairs are drawn from the Maestro dataset, we implicitly assume that every pair exhibits the dataset-level feature.</p>
        <p>Single-feature setting: For the non-contrastive setting, for each feature φ we minimize the loss introduced in Equation 4 to adapt its parameters θ_φ to better fit the set of prime–continuation pairs where (x, x′) ⊨ φ.</p>
      <sec id="sec-2-2">
        <title>3See blog post "Generating Piano Music with Transformer"</title>
        <p>https://magenta.tensorflow.org/piano-transformer.</p>
        <p>In the contrastive setting, we minimize the
contrastive loss introduced in Equation 6. For each prime–
continuation pair (x, x) |=  used to compute ✓,
we also need to select a “negative” case for
computing × . We achieve this by selecting another feature
 at random, and a prime–continuation pair where both
(x, x) |=  and (x, x) ̸|=  holds.</p>
        <p>Compositional setting Training with a contrastive
loss yielded more efective models in the single-feature
setting (see section 5 for details), hence we adopt the
contrastive loss for all of the subsequent experiments in
the compositional setting.</p>
        <p>To prepare training examples for steering any number of features at once, we first extend Φ so that each feature has a corresponding "negated" feature (e.g. "has both loud and soft pitches" vs. "no contrast in dynamics"), bringing the total number of features from |Φ| = 18 to 2·|Φ| = 36. By definition, each prime–continuation pair exhibits exactly 18 features. When sampling a prime–continuation pair (x, x′), we compute L✓ with respect to a random subset of those 18 features (with size drawn at random from U[1, 12]). We compute L× by negating some of the sampled features.</p>
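        <p>The following hypothetical sketch illustrates this sampling scheme; representing negated features with a "not-" prefix, and flipping a uniformly chosen non-empty subset of the sampled features for the negative case, are conventions assumed here rather than taken from the paper.</p>
        <preformat>
# Sketch of compositional training-example sampling: draw a positive
# feature subset of size U[1, 12] from the 18 features a pair exhibits,
# then negate some of them to obtain the negative conditioning set.
import random

def sample_feature_sets(exhibited_features, max_size=12):
    """exhibited_features: list of the 18 features the pair exhibits."""
    k = random.randint(1, min(max_size, len(exhibited_features)))
    positive = random.sample(exhibited_features, k)
    num_flip = random.randint(1, len(positive))
    flipped = set(random.sample(range(len(positive)), num_flip))
    negative = [("not-" + f) if i in flipped else f
                for i, f in enumerate(positive)]
    return positive, negative
        </preformat>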
      </sec>
      <sec id="sec-2b-3">
        <title>4.3. Evaluation metrics</title>
        <p>During training, we minimize a likelihood-based proxy loss to fit the prime–continuation pairs as well as possible, either non-contrastively or contrastively. However, there is no guarantee that a model trained this way to a lower loss would be more effective when steered to produce requested musical features. Hence, we propose to evaluate our "downstream" true objective along two axes. In the subsections below, we define how we evaluate sampling efficacy, and explain how we break down reported results according to the "inherent" difficulty in steering.</p>
        <sec id="sec-2b-3-1">
          <title>4.3.1. Sampling efficacy</title>
          <p>At test time, what we care about is: when a user requests a feature φ, regardless of what the prime x is, what is the probability that a tuned model can achieve it (i.e. generate continuations x̂′ that exhibit feature φ)? Hence, we propose the following sampling efficacy (i.e. probability of achievement) as our true evaluation metric, quantifying, on average using rejection sampling, what proportion of a model's generated continuations x̂′ exhibit feature φ:
$$\mathbb{E}_{\mathbf{x}} \, \mathbb{E}_{\hat{\mathbf{x}}' \mid \mathbf{x}} \left[ (\mathbf{x}, \hat{\mathbf{x}}') \models \varphi \right], \quad (8)$$
where the first expectation is under the data distribution (approximated by sampling primes x from a given dataset), and the second is under the model being evaluated (approximated by conditioning on a prime x, and then using the model to generate continuations x̂′).</p>
        </sec>
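        <p>A Monte Carlo estimate of Equation 8 can be sketched as follows, with generate and phi as stand-ins for the model's sampler and a Boolean feature function; the 10 continuations per prime mirror the procedure used in section 5.</p>
        <preformat>
# Sketch of the sampling-efficacy estimate (Equation 8): the outer loop
# approximates the expectation over primes, the inner loop the
# expectation over model continuations.
def sampling_efficacy(primes, generate, phi, continuations_per_prime=10):
    hits, total = 0, 0
    for prime in primes:
        for _ in range(continuations_per_prime):
            continuation = generate(prime)
            hits += phi(prime, continuation)   # 1 if feature achieved
            total += 1
    return hits / total
        </preformat>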
      <sec id="sec-2-4">
        <title>The unconditioned music transformer model allows</title>
        <p>us to establish a baseline for “inherently” how dificult
or easy it is to achieve a certain feature by priming the
model and measuring how often its generated
continuation carries that feature.</p>
        <p>Figure 4 shows for each feature (blue dot) the
probability of achieving it when the unconditioned music
transformer was primed with cofactual primes (x-axis) versus</p>
        <p>Ex Ex^|x ︀[ (x, x^) |=  ]︀ (8) counterfactual primes (y-axis). Intuitively, we expect it
where the first expectation is under the data distribu- to be more dificult for a prime that is counterfactual
tion (approximated by sampling primes x from a given to a feature to produce continuations with that feature,
dataset), and the second is under the model being evalu- indeed this is what we observe, i.e. in the figure all the
ated (approximated by conditioning on a prime x, and blue dots are below the diagonal line. Given this
consisthen using the model to generate continuations x^). tency, whenever we report results on sampling eficacy
(probability of achievement) in the paper, we breakdown
Steering dificulty Intuitively, given a prime x, cer- our results to compare how methods fare on the hard
tain musical features would follow more musically than (counterfactual) versus the easier (cofactual) cases.
others. That is, the prime can impact how dificult it is to
steer a model to generate continuations x^ that exhibit a 4.4. Implementation
certain feature  , hence afecting the sampling eficacy
of a model on that feature. Single-feature setting We use a standard prefix
tun</p>
        <p>Single-feature setting: We use a standard prefix tuning model as per [28], associating every feature φ with a prefix. As in their paper, we use the finding that a relatively low-dimensional embedding space of feature prefixes (200 dimensions), projected to a higher-dimensional embedding (2048 dimensions) using an MLP shared across feature prefixes, works well. The pre-trained transformer model had no parameters modified, but was changed to accept each 2048-dimensional vector as four 512-dimensional vectors prepended before the 512-dimensional vectors representing the input tokens (as the transformer's latent embedding was 512-dimensional). The resulting sequence of vectors was masked using causal attention in order to predict x′.</p>
        <p>Compositional setting: The loss used in the compositional setting is almost identical to Equation 6, the main difference being the contrastive loss implementation. The features used to compute L× are obtained by negating some of the features used to compute L✓, and the fraction f of negated features is used to modulate the interpolation coefficient as α′ = α − min(α, f). The intuition is that the differences in steering should be more noticeable when comparing feature sets with less overlap.</p>
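        <p>As a small sketch, this modulation of the interpolation coefficient can be written as follows (the argument names are illustrative):</p>
        <preformat>
# Sketch of the modulated interpolation coefficient: the larger the
# fraction f of negated features, the smaller the weight alpha'.
def modulated_alpha(alpha, num_negated, num_sampled):
    f = num_negated / num_sampled     # fraction of negated features
    return alpha - min(alpha, f)      # alpha' = alpha - min(alpha, f)
        </preformat>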
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. The single-feature setting</title>
      <p>We first examine the single-feature setting, before going into the compositional setting. For the experiments in this section, we focus on investigating whether our proposed contrastive loss (introduced in Equation 6) is beneficial. We find that it is: training with a contrastive loss not only enables learning parameters that "explain" musical features in a more discriminative fashion, but also results in a more steerable model, i.e. one that, when conditioned on a musical feature, is more likely to generate samples exhibiting that feature.</p>
      <sec id="sec-3-1b">
        <title>5.1. Likelihood-based training dynamics</title>
        <p>Intuitively, the maximum-likelihood setting should suffice to achieve our adaptation goals, but in practice we find that the approach benefits from the inclusion of negative cases through a contrastive loss term (introduced in Equation 6), as demonstrated in Figure 5. The positive case (blue line) corresponds to the maximum-likelihood term, while the negative case (orange line) corresponds to the contrastive loss term. The visually "diverging" training curves in the contrastive setting show that it has learned to "explain" features using very different parameters. That is, when evaluating a prime–continuation pair (x, x′) with parameters θ_ψ adapted for a feature ψ that the pair does not exhibit ((x, x′) ⊭ ψ), the likelihood loss becomes very high.</p>
      </sec>
      <sec id="sec-3-2b">
        <title>5.2. Sampling efficacy</title>
        <p>In this section, we compare how different methods perform "downstream" on the objective we care about: when the user brings a prime and a desired feature, how likely is the model to generate a continuation that exhibits that feature? The sampling efficacy metric (defined in subsection 4.3.1) gives, on average, a method's probability of achieving requested features.</p>
        <p>Procedure: In the following experiments, to compute a method's sampling efficacy on a feature, we take 512 random primes from the feature's validation set, generate 10 continuations for each prime, and then check what proportion of the continuations satisfy the requested feature. We then average over all features to obtain the average sampling efficacy of a method. This is the cofactual case; for the counterfactual case, we instead take primes that are counterfactual to the requested feature.</p>
        <p>[Table 1: sampling efficacy of the unconditional model and of prefix tuning with non-contrastive and contrastive losses, in the cofactual and counterfactual cases.]</p>
        <p>Results: In Table 1, we see that prefix tuning (especially with a contrastive loss) significantly increases sampling efficacy over the unconditioned model, both in the easier cofactual cases and in the harder counterfactual cases. All of the improvements are of a large margin; for example, in the hard counterfactual cases, prefix tuning with a contrastive loss increases the probability of achieving a feature by 70%, and within prefix tuning, the gain from switching from a non-contrastive loss to a contrastive loss is 18%.</p>
        <p>As sampling efficacy can vary with the prime and the feature requested, in Figure 6 we enumerate how different methods perform for each of the musical features (18 total). We see that prefix tuning (especially with a contrastive loss) yields a higher sampling efficacy (y-axis) for most features. As the contrastive loss yielded more effective models in the single-feature setting, we adopt it for all of the subsequent experiments in the compositional setting.</p>
        <p>[Figure 6: Prefix tuning a model (especially with a contrastive loss) increases its sampling efficacy (y-axis) for most of the 18 single features. The colored dots that line up vertically correspond to the same feature. Each feature's steering difficulty (x-axis) is quantified as the sampling efficacy of the unconditioned model, which gives a baseline of how difficult or easy it is to sample that given feature. Most of the green (non-contrastive) and orange (contrastive) dots are above the diagonal line, meaning prefix tuning increases the probability of achieving them. The vertical dotted orange lines show the improvement of the contrastive setting over the non-contrastive setting.]</p>
      </sec>
    </sec>
    <sec id="sec-3b">
      <title>6. The compositional setting</title>
      <p>In the following experiments, we first illustrate how difficult compositional steering is under an unconditioned model. Then, we show quantitatively how our bias-tuning method can enable compositional steering with much higher sampling efficacy, and qualitatively how it achieves this while also improving on the musical quality of the steered examples (compared to generating from the "tail" of the unconditioned model).</p>
      <sec id="sec-3b-1">
        <title>6.1. Sampling efficacy</title>
        <p>Unconditioned model: Intuitively, we expect it to be ineffective to rely on rejection sampling (on an unconditioned model) to give us samples that exhibit a specific large set of features. Under a naive assumption of independence among features, one would expect the efficacy of rejection sampling in achieving a total of k features to roughly equal p^k, where p is the probability of a single feature being satisfied.</p>
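        <p>As an illustrative calculation under this independence assumption (the per-feature probability used here is hypothetical): with p = 0.35, requesting k = 6 features would succeed with probability p^k = 0.35^6 ≈ 0.0018, i.e. roughly 0.2%, the same order of magnitude as the unconditional results reported below.</p>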
        <p>While the above assumption is naive, we do find that in the unconditioned setting, sampling efficacy drops substantially as the number of requested features increases. For example, when the number of requested cofactual features is 6, all features were achieved after rejection sampling only 0.2% of the time (see the unconditional line in Figure 7).</p>
      <sec id="sec-3-1">
        <title>Bias tuning With bias tuning, sampling eficacy is</title>
        <p>more efective overall and less brittle with respect to the
number of features. We see in Figure 7 4 that even though
sampling eficacy decreases for all methods when more</p>
      </sec>
      <sec id="sec-3-2">
        <title>4Note that the single-feature steering eficacy presented in the</title>
        <p>compositional setting (Figure 7) is much higher than that in Table 1</p>
        <p>Difficulty vs. Efficacy</p>
        <p>Cofactu</p>
        <p>M
U
Bi
Pr</p>
        <p>Pr
1</p>
        <p>2 3 4
Number of steere
lty vs. Efficacy</p>
        <p>Cofactual</p>
        <p>Random
features are requested as expected, bias-tuning remained rejection sampling on the unconditional model.
a viable steering approach even in the 6-feature setting,
with a 7.9% probability of achievement versus only 0.2% Listening study setup The listening test consists of
for the unconditioned model. This meant for the user, questions posed as pairwise comparisons of musical
exusing bias-tuning on average 13 samples is needed to ar- amples, where both examples started with the same prime
rive at one that exhibits their requested 6 features, while as the first half of the example, while one’s continuation
for the unconditioned model it would take 500 samples, was generated by the unconditional model, while another
making the latter an unfeasible approach. was generated with our bias-tuned model. The ordering</p>
        <p>Comparing the other methods, bias tuning to the Maestro dataset (without tuning for specific features) improves slightly over the unconditioned model (with larger gains in the 2-to-3-feature case in the cofactual setting), and, as expected, underperforms bias tuning that also conditions on specific features. Bias tuning outperforms prefix tuning consistently in the compositional setting.</p>
      </sec>
      <sec id="sec-3b-2">
        <title>6.2. Musical quality</title>
        <p>The above experiments evaluate the effectiveness of methods in achieving the requested features. To evaluate the musical quality of these "successful" generations, we conducted a listening study with musicians, comparing the steering of three features between our overall best adaptation approach (i.e. bias tuning) and the baseline of rejection sampling on the unconditional model.</p>
        <p>Listening study setup: The listening test consists of questions posed as pairwise comparisons of musical examples. Both examples in a pair started with the same prime as their first half; one continuation was generated by the unconditional model, and the other by our bias-tuned model. The ordering within each pair was randomized. Listeners were asked to rate which one they thought was more musical. To prepare the samples, we randomly picked 120 primes from the test set. For steering each prime, we randomly chose a three-feature set that was present in the dataset as the conditioning features.</p>
        <p>Results: We asked eight musicians to each rate fifteen pairs. The results show that our bias-tuning approach was strongly preferred over the unconditional model, and the results were statistically significant (p &lt; 0.0003). Bias tuning won 63 of the pairwise comparisons, tied for 26 pairs, and lost for 31 pairs. This shows that our bias-tuning approach is not only more effective in steering features, but also produces musically more compelling results.</p>
        <p>Discussion: It may seem surprising that bias-tuning was able to produce samples that were perceived as more musical than those of the original expressive unconditional model. We hypothesize that this is because we are essentially sampling from the "tail" of the unconditioned model's distribution when using rejection sampling (i.e. only accepting samples that exhibit the requested features). Since the unconditional model is not trained to generate specific features, we may have to disregard, for example, 99.8% of the generated samples before finding a satisfactory one (as seen in the 6-feature setting described in subsection 6.1). The resulting distribution is very different from the distribution of the unconditional model without rejection sampling.</p>
        <p>We further hypothesize that, via the proxy loss of increasing the likelihood of a set of continuations with a given feature, a bias-tuned model learns more likely ways to satisfy that feature. Hence, its continuations are not only more likely to be effective at satisfying the feature, but also to do so in a musically likely manner, which listeners often rate as more musical too.</p>
      </sec>
    </sec>
    <sec id="sec-3c">
      <title>7. Case study: Human-AI co-creation</title>
      <p>In the previous sections, we evaluated our approach algorithmically, and found that it is quite effective in steering music transformers to generate continuations with multiple requested features present. In this section, we wanted to understand how a generative model with such steerability can be useful in a human-AI co-creation setting. As a preliminary study, we asked one of our co-authors, who is also a composer, to put our model to the test by carrying out computer-assisted composition.</p>
      <p>User background and creative strategies: Our composer had a background in both purely human and algorithmic composition. She experimented with using the tool in several settings. In some settings, her goal was to co-create music of a specific high-level "vibe", which she accomplished through the inclusion of specific features and by manually curating the generated samples.</p>
      <p>• To achieve "pleasant but low energy", she chose features such as diatonicism and harmonic stasis, block chords, lack of extreme dynamic change, and a mostly low rhythmic density. Listen to this example and other "vibes" online.<sup>5</sup></p>
      <p>• Inspired by a given prime that had a "roller-coaster"-like quality, she wanted to write a piece with a cyclic shape which oscillated between melodic and textural extremes. She accomplished this through the coupling and cycling of features such as relative and absolute rhythmic density, pitch height, and existence of block chords.</p>
      <p>Seeing that it is possible to steer a music transformer to generate specific high-level "vibes" through low-level features, our composer was inspired to compose a theme and variations where each variation would continue the same prime but with a different "vibe", by invoking a different set of low-level features.<sup>6</sup> This allowed her to combine her knowledge of non-automated algorithmic composition to compose a "quasi-algorithmic" suite of variations. Algorithmic composition involves using computational logic to choose notes; instead, in this piece, she used computational logic to choose features.</p>
      <p>Composer's reflection: The composer found that the tool supported her at every point on the human-machine spectrum: as a composer who occasionally wants to avoid using any generation tools in her final output, as a composer who wants to co-create in order to minimize both manual coding and manual composing, and as an algorithmic composer. As a composer trying to improve her "manual" composition and analysis skills within specific scenarios (e.g., fast-paced and tense music with chromaticism but also a degree of stasis), and who is used to manually searching for examples of repertoire with such properties in order to learn from them, she found that the ability to generate a piece of music tailored to a specific scenario forms a surprisingly powerful pedagogical tool which she can leverage when composing by hand.</p>
      <p>The workflow of co-creating music made her feel that she was the author (rather than the computer), yet allowed her to create piano music of a complexity and virtuosity her piano skills do not currently afford. The composer felt that this tool was even more powerful in the context of algorithmic composition. Her typical workflow in algorithmic composition involves using tools such as SuperCollider [51] to adapt an existing algorithm to generate low-level notes; here she could still use algorithmic thinking, but at a higher level of abstraction. For instance, to compose the suite of variations presented earlier, she wrote a Python program to orchestrate which set of musical features is used for each variation. In contrast, while in the past she would invoke a Markov chain to flesh out each variation, here she could invoke a powerful generative model.</p>
      <p>In this case study, as the composer was also the creator of this tool, she was in a unique position to explore its full potential. To make the tool accessible to a broader audience, future work could include "Hello AI"-like [<xref ref-type="bibr" rid="ref5">5</xref>] exercises to introduce how to compose algorithmically with transformers, akin to the ones found in textbooks on SuperCollider and other algorithmic composition environments.</p>
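      <p>The kind of Python orchestration described above might look like the following hypothetical sketch, where steered_continuation and the feature names are stand-ins rather than the actual tool's API:</p>
      <preformat>
# Hypothetical sketch of algorithmic composition at the feature level:
# choose a feature set per variation, let the steered model fill in notes.
def compose_variations(prime, vibes, steered_continuation):
    """vibes: list of feature sets, one per variation."""
    return [steered_continuation(prime, features) for features in vibes]

vibes = [
    {"diatonic", "harmonic-stasis", "block-chords", "low-rhythmic-density"},
    {"higher-pitch-than-prime", "more-notes-per-second-than-prime"},
]
      </preformat>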
      <sec id="sec-3-3">
        <title>5Listen to diferent co-created “vibes” at https://storage.</title>
        <p>googleapis.com/composing-features/index.html#vibes.</p>
      </sec>
      <sec id="sec-3-4">
        <title>6Listen to a co-created theme and variation at https://storage.</title>
        <p>googleapis.com/composing-features/index.html.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>8. Related Literature</title>
      <p>Our work builds on language modeling, further exploring ways to "tune" these models for user control through "fine-tuning" approaches such as prefix and bias tuning. In particular, we leverage contrastive learning and compositionality to derive a lightweight augmentation for steering large transformer-based language models. Our approach enables human-AI co-creation, allowing users to compose at a higher level of abstraction by specifying features while music transformers fill in the notes. In the following, we provide a brief overview of each of the aforementioned related research areas.</p>
      <p>Language models as generative models: This work would be impossible without the wealth of research on transformer models for sequence generation tasks. Starting with Vaswani et al. [<xref ref-type="bibr" rid="ref35">48</xref>], researchers realized that this paradigm enables far more coherent and diverse generation than the recurrent neural networks typically used before [<xref ref-type="bibr" rid="ref27">40</xref>]. Another milestone was the development of GPT-3, which showed that with sufficient size and training data such models could potentially perform few-shot learning [<xref ref-type="bibr" rid="ref4">4</xref>]. Several subsequent papers, however, demonstrated the inadequacy of few-shot learning for many tasks [<xref ref-type="bibr" rid="ref24">37</xref>]. A few alternatives to online few-shot learning have since emerged.</p>
      <p>"Fine-tuning" language models: Prompt tuning was initially explored ad hoc in the context of finding ways to produce interesting output from GPT-3, and was formalized by [27]. While prompts were originally designed as tokens from the transformer's vocabulary, subsequent studies generalized them to arbitrary embeddings prepended to the input [28]. We extend research into prefix tuning by considering aggregation methods among compositional prefixes. Feature-wise transformations, such as elementwise scaling and/or biasing of features in a network based on side-information (such as labels for a conditioned task), have been applied in a wide variety of problem settings (see [<xref ref-type="bibr" rid="ref11 ref37">11</xref>] for an overview). We draw direct inspiration from BitFit's bias-tuning approach [<xref ref-type="bibr" rid="ref2">2</xref>] and cast it as a feature-wise transformation approach. By factoring the additive perturbations in Figure 3 into their preceding layers, our bias-tuning implementation can be described as a multi-task variant of BitFit where the parameterizations are tied across features. A key difference is that our feature-specific adaptations are designed to be composable, which to the best of our knowledge has not yet been explored in the context of large language models, although compositional adaptations using feature-wise multiplicative interactions have been studied in the context of zero-shot image classification [<xref ref-type="bibr" rid="ref25">38</xref>]. Similarly, side-tuning, or summing task-specific features with general language-model features, has shown significant enhancements in few-shot learning, but is typically not performed compositionally [53].</p>
      <p>Contrastive learning: Contrastive learning is used in representation learning to train a network which maps "similar" (positive) inputs to nearby representations and "dissimilar" (negative) inputs far away from the positive inputs. See [26] for a theoretical framework and overview. In generative modeling, contrastive objectives have been used to train Restricted Boltzmann Machines [<xref ref-type="bibr" rid="ref30">43</xref>] via contrastive divergence [<xref ref-type="bibr" rid="ref15">15</xref>], image-to-image translation models [<xref ref-type="bibr" rid="ref1 ref22">1, 35, 30</xref>], and conditional [23] and unconditional [22] generative adversarial networks [<xref ref-type="bibr" rid="ref12 ref38">12</xref>]. Our contrastive formulation differs from previous work in two ways. First, rather than selecting positive and negative examples related to the conditioning signal (musical features in our case) and using the contrastive loss to predict which example "agrees" with the signal, we select positive and negative conditioning signals (i.e. different musical features) and use the contrastive loss to predict which conditioning signal explains the prime–continuation pair best. Second, we also treat the absence of a conditioning signal (i.e. the original generative model) as a negative conditioning signal, meaning that we want the model conditioned on the "correct" musical features to explain the prime–continuation pair better than the unconditional model. Other work leveraging language models for multiple tasks includes CTRL [25] and Plug-and-Play models [<xref ref-type="bibr" rid="ref8">8</xref>]. However, CTRL has the disadvantage that it requires knowing the tasks during the training of the large language model, while Plug-and-Play requires multiple passes through the language model, which can be expensive for sufficiently large models.</p>
      <p>Controllable generative models for music: Advances in sequence modeling [<xref ref-type="bibr" rid="ref33 ref34 ref7">46, 49, 47, 7</xref>] have enabled long-form music generation in both the symbolic domain [<xref ref-type="bibr" rid="ref16 ref20 ref23">33, 20, 36, 31, 16</xref>] and the audio domain [46, 14, 10, 9]. Similar to language, researchers in music generation have been adapting these language models towards controllable generation, for example by conditioning on one part of a musical piece to complete the rest, as in melody harmonization [<xref ref-type="bibr" rid="ref28 ref6">41, 29, 6</xref>] or, more generally, arbitrary partial score completion [<xref ref-type="bibr" rid="ref13 ref17">17, 13</xref>]. Representation learning approaches such as autoencoders (AEs) and variational autoencoders (VAEs) have also been used for steering, through interpolations or transformations along learned latent dimensions: a low-level disentangled attribute-based dimension such as note density [<xref ref-type="bibr" rid="ref26">39, 24</xref>], a high-level learned dimension such as an energy level in mood that is then realized through its mapping to low-level features such as note density and rhythm [<xref ref-type="bibr" rid="ref32">45</xref>], controlling chord progressions and texture independently [<xref ref-type="bibr" rid="ref29">42, 50</xref>], or rearranging a piece to have increased polyphony or rhythmic density [52].</p>
      <p>Human-AI co-creation in music: Controllable generative models typically do not offer users full control (i.e. they only allow users to specify a small number of low-level or high-level controls, or an example), relying instead on their learned stylistic distribution and/or features encoded from the user-specified template piece to fill in the rest of the musical details [<xref ref-type="bibr" rid="ref19 ref31">44, 32, 19</xref>]. In contrast, traditional constraint-satisfaction-based music generation systems have no prior knowledge of the desired stylistic distribution, and instead rely on users to specify a large number of musical constraints to guide their search (see [34] for a survey). When using the former systems, users may still feel a lack of agency, while the latter can impose a laborious process. Our approach explores the space in between, allowing users to compose multiple features along different musical dimensions for short chunks (similar to constraint specification), while leveraging pretrained transformers' expressiveness to aid users in maintaining coherence in virtuosic long-form composition.</p>
    </sec>
    <sec id="sec-5">
      <title>9. Conclusion</title>
      <p>We have shown that music transformers can be directed towards a specific generative "task" using some of the same methods as natural language transformers. In addition, we have studied compositionality in this domain. Compositionality (including relatively high degrees of compositionality) is critical in the music domain (and other domains) if the user wants control over the output. We establish that compositional steering is a hard problem, and propose adaptations of several solutions from the literature (bias-tuning and prefix-tuning) to address these challenges. We find success with bias-tuning, but not with prefix-tuning. While our results are promising, there is clearly significant room for improving on the efficacy of steering a transformer compositionally.</p>
      <p>We have provided a preliminary demonstration of how our approach, compositional steering, enables human-AI co-creation, where musicians can compose on the level of musical "features" as opposed to notes. Musicians can prototype the high-level "shape" of the music by specifying how various features change throughout the piece, and in turn steer and curate music transformers to creatively fill in the details. We envision future work on end-user machine learning, where users can define their own features or provide their own musical examples, and leverage our lightweight compositional bias-tuning approach to learn new controls to steer expressive music transformers compositionally.</p>
    </sec>
    <sec id="sec-6">
      <title>Appendix A. Musical features</title>
      <p>1. "Significantly lower average pitch than prime"
2. "Significantly higher average pitch than prime"
3. "Significantly higher number of grouped attacks than prime"
4. "Significantly lower number of grouped attacks than prime"
5. "Significantly more notes per second than prime"
6. "Significantly fewer notes per second than prime"
7. "Could fit in the same key as the prime"
8. "Has 2 or more pitch classes (pitch mod 12) which the prime doesn't have"</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Kyungjune</given-names>
            <surname>Baek</surname>
          </string-name>
          , Yunjey Choi, Youngjung Uh, Jaejun Yoo, and
          <string-name>
            <given-names>Hyunjung</given-names>
            <surname>Shim</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>Rethinking the truly unsupervised image-to-image translation</article-title>
          .
          <source>In Proceedings of the IEEE/CVF International Conference on Computer Vision</source>
          . 14154-
          <fpage>14163</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Elad</given-names>
            <surname>Ben-Zaken</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Shauli</given-names>
            <surname>Ravfogel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Yoav</given-names>
            <surname>Goldberg</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models</article-title>
          .
          <source>ArXiv abs/2106</source>
          .10199 (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          , Réjean Ducharme, Pascal Vincent, and
          <string-name>
            <given-names>Christian</given-names>
            <surname>Janvin</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>A neural probabilistic language model</article-title>
          .
          <source>The journal of machine learning research 3</source>
          (
          <year>2003</year>
          ),
          <fpage>1137</fpage>
          -
          <lpage>1155</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei
          .
          <year>2020</year>
          .
          <article-title>Language Models are Few-Shot Learners</article-title>
          .
          <source>Advances in Neural Information Processing Systems</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Carrie</surname>
            <given-names>J Cai</given-names>
          </string-name>
          , Samantha Winter, David Steiner,
          <string-name>
            <given-names>Lauren</given-names>
            <surname>Wilcox</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Michael</given-names>
            <surname>Terry</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>"Hello AI": Uncovering the onboarding needs of medical practitioners for human-AI collaborative decisionmaking</article-title>
          .
          <source>Proceedings of the ACM on Human-Computer Interaction 3</source>
          , CSCW
          (
          <year>2019</year>
          ),
          <fpage>1</fpage>
          -
          <lpage>24</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Kristy</given-names>
            <surname>Choi</surname>
          </string-name>
          , Curtis Hawthorne, Ian Simon, Monica Dinculescu, and
          <string-name>
            <given-names>Jesse</given-names>
            <surname>Engel</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Encoding musical style with transformer autoencoders</article-title>
          .
          <source>In International Conference on Machine Learning. PMLR</source>
          ,
          <year>1899</year>
          -
          <fpage>1908</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Krzysztof</given-names>
            <surname>Choromanski</surname>
          </string-name>
          , Valerii Likhosherstov, David Dohan,
          <string-name>
            <given-names>Xingyou</given-names>
            <surname>Song</surname>
          </string-name>
          , Andreea Gane, Tamas Sarlos,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Hawkins</surname>
          </string-name>
          , Jared Davis, Afroz Mohiuddin,
          <string-name>
            <given-names>Lukasz</given-names>
            <surname>Kaiser</surname>
          </string-name>
          , et al.
          <year>2020</year>
          .
          <article-title>Rethinking attention with performers</article-title>
          .
          <source>ICLR</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Sumanth</given-names>
            <surname>Dathathri</surname>
          </string-name>
          , Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yosinski</surname>
          </string-name>
          , and Rosanne Liu.
          <year>2020</year>
          .
          <article-title>Plug and Play Language Models: A Simple Approach to Controlled Text Generation</article-title>
          . ArXiv abs/
          <year>1912</year>
          .02164 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Prafulla</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          , Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, and
          <string-name>
            <given-names>Ilya</given-names>
            <surname>Sutskever</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Jukebox: A Generative Model for Music</article-title>
          . arXiv preprint arXiv:
          <year>2005</year>
          .
          <volume>00341</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Sander</surname>
            <given-names>Dieleman</given-names>
          </string-name>
          , Aaron van den Oord, and
          <string-name>
            <given-names>Karen</given-names>
            <surname>Simonyan</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>The challenge of realistic music generation: modelling raw audio at scale</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          . [22]
          <string-name>
            <given-names>Alexia</given-names>
            <surname>Jolicoeur-Martineau</surname>
          </string-name>
          .
          <year>2018</year>
          . The relativistic
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Vincent</surname>
            <given-names>Dumoulin</given-names>
          </string-name>
          , Ethan Perez,
          <article-title>Nathan Schucher, discriminator: a key element missing from standard Florian Strub</article-title>
          , Harm de Vries, Aaron Courville, GAN. arXiv preprint arXiv:
          <year>1807</year>
          .
          <volume>00734</volume>
          (
          <year>2018</year>
          ). and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Feature-wise transfor-</article-title>
          [23]
          <string-name>
            <given-names>Minguk</given-names>
            <surname>Kang</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jaesik</given-names>
            <surname>Park</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Contragan: mations</article-title>
          .
          <source>Distill</source>
          (
          <year>2018</year>
          ). https://doi.org/10.23915/
          <article-title>Contrastive learning for conditional image generadistill</article-title>
          .
          <volume>00011</volume>
          https://distill.pub/2018/feature-wise- tion. arXiv preprint arXiv:
          <year>2006</year>
          .
          <volume>12681</volume>
          (
          <year>2020</year>
          ). transformations. [24]
          <string-name>
            <surname>Lisa</surname>
            <given-names>Kawai</given-names>
          </string-name>
          , Philippe Esling, and Tatsuya Harada.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Ian</surname>
            <given-names>Goodfellow</given-names>
          </string-name>
          , Jean Pouget-Abadie, Mehdi Mirza,
          <year>2020</year>
          .
          <article-title>Attributes-aware deep music transformation</article-title>
          .
          <source>Bing Xu</source>
          , David Warde-Farley, Sherjil Ozair,
          <source>Aaron In Proceedings of the 21st international society for Courville, and Yoshua Bengio</source>
          .
          <year>2014</year>
          .
          <article-title>Generative music information retrieval conference, ismir</article-title>
          . adversarial nets.
          <source>Advances in neural information [25] Nitish Shirish Keskar</source>
          ,
          <string-name>
            <surname>Bryan</surname>
            <given-names>McCann</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lav R</surname>
          </string-name>
          . Varshprocessing systems
          <volume>27</volume>
          (
          <year>2014</year>
          ). ney, Caiming Xiong, and Richard Socher.
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Gaëtan</surname>
            <given-names>Hadjeres</given-names>
          </string-name>
          , François Pachet, and
          <string-name>
            <surname>Frank</surname>
            <given-names>CTRL</given-names>
          </string-name>
          :
          <string-name>
            <given-names>A Conditional</given-names>
            <surname>Transformer Language Model Nielsen</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>DeepBach: a Steerable Model for for Controllable Generation</article-title>
          . CoRR abs/
          <year>1909</year>
          .05858
          <string-name>
            <given-names>Bach</given-names>
            <surname>Chorales</surname>
          </string-name>
          <article-title>Generation</article-title>
          . In International Confer- (
          <year>2019</year>
          ). arXiv:
          <year>1909</year>
          .05858 http://arxiv.org/abs/1909. ence on
          <source>Machine Learning</source>
          .
          <fpage>1362</fpage>
          -
          <lpage>1371</lpage>
          .
          <fpage>05858</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Curtis</surname>
            <given-names>Hawthorne</given-names>
          </string-name>
          , Andriy Stasyuk, Adam Roberts, [
          <volume>26</volume>
          ]
          <string-name>
            <surname>Phuc H Le-Khac</surname>
            ,
            <given-names>Graham</given-names>
          </string-name>
          <string-name>
            <surname>Healy</surname>
          </string-name>
          , and
          <string-name>
            <surname>Alan F Ian Simon</surname>
          </string-name>
          , Cheng-Zhi Anna Huang,
          <source>Sander Diele- Smeaton</source>
          .
          <year>2020</year>
          .
          <article-title>Contrastive representation learning: man</article-title>
          , Erich Elsen, Jesse Engel, and
          <string-name>
            <given-names>Douglas</given-names>
            <surname>Eck</surname>
          </string-name>
          .
          <article-title>A framework and review</article-title>
          .
          <source>IEEE Access</source>
          (
          <year>2020</year>
          ).
          <year>2019</year>
          . Enabling Factorized Piano Music Modeling [27]
          <string-name>
            <surname>Brian</surname>
            <given-names>Lester</given-names>
          </string-name>
          , Rami Al-Rfou, and
          <string-name>
            <given-names>Noah</given-names>
            <surname>Constant</surname>
          </string-name>
          .
          <article-title>and Generation with the MAESTRO Dataset</article-title>
          . In In- 2021.
          <article-title>The Power of Scale for Parameter-Eficient ternational Conference on Learning Representations. Prompt Tuning</article-title>
          .
          <source>In Proceedings of the Conference on https://openreview.net/forum?id=r1lYRjC9F7 Empirical Methods in Natural Language Processing.</source>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Geofrey</surname>
            <given-names>E</given-names>
          </string-name>
          <string-name>
            <surname>Hinton</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>Training products of ex- Association for Computational Linguistics. perts by minimizing contrastive divergence</article-title>
          .
          <source>Neural [28] Xiang Lisa Li and Percy Liang</source>
          .
          <year>2021</year>
          .
          <article-title>Prefix-Tuning: computation 14</article-title>
          <issue>, 8</issue>
          (
          <year>2002</year>
          ),
          <fpage>1771</fpage>
          -
          <lpage>1800</lpage>
          .
          <article-title>Optimizing Continuous Prompts for Generation.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Wen-Yi</surname>
            <given-names>Hsiao</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jen-Yu</surname>
            <given-names>Liu</given-names>
          </string-name>
          , Yin-Cheng Yeh, and Yi- CoRR
          <source>abs/2101</source>
          .00190 (
          <year>2021</year>
          ). arXiv:
          <volume>2101</volume>
          .00190
          <string-name>
            <given-names>Hsuan</given-names>
            <surname>Yang</surname>
          </string-name>
          .
          <year>2021</year>
          . Compound Word Transformer: https://arxiv.org/abs/2101.00190 Learning to Compose Full-Song Music over Dy- [29]
          <string-name>
            <given-names>Feynman</given-names>
            <surname>Liang</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>BachBot: Automatic componamic Directed Hypergraphs</article-title>
          .
          <source>AAAI</source>
          (
          <year>2021</year>
          ).
          <article-title>sition in the style of Bach chorales</article-title>
          .
          <source>Masters thesis,</source>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Cheng-Zhi Anna</surname>
            <given-names>Huang</given-names>
          </string-name>
          , Tim Cooijmans, Adam University of Cambridge (
          <year>2016</year>
          ). Roberts, Aaron Courville, and
          <string-name>
            <given-names>Doug</given-names>
            <surname>Eck</surname>
          </string-name>
          .
          <year>2017</year>
          . [30]
          <string-name>
            <surname>Rui</surname>
            <given-names>Liu</given-names>
          </string-name>
          , Yixiao Ge, Ching Lam Choi,
          <article-title>Xiaogang Counterpoint by Convolution</article-title>
          .
          <source>In Proceedings of the Wang, and Hongsheng Li</source>
          .
          <year>2021</year>
          . Divco: Diverse International Conference on Music Information Re-
          <article-title>conditional image synthesis via contrastive gentrieval. erative adversarial network</article-title>
          .
          <source>In Proceedings of the</source>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Cheng-Zhi Anna</surname>
            <given-names>Huang</given-names>
          </string-name>
          , Curtis Hawthorne,
          <string-name>
            <surname>Adam</surname>
            <given-names>IEEE</given-names>
          </string-name>
          /CVF Conference on Computer Vision and PatRoberts, Monica Dinculescu, James Wexler,
          <source>Leon tern Recognition</source>
          .
          <volume>16377</volume>
          -
          <fpage>16386</fpage>
          . Hong, and
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Howcroft</surname>
          </string-name>
          .
          <year>2019</year>
          . The Bach Doodle: [31]
          <string-name>
            <surname>Antoine</surname>
            <given-names>Liutkus</given-names>
          </string-name>
          , Ondřej Cıfka,
          <string-name>
            <surname>Shih-Lun</surname>
            <given-names>Wu</given-names>
          </string-name>
          ,
          <article-title>Umut Approachable music composition with machine Simsekli</article-title>
          ,
          <string-name>
            <surname>Yi-Hsuan Yang</surname>
          </string-name>
          , and Gael Richard.
          <year>2021</year>
          .
          <article-title>learning at scale</article-title>
          .
          <source>ISMIR</source>
          (
          <year>2019</year>
          ).
          <article-title>Relative positional encoding for transformers with</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Cheng-Zhi Anna</surname>
            <given-names>Huang</given-names>
          </string-name>
          ,
          <article-title>Hendrik Vincent Koops, linear complexity</article-title>
          . In International Conference on Ed Newton-Rex,
          <article-title>Monica Dinculescu, and Carrie J Machine Learning</article-title>
          . PMLR,
          <fpage>7067</fpage>
          -
          <lpage>7079</lpage>
          . Cai.
          <year>2020</year>
          .
          <article-title>AI Song Contest: Human-AI co-</article-title>
          creation [32]
          <string-name>
            <surname>Ryan</surname>
            <given-names>Louie</given-names>
          </string-name>
          , Andy Coenen, Cheng Zhi Huang, in songwriting.
          <source>ISMIR</source>
          (
          <year>2020</year>
          ). Michael Terry, and
          <string-name>
            <given-names>Carrie J.</given-names>
            <surname>Cai</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Novice-AI</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Cheng-Zhi Anna</surname>
            <given-names>Huang</given-names>
          </string-name>
          , Ashish Vaswani,
          <string-name>
            <surname>Jakob Music</surname>
          </string-name>
          Co-Creation via
          <article-title>AI-Steering Tools for Deep Uszkoreit</article-title>
          , Ian Simon, Curtis Hawthorne,
          <source>Noam Generative Models. Conference on Human Factors Shazeer</source>
          , Andrew M Dai,
          <string-name>
            <given-names>Matthew D</given-names>
            <surname>Hofman</surname>
          </string-name>
          , Mon- in
          <source>Computing Systems (CHI)</source>
          (
          <year>2020</year>
          ). ica Dinculescu, and
          <string-name>
            <given-names>Douglas</given-names>
            <surname>Eck</surname>
          </string-name>
          .
          <year>2019</year>
          . Music [33]
          <string-name>
            <surname>Sageev</surname>
            <given-names>Oore</given-names>
          </string-name>
          , Ian Simon, Sander Dieleman, DouTransformer. In International Conference on Learn- glas
          <string-name>
            <surname>Eck</surname>
            , and
            <given-names>Karen</given-names>
          </string-name>
          <string-name>
            <surname>Simonyan</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>This time ing Representations. with feeling: Learning expressive musical perfor-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Natasha</surname>
            <given-names>Jaques</given-names>
          </string-name>
          , Shixiang Gu, Richard E. Turner, mance.
          <source>Neural Computing and Applications</source>
          <volume>32</volume>
          ,
          <article-title>4</article-title>
          and
          <string-name>
            <given-names>Douglas</given-names>
            <surname>Eck</surname>
          </string-name>
          .
          <year>2016</year>
          . Tuning Recurrent Neu- (
          <year>2020</year>
          ),
          <fpage>955</fpage>
          -
          <lpage>967</lpage>
          .
          <article-title>ral Networks with Reinforcement Learning</article-title>
          .
          <source>CoRR [34] François Pachet and Pierre Roy</source>
          .
          <year>2001</year>
          . Musical harabs/1611.02796 (
          <year>2016</year>
          ). arXiv:
          <volume>1611</volume>
          .02796 http:/
          <article-title>/ monization with constraints: A survey. Constraints arxiv</article-title>
          .
          <source>org/abs/1611.02796 6</source>
          ,
          <issue>1</issue>
          (
          <year>2001</year>
          ),
          <fpage>7</fpage>
          -
          <lpage>19</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>Taesung</given-names>
            <surname>Park</surname>
          </string-name>
          , Alexei A Efros, Richard Zhang, and Uszkoreit, Llion Jones,
          <string-name>
            <given-names>Aidan N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , Lukasz
          <string-name>
            <surname>Jun-Yan Zhu</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Contrastive learning for un- Kaiser, and</article-title>
          <string-name>
            <given-names>Illia</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Attention Is All paired image-to-image translation</article-title>
          .
          <source>In European You Need. CoRR</source>
          (
          <year>2017</year>
          ).
          <source>Conference on Computer Vision</source>
          . Springer,
          <fpage>319</fpage>
          -
          <lpage>345</lpage>
          . [49]
          <string-name>
            <surname>Ashish</surname>
            <given-names>Vaswani</given-names>
          </string-name>
          , Noam Shazeer, Niki Parmar, Jakob
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>Christine</given-names>
            <surname>Payne</surname>
          </string-name>
          .
          <year>2019</year>
          . MuseNet. https://openai. Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz com/blog/musenet. Accessed:
          <fpage>2020</fpage>
          -05-
          <lpage>04</lpage>
          . Kaiser, and
          <string-name>
            <given-names>Illia</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          .
          <year>2017</year>
          . Attention is all
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [37]
          <string-name>
            <surname>Ethan</surname>
            <given-names>Perez</given-names>
          </string-name>
          , Douwe Kiela, and Kyunghyun Cho.
          <article-title>you need</article-title>
          .
          <source>In Advances in neural information pro2021</source>
          .
          <article-title>True Few-Shot Learning with Language Mod- cessing systems</article-title>
          .
          <volume>5998</volume>
          -
          <fpage>6008</fpage>
          . els.
          <source>CoRR abs/2105</source>
          .11447 (
          <year>2021</year>
          ). arXiv:
          <volume>2105</volume>
          .11447 [50]
          <string-name>
            <surname>Ziyu</surname>
            <given-names>Wang</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dingsu</surname>
            <given-names>Wang</given-names>
          </string-name>
          , Yixiao Zhang, and Gus https://arxiv.org/abs/2105.11447 Xia.
          <year>2020</year>
          .
          <article-title>Learning interpretable representation for</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [38]
          <string-name>
            <surname>Senthil</surname>
            <given-names>Purushwalkam</given-names>
          </string-name>
          , Maximilian Nickel,
          <article-title>Abhi- controllable polyphonic music generation</article-title>
          .
          <source>ISMIR nav Gupta, and Marc'Aurelio Ranzato</source>
          .
          <year>2019</year>
          . Task- (
          <year>2020</year>
          ).
          <article-title>driven modular networks for zero-shot composi-</article-title>
          [51]
          <string-name>
            <given-names>Scott</given-names>
            <surname>Wilson</surname>
          </string-name>
          , David Cottle, and Nick Collins.
          <year>2011</year>
          .
          <article-title>tional learning</article-title>
          .
          <source>In Proceedings of the IEEE/CVF In- The SuperCollider Book. The MIT Press. ternational Conference on Computer Vision</source>
          . 3593- [52]
          <string-name>
            <surname>Shih-Lun Wu</surname>
          </string-name>
          and
          <string-name>
            <surname>Yi-Hsuan Yang</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>MuseMor3602. phose: Full-Song and</article-title>
          <string-name>
            <surname>Fine-Grained Music</surname>
          </string-name>
          Style
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [39]
          <string-name>
            <surname>Adam</surname>
            <given-names>Roberts</given-names>
          </string-name>
          , Jesse Engel, Colin Rafel,
          <article-title>Curtis Transfer with Just One Transformer VAE</article-title>
          . arXiv Hawthorne, and
          <string-name>
            <given-names>Douglas</given-names>
            <surname>Eck</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>A hierarchical preprint</article-title>
          arXiv:
          <volume>2105</volume>
          .04090 (
          <year>2021</year>
          ).
          <article-title>latent vector model for learning long-term structure</article-title>
          [53]
          <string-name>
            <surname>Jefrey</surname>
            <given-names>O.</given-names>
          </string-name>
          <string-name>
            <surname>Zhang</surname>
          </string-name>
          , Alexander Sax,
          <article-title>Amir Roshan in music</article-title>
          .
          <source>ICML</source>
          (
          <year>2018</year>
          ). Zamir,
          <string-name>
            <given-names>Leonidas J.</given-names>
            <surname>Guibas</surname>
          </string-name>
          , and Jitendra Malik.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>Alex</given-names>
            <surname>Sherstinsky</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Fundamentals of Recurrent 2019</article-title>
          .
          <article-title>Side-Tuning: Network Adaptation via AdNeural Network (RNN) and Long Short-Term Mem- ditive Side Networks</article-title>
          . CoRR abs/
          <year>1912</year>
          .13503 (
          <year>2019</year>
          ).
          <article-title>ory (LSTM) Network</article-title>
          . CoRR abs/
          <year>1808</year>
          .03314 (
          <year>2018</year>
          ). arXiv:
          <year>1912</year>
          .13503 http://arxiv.org/abs/
          <year>1912</year>
          .13503 arXiv:
          <year>1808</year>
          .03314 http://arxiv.org/abs/
          <year>1808</year>
          .03314
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [41]
          <string-name>
            <surname>Ian</surname>
            <given-names>Simon</given-names>
          </string-name>
          , Dan Morris, and
          <string-name>
            <given-names>Sumit</given-names>
            <surname>Basu</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>MySong: automatic accompaniment generation for A. Musical Features vocal melodies</article-title>
          .
          <source>In Proceedings of the SIGCHI conference on human factors in computing systems. ACM</source>
          .
          <article-title>Note that while some features may appear to be opposites</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [42]
          <string-name>
            <surname>Ian</surname>
            <given-names>Simon</given-names>
          </string-name>
          , Adam Roberts, Colin Rafel, Jesse Engel, (e.g.,
          <article-title>“loud" vs “soft"), and while it is true that they are Curtis Hawthorne, and</article-title>
          <string-name>
            <given-names>Douglas</given-names>
            <surname>Eck</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Learn- mutually exclusive, in fact it is possible for a sequence to ing a latent space of multitrack measures. arXiv satisfy neither (e</article-title>
          .g.,
          <article-title>if it's in a middle dynamic level)</article-title>
          . preprint arXiv:
          <year>1806</year>
          .
          <volume>00195</volume>
          (
          <year>2018</year>
          ). Absolute features:
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>P.</given-names>
            <surname>Smolensky</surname>
          </string-name>
          .
          <source>1986. Information Processing in Dynamical Systems: Foundations of Harmony Theory</source>
          .
          <volume>1</volume>
          .
          <string-name>
            <surname>“Loud</surname>
          </string-name>
          "
          <article-title>- minimum velocity is greater than 60</article-title>
          MIT Press, Cambridge, MA, USA,
          <fpage>194</fpage>
          -
          <lpage>281</lpage>
          . 2.
          <string-name>
            <surname>“Soft</surname>
          </string-name>
          "
          <article-title>- maximum velocity is less than 60</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [44]
          <string-name>
            <surname>Bob</surname>
            <given-names>L Sturm</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oded</surname>
          </string-name>
          Ben-Tal,
          <source>Una Monaghan, Nick</source>
          <volume>3</volume>
          .
          <string-name>
            <surname>“Has Dynamic Contrast" - Extreme Dynamic</surname>
            <given-names>ConCollins</given-names>
          </string-name>
          , Dorien Herremans, Elaine Chew,
          <article-title>Gaëtan trast" - The sequence has two notes whose velocHadjeres, Emmanuel Deruty, and François Pachet. ities difer by more than 30 2019</article-title>
          .
          <article-title>Machine learning research that matters for 4. “Extreme Dynamic Contrast" - The sequence has music creation: A case study</article-title>
          .
          <source>Journal of New Music two notes whose velocities difer by more than Research</source>
          <volume>48</volume>
          ,
          <issue>1</issue>
          (
          <year>2019</year>
          ).
          <fpage>70</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [45]
          <string-name>
            <surname>Hao</surname>
            <given-names>Hao</given-names>
          </string-name>
          Tan and
          <string-name>
            <given-names>Dorien</given-names>
            <surname>Herremans</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Music 5. “All consonances" - the sequence has no dissofadernets: Controllable music generation based on nances (simultaneous notes with an absolute difhigh-level features via low-level feature modelling</article-title>
          .
          <source>ference modulo 12 of 1</source>
          ,
          <issue>2</issue>
          ,
          <issue>10</issue>
          , 11, or 6)
          <string-name>
            <surname>ISMIR</surname>
          </string-name>
          (
          <year>2020</year>
          ).
          <article-title>6. “Long sharp dissonance" - the sequence has</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [46]
          <string-name>
            <surname>Aaron</surname>
            <given-names>van den Oord</given-names>
          </string-name>
          , Sander Dieleman,
          <article-title>Heiga a sharp dissonance (simultaneous notes being Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, played with a diference of 1 or 11) that lasts for Nal Kalchbrenner, Andrew Senior, and Koray a significant amount of time Kavukcuoglu</article-title>
          .
          <year>2016</year>
          .
          <article-title>WaveNet: A Generative Model 7. “Only melody" - only a single note is playing at a for Raw Audio</article-title>
          .
          <source>arXiv preprint arXiv:1609.03499 time</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [47]
          <string-name>
            <surname>Aaron</surname>
            <given-names>van den Oord</given-names>
          </string-name>
          ,
          <source>Oriol Vinyals, and Koray 8. “Few onsets" - when grouping notes according to Kavukcuoglu</source>
          .
          <year>2017</year>
          .
          <article-title>Neural discrete representation attack time, there are few groups per second learning</article-title>
          .
          <source>NeurIPS</source>
          (
          <year>2017</year>
          ).
          <article-title>9. “Many onsets" - when grouping notes according</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [48]
          <string-name>
            <surname>Ashish</surname>
            <given-names>Vaswani</given-names>
          </string-name>
          , Noam Shazeer, Niki Parmar,
          <article-title>Jakob to attack time, there are many groups per second</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          10. “
          <article-title>Blocks of two" - there are groups of two notes being played simultaneously</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          11. “
          <article-title>Larger blocks" - there are blocks of three or more notes being played simultaneously</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          12. “
          <article-title>Within single key" - all notes fit within a single major scale</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
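        <p>For concreteness, the sketch below shows how a few of these absolute features could be checked over a note sequence. It is a minimal illustration assuming a simple MIDI-like note representation; the Note class and the function names are hypothetical, not the paper's actual feature-extraction code.</p>
        <preformat>
from dataclasses import dataclass
from itertools import combinations

# Hypothetical MIDI-like note representation (illustrative only).
@dataclass
class Note:
    pitch: int     # MIDI pitch number (60 = middle C)
    velocity: int  # MIDI velocity, 0-127
    start: float   # onset time in seconds
    end: float     # offset time in seconds

def sounding_together(a, b):
    # Two notes overlap in time if each starts before the other ends.
    return b.end > a.start and a.end > b.start

def is_loud(notes):
    # "Loud": minimum velocity is greater than 60.
    return min(n.velocity for n in notes) > 60

def is_soft(notes):
    # "Soft": maximum velocity is less than 60, i.e. no note reaches 60.
    return not any(n.velocity >= 60 for n in notes)

def has_dynamic_contrast(notes, threshold=30):
    # "Has Dynamic Contrast": two notes whose velocities differ by more
    # than the threshold (30 here; 70 for the "extreme" variant).
    velocities = [n.velocity for n in notes]
    return max(velocities) - min(velocities) > threshold

def all_consonances(notes):
    # "All consonances": no simultaneous pair of notes whose absolute
    # pitch difference modulo 12 is 1, 2, 6, 10, or 11.
    dissonant = {1, 2, 6, 10, 11}
    return not any(
        sounding_together(a, b) and abs(a.pitch - b.pitch) % 12 in dissonant
        for a, b in combinations(notes, 2)
    )

# Example: a loud perfect fifth held for one second is consonant.
phrase = [Note(60, 85, 0.0, 1.0), Note(67, 92, 0.0, 1.0)]
assert is_loud(phrase) and not is_soft(phrase) and all_consonances(phrase)
        </preformat>
      </app>
    </app-group>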
  </back>
</article>