<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Compositional Steering of Music Transformers</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Halley Young</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vincent Dumoulin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pablo S. Castro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jesse Engel</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cheng-Zhi Anna Huang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Google Brain</institution>
          ,
          <country country="CA">Canada</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Google Brain</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Pennsylvania</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <abstract>
        <p>Musical composition is a combinatorial art where composers extend sequences by choosing from a vast set of possible feature combinations that yield the compositions their distinctive qualities. Increasingly, composers are using generative models, such as music transformers, for crafting their pieces. Unfortunately, for composers to “steer” these models to satisfy their qualitative features typically requires retraining (which can be prohibitively expensive); further, existing models are unable to deal with arbitrary combinations of features at scale. In this paper we build on lightweight fine-tuning methods, such as prefix tuning and bias tuning, to propose a novel contrastive loss that enables us to steer music transformers over arbitrary combinations of logical features, with a relatively small number of extra parameters. We provide both quantitative and qualitative evaluations of our method which demonstrate its eficacy with respect to existing methods, as well as a case-study where our method was used to compose long-form musical pieces. Musical examples are available for listening online. 1 they would like to control, such as speed, dynamics, harmony, or texture, each of which can be decomposed into Musicians and novices are increasingly experimenting multiple sub-features. Furthermore, the relationship bewith generative models in music making [44, 19, 18]. Yet tween these features and the output is more difuse than generative models are often not trained to support human in the NLP settings: while in examples such as [25], inobjectives (such as controllability) but to maximize proxy dividual words indicate the diferent conditions being metrics that are easy to automate (such as likelihood of satisfied, in music the entire sequence of tokens are used data according to a model), making it dificult for users in evaluating a single feature (for instance, average pitch to steer these models towards expressing users' musical over a span of time depends on all tokens representing intentions [44, 19, 32]. In the language domain, recent that span). This makes compositionality even more chalresearch in making generative models (such as large lan- lenging than in the text generation setting. On the other guage models) more controllable and useful in real-world hand, music is a sequential domain where there are clear applications have focused on ways of adapting uncondi- logical features which can be examined, such as average tioned language models to perform well on conditioned dynamics or number of notes per second. This further motasks they were not initially trained on. Methods such as tivates using music as a test-bed for highly compositional ifne-tuning, side-tuning, bias-tuning, and prompt or pre- tasks involving simultaneously steering an autoregresifx tuning have emerged as lead candidates for such tasks sive model towards several particular desired attributes such as steering the sentiment of or words mentioned according to diferent classifiers. in a sentence [2, 28, 27, 8]. However, such tasks are nor- As a more immediate motivation, consider the followmally presented in a non- or minimally-compositional ing scenario: a composer wants to sample from the preapproach (possibly controlling for two orthogonal vari- trained Music Transformer [20] to complete a musical ables, such as sentiment and topic [25], but rarely more phrase. In addition to the overall musical quality of the than that). continuation, they want it to stay in key and switch to usIn contrast, in the domain of music generation, se- ing block chords (i.e. a few notes played together at once). quence continuation is inherently a highly compositional Repeatedly sampling continuations from the model and problem: the user likely has many aspects of the output cherry-picking a good sample (rejection sampling) would be very labor-intensive, but assuming they can compute binary features for “stays in key” and “uses block chords”, the composer could sample a large number of continuations and cherry-pick from the smaller set of continuations which exhibit all features (thereby delegating</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;creativity</kwd>
        <kwd>co-creativity</kwd>
        <kwd>human-AI co-creation</kwd>
        <kwd>music generation</kwd>
        <kwd>controllable generative models</kwd>
        <kwd>compositionality</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>Prime</title>
      </sec>
      <sec id="sec-1-2">
        <title>Original continuation</title>
        <p>Only one
block chord
“Uses block</p>
        <p>chords”
Steerable
Transformer</p>
        <p>Many block chords!
part of the accept/reject step). While significantly less form structure, by chaining together chunks steered in
labor-intensive, this solution could potentially require diferent directions (using diferent combinations of
feasampling enormous amounts of continuations if, for in- tures), while maintaining long-term coherence (by
leverstance, the requested features significantly difer from aging the transformers’ full self-attention receptive field).
those of the priming sequence (such that it and the de- Listen to examples on this page 1 for longer steered
exsired continuation form an unlikely sequence according amples, and compositions semi-automatically generated
to the generative model). In other words, the pre-trained using our feature tuning approach in a musician-directed
model could be a poor proposal distribution for some way.
applications.</p>
        <p>
          On the ML side of this work, we are interested in
developing better proposal distributions through an adaptation 2. Problem formulation
approach which can steer the generative model towards
continuations which i) are significantly more likely to Music Transformer is an autoregressive language model
exhibit the requested feature, and ii) exhibit a satisfactory which decomposes the joint probability of a sequence
musical quality. The approach should be able to accom- of tokens 1, . . . ,  (where  ∈ , and  is a set of
modate a large number of features without adding signif- categorical tokens) into
icant memory or computation overhead. We achieve this
by making features composable, making it possible to 
steer features independently and also multiple features at (x = 1, . . . ,  ) = (1) ∏︁ ( | 1, . . . , − 1).
once. Figure 1 shows that when using a pre-trained trans- =2
former model augmented with a relatively small number (1)
of additional parameters, we are able to steer towards ar- It leverages a common modeling approach which
reprebitrary logical music features and achieve realistic music sents the conditional probabilities ( | 1, . . . , − 1)
generation simply by sampling directly from the model. using a neural network [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. As its name implies,
MuIn contrast, the same unconditional transformer model sic Transformer uses a Transformer network
architecfails to produce any examples satisfying those features ture [
          <xref ref-type="bibr" rid="ref35">48</xref>
          ]. Each token  is first mapped to a real-valued
even when generating a hundred samples. embedding e (for instance using a lookup table), then
        </p>
        <p>Our approach can be used to support human-AI co- the network maps each sequence e1, . . . , e− 1 to a
probcreation, where musicians can compose on the level of ability distribution for the value of  over the elements
musical “features” as opposed to notes. Musicians can of .
prototype the high-level “shape” of the music by specify- Sequence continuation in an autoregressive language
ing how various features change throughout the piece, model works by repeatedly sampling from its
distribuand in turn steer and curate music transformers to cre- tion over the next token given the previous tokens.
Startatively fill in the details. With our method, a composer 1Listen to musical examples at https://storage.googleapis.com/
can control both the short-chunk features, and the long- composing-features/index.html
ing from some priming sequence x = (0, . . . ,  ),
1. Positively: given a prime–continuation pair
we first sample
+1 ∼</p>
        <p>(· |
+2 ∼
end of the continued sequence x = (+1, . . . ,  ). 2</p>
        <p>(· | 0, . . . ,  , +1), and so on, until the
Many downstream tasks can be cast as sequence
continuation problems, including the steerable music generation</p>
        <p>0, . . . ,  ), then
problem investigated in this work.</p>
        <p>We are given a set of features Φ =


continuation pair exhibits that feature (which we note as</p>
        <p>→ {0, 1}. Each  takes the value 1 if a
prime(x, x) |=  ), and 0 otherwise. Note that the features
must take both prime and continuation sequences as
input, since some continuation features may be relative
to the priming sequence (e.g. “significantly higher pitch”).</p>
        <p>Our true objective with respect to feature  is to steer
the model towards a distribution which maximizes</p>
        <p>{ }=1,  ∈
Ex|x ︀[ (x, x) |=  ]︀
(x, x) |=  , we find an adaptation   of the
model’s parameters  that minimizes ✓ (we use
the symbol ✓ to denote the fact that  is
computed using the correct parameters   ). By using
prime–continuation examples that sound musical,
we ensure that the steered model stays musically
grounded.</p>
        <sec id="sec-1-2-1">
          <title>2. Negatively: we can also take advantage of other</title>
          <p>features  for which (x, x) ̸|= , by
maximizing × (we use the symbol ×
that  is computed using the incorrect parameters</p>
          <p>to denote the fact
 ).</p>
          <p>The positive case corresponds to maximum-likelihood
training. Additionally, we can exploit the intuition that
the adapted parameters   should “explain” the prime–
(2)
continuation pair (x, x) |</p>
          <p>=  better than   (for
some feature  for which (x, x) ̸|= ) or  (the
nonadapted model parameters, with a corresponding loss
while maintaining musicality.</p>
          <p>This objective is
sider the problem of composed features Φ ˆ , i.e.</p>
          <p>In addition to the single-feature problem, we also con- loss of the form
non-diferentiable because
diferentiable satisfiability criterion.</p>
          <p>(x, x) |=  is a non- ∅). In other words, we can maximize the probability of
choosing   over   and  by minimizing a contrastive
(x, x) |= Φ ˆ
≡
(x, x) |=  ,</p>
          <p>(3)
⋀︁
 ∈Φ^ ⊆ Φ
to account for scenarios where a user is interested in
steering the model towards multiple features (such as
in the “stays in key” and “uses block chords” scenario
discussed in the introduction).</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Proposed approach</title>
      <p>
        We start by describing our proposed approach in the
single-feature case and later on explain how we adapt it
to the compositional case.
3.1. Likelihood-based training
While approaches using reinforcement learning—such
as KL-regularized deep Q-learning [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]—could be used
to overcome the non-diferentiability problem, in this
work we consider a proxy loss in the form of the negative
log-likelihood
 = − log  (x | x),
(4)
which we use in two ways:
2To simplify the discussion, we assume a fixed sequence length
 , but the explanation applies to sequences of varying lengths as
well.
− log
︂(
      </p>
      <p>(x | x)
  (x | x) +   (x | x) +  (x | x)
= − log
︂(</p>
      <p>− ✓
− ✓ + − × + − ∅</p>
      <p>We propose a loss that interpolates between Equations
4 and 5 using an  coeficient (which is treated as a
hyperparameter):
ℒ =  · ✓ − (1−  )· log
︂(</p>
      <p>− ✓
− ✓ + − × + − ∅
︂)</p>
      <p>Intuitively, the maximum-likelihood setting should
sufice to achieve our adaptation goals, but in practice
we find that the approach benefits from the inclusion of
negative cases through a contrastive loss term. We tried
diferent  values and found 
= 0.8 to work well in
practice. See Figure 2 for an illustration of the training
setup.
3.2. Feature-conditional adaptation
Fine-tuning all model parameters can be prohibitive if the
number  of features is large (let alone combinatorially
large in the compositional case); however, recent work
provides efective and lightweight alternatives:
1. Prefix tuning</p>
      <p>[28] works by prepending
learnable task embeddings e−  , . . . , e− 1 to the
priming sequence embeddings e1, . . . , e . The loss</p>
      <p>Maestro dataset
Prime</p>
      <p>Continuation</p>
      <p>Steerable</p>
      <p>Transformer
Unconditional</p>
      <p>Transformer</p>
      <p>In this section we describe our experimental setup: the
musical features we want to enable users to control, the
procedure for setting up the training data for adapting
the feature-conditional parameters, and the details of the
prefix tuning and bias tuning setup.</p>
      <sec id="sec-2-1">
        <title>In practice, while prefix tuning showed promise in</title>
        <p>the single-feature setting, we were unable to make it
work in the compositional setting. We therefore focus
our investigation on bias-tuning for the compositional
domain.</p>
        <p>In the compositional setting, a naive approach requires
learning 2|Φ| − 1 model adaptations. Instead, we propose
to express the adaptation for a composed feature Φ ˆ as the
combination of the   of its underlying features  ∈ Φ ˆ .
More specifically, for bias-tuning we average the adapted
biases as
  =
1
ˆ
|Φ |  ∈Φ^
∑︁  , .</p>
        <p>(7)</p>
        <p>Note that we do not simply use the above heuristic
to compose feature adaptations post-hoc; we train the
model in a compositional setting (by sampling prime–
continuation pairs (x, x) |= Φ ˆ for various composed
features) so that the single-feature adaptations can learn
to work well in conjunction with each other. See Figure 3</p>
        <p>Musical features To support users in controlling
different aspects of music, such as harmony, texture, speed,
and dynamics, we implemented eighteen features (see
Appendix A for a complete list).</p>
        <p>We included both absolute and relative features.
Absolute features apply only to the continuation, while
relative features describe the relationship between the
prime and the continuation, such as the average pitch
in the continuation being significantly higher than that
in the prime or the rhythmic density in the continuation
being significantly lower than that in the prime. Note
that this type of model should work with any feature
function which takes in sequences and returns a Boolean.
In this paper, we chose examples that have clear musical
meanings and are easy to implement.</p>
        <p>Dataset features Before adapting model parameters
to specific binary musical features, we may first need
to “fine-tune” a model to better fit the general musical
style of a given dataset (in order to support the scenario
where a user brings their own dataset). We refer to these
dataset-level features as .
Input
Concat/Average
MLP</p>
        <p>MLP
Pre x
tuning</p>
        <p>I
n
n
e
r
l
a
y
e
r
+
Average</p>
        <p>I
n
n
e
r
l
a
y
e
r
+
Average
Bias vector
Bias vector
MLP</p>
        <p>MLP</p>
        <p>MLP</p>
        <p>
          MLP
Bias tuning
Output
4.2. Training data and setup
The unconditioned model we adopt as a base for
adaptation is a pretrained music transformer 3 trained on
transcribed YouTube piano performances (where the music
was typically more melodic). To mimic the user bringing
their own dataset with a distinctive style, we use the
Maestro dataset [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], an open-source collection of virtuoso
performances of classical piano music.
        </p>
        <p>To prepare the prime–continuation pairs needed for
likelihood based training, we serialize the Maestro piano
performances into event-based encoding [33] (the same
representation that the pretrained music transformer was
trained on) and then take random crops of length 200
tokens (which lasts approximately 4 to 20 seconds long).
We then split each crop in half, resulting in a prime x
and continuation x pair that is each 100 token long.</p>
        <p>To prepare training sets for feature-specific
adaptations, for each feature  ∈ Φ we take the subset (with
replacement) of all the pairs that exhibit the feature (i.e.
(x, x) |=  ). Note that since all prime–continuation
pairs are drawn from the Maestro dataset, we implicitly
assume that (x, x) |=  always holds.</p>
        <p>Single-feature setting For the non-contrastive
setting, for each feature  we minimize the loss introduced
in Equation 4 to adapt its parameters   to better fit the
set of prime–continuation pairs where (x, x) |=  .</p>
      </sec>
      <sec id="sec-2-2">
        <title>3See blog post "Generating Piano Music with Transformer"</title>
        <p>https://magenta.tensorflow.org/piano-transformer.</p>
        <p>In the contrastive setting, we minimize the
contrastive loss introduced in Equation 6. For each prime–
continuation pair (x, x) |=  used to compute ✓,
we also need to select a “negative” case for
computing × . We achieve this by selecting another feature
 at random, and a prime–continuation pair where both
(x, x) |=  and (x, x) ̸|=  holds.</p>
        <p>Compositional setting Training with a contrastive
loss yielded more efective models in the single-feature
setting (see section 5 for details), hence we adopt the
contrastive loss for all of the subsequent experiments in
the compositional setting.</p>
        <p>
          To prepare training examples for steering any
number of features at once, we first extend Φ so that each
feature has a corresponding “negated” feature (e.g. “has
both loud and soft pitches” vs “no contrast in dynamics”),
bringing the total number of features from |Φ | = 18 to
2 · | Φ | = 36. By definition, each prime–continuation pair
exhibits exactly 18 features. When sampling a prime–
continuation pair (x, x), we compute ✓ with respect
to a random subset of those 18 features (with size drawn
at random from  [
          <xref ref-type="bibr" rid="ref1 ref12 ref38">1, 12</xref>
          ]). We compute × by negating
some of the sampled features.
4.3. Evaluation metrics
During training time, we minimize a likelihood-based
proxy loss to fit the prime–continuations pairs as well as
possible, either non-contrastively or contrastively.
However, there is no guarantee that a model trained this way
with a lower loss would be more efective in being steered
to produce requested musical features. Hence, we
propose to evaluate our “downstream” true objective using
the following two axes:
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>In the section below, we define how we evaluate sampling eficacy and explain how we break down reporting results according to the “inherent” dificulty in steering.</title>
        <p>4.3.1. Sampling eficacy
(x, x) ̸|=  .</p>
        <p>At test time, what we care about is when a user requests
a feature  , regardless of what the prime x is, what
is the probability that a tuned model can achieve it (i.e.
generate continuations x^ that exhibit feature  ).</p>
        <p>Hence, we propose the following sampling eficacy (i.e.
probability of achievement) as our true evaluation metric
to quantify on average using rejection sampling, what
proportion of a model’s generated continuations x^
exhibit feature</p>
      </sec>
      <sec id="sec-2-4">
        <title>The unconditioned music transformer model allows</title>
        <p>us to establish a baseline for “inherently” how dificult
or easy it is to achieve a certain feature by priming the
model and measuring how often its generated
continuation carries that feature.</p>
        <p>Figure 4 shows for each feature (blue dot) the
probability of achieving it when the unconditioned music
transformer was primed with cofactual primes (x-axis) versus</p>
        <p>Ex Ex^|x ︀[ (x, x^) |=  ]︀ (8) counterfactual primes (y-axis). Intuitively, we expect it
where the first expectation is under the data distribu- to be more dificult for a prime that is counterfactual
tion (approximated by sampling primes x from a given to a feature to produce continuations with that feature,
dataset), and the second is under the model being evalu- indeed this is what we observe, i.e. in the figure all the
ated (approximated by conditioning on a prime x, and blue dots are below the diagonal line. Given this
consisthen using the model to generate continuations x^). tency, whenever we report results on sampling eficacy
(probability of achievement) in the paper, we breakdown
Steering dificulty Intuitively, given a prime x, cer- our results to compare how methods fare on the hard
tain musical features would follow more musically than (counterfactual) versus the easier (cofactual) cases.
others. That is, the prime can impact how dificult it is to
steer a model to generate continuations x^ that exhibit a 4.4. Implementation
certain feature  , hence afecting the sampling eficacy
of a model on that feature. Single-feature setting We use a standard prefix
tun</p>
        <p>To address this “confounding factor”, we propose to ing model as per [28], associating every feature  with
breakdown our analysis of sampling eficacy based on if a a prefix  . As in their paper, we used the finding that
feature musically follows a prime or not. To approximate a relatively low-dimensional embedding space of
feathis, we check in the dataset (that a model was trained ture prefixes (200 dimensions) projected to a
higherdimensional embedding (2048 dimensions) using a shared 5.1. Likelihood-based training dynamics
MLP across feature prefixes worked well. The pre-trained
transformer model had no parameters modified, but was Intuitively, the maximum-likelihood setting should
sufchanged to accept these 2048-dimensional vectors as ifce to achieve our adaptation goals, but in practice we
four 512-dimensional vectors prepended before the 512- ifnd that the approach benefits from the inclusion of
negdimensional vectors representing the input tokens (as ative cases through a contrastive loss term (introduced
the transformer’s latent embedding was 512 dimensions). in Equation 6), as demonstrated in Figure 5. The positive
The resulting sequence of vectors were masked using case (blue line) corresponds to the maximum-likelihood
causal attention in order to predict x. term, while the negative case (orange line) corresponds
to the contrastive loss term. The visually “diverging”
Compositional setting The loss used in the composi- training curves in the contrastive setting show that it has
tional setting is almost identical to Equation 6, the main learned to “explain” features using very diferent
paramediference being in the contrastive loss implementation. ters. That is, when evaluating a prime–continuation pair
The features used to compute × are obtained by negating (x, x) with parameters   adapted for feature  that
some of the features used to compute ✓, and the fraction the pair does not exhibit (x, x) ̸|= , the likelihood
 of negated features is used to modulate the interpola- loss becomes very high.
tion coeficient as  ′ =  − min(,  ). The intuition is
that the diferences in steering should be more noticeable 5.2. Sampling eficacy
when comparing feature sets with less overlap.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. The single-feature setting</title>
      <p>We first examine the single-feature setting, before going
into the compositional setting. For the experiments in
this section, we focus on investigating if our proposed
contrastive loss (introduced in Equation 6) is beneficial.</p>
      <p>We find that indeed it is, that training with a contrastive
loss not only enables learning of parameters that better
“explain” musical features in a more “discriminate”
fashion, but also results in a more steerable model, i.e. when
conditioned on a musical feature, more likely to generate
samples that exhibit that feature.</p>
      <p>In this section, we compare how diferent methods
perform “downstream” on the objective we care about, that
is when the user brings a prime and a desired feature,
how likely is a model able to generate a continuation
that exhibits that feature. The sampling eficacy metric
(defined in in subsection 6.1) gives on average a method’s
probability in achieving requested features.</p>
      <p>Procedure In the following experiments, to compute a
method’s sampling eficacy on a feature, we take 512
random primes from the feature’s validation set, and for each
prime generate 10 continuations, and then check what
proportion of the continuations satisfy the requested
feature. We then average among all features to obtain the
average sampling eficacy of a method. This is the
cofactual case. While for the counterfactual case, instead</p>
      <p>Prefix tuning
Cofactual
Counterfactual</p>
      <p>Contrastive
Results In Table 1, we see that prefixing tuning (es- 0.6
pecially with a contrastive loss) significantly increases y
sampling eficacy over the unconditioned model, both ca
in the easier cofactual cases, the harder counterfactual iffc0.4
cases. All of the improvements are of a large margin, for E
example in the hard counterfactual cases, prefix tuning 0.2 Unconditional
with a contrastive loss increases the probabiliy of achiev- Non-contrastive
ing a feature by 70%, and within prefix tuning, the gain Contrastive
from switching from a non-contrastive loss to contrastive 0.0 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7
loss is 18%. Difficulty</p>
      <p>As sampling eficacy can vary with prime and feature
requested, in Figure 6 we enumerate how diferent meth- Figure 6: Prefix tuning a model (especially with a contrastive
ods perform for each of the musical features (18 total). lsoinssg)leinfceraetausreess.itTshsea
mcoplloinregdefidcoatcsyth(ayt-laixnies)ufporvemrtoisctaollfytchoerr1e8We see prefix tuning (especially with a contrastive loss) spond to the same feature. Each feature’s steering dificulty
yields a higher sampling eficacy (y-axis) for most fea- (x-axis) is quantified as the sampling eficacy of the
unconditures. As contrastive loss yielded more efective models tioned model, which gives a baseline of how dificult or easy
in the single-feature setting, we adopt the contrastive it is to sample that given feature. Most of the green
(nonloss for all of the subsequent experiments in the compo- contrastive) and orange (contrastive) dots are above the
diagsitional setting. onal line, meaning prefix tuning increases the probability of
achieving them. The vertical dotted orange lines show the
improvement in the contrastive setting over the non-contrastive
6. The compositional setting setting.
In the following experiments, we first illustrate how
dififcult compositional steering is under an unconditioned
model. Then, we show quantitatively how our
biastuning method can enable compositional steering with
much higher sampling eficacy, and also qualitatively
how it achieves this while also improving on the musical
quality of the steered examples (compared to if we were
generating from the “tail” of the unconditioned model).
6.1. Sampling eficacy
Unconditioned model Intuitively, we expect it would
be inefective to rely on rejection sampling (on an
unconditioned model) to give us samples that include a specific
large number of features. Under a naive assumption of
independence among features, one would expect that the
eficacy of rejection sampling in achieving a total of 
features would roughly equal , where  is the probability
of a feature being satisfied.</p>
      <p>While the above assumption is naive, we do find that
in the unconditioned setting, the sampling eficacy of the
unconditioned model drops substantially as the number
of requested features increases. For example, when the
number of requested cofactual features is 6, only 0.2% of
the time were all features achieved after rejection
sampling (see the unconditional line in Figure 7).</p>
      <sec id="sec-3-1">
        <title>Bias tuning With bias tuning, sampling eficacy is</title>
        <p>more efective overall and less brittle with respect to the
number of features. We see in Figure 7 4 that even though
sampling eficacy decreases for all methods when more</p>
      </sec>
      <sec id="sec-3-2">
        <title>4Note that the single-feature steering eficacy presented in the</title>
        <p>compositional setting (Figure 7) is much higher than that in Table 1</p>
        <p>Difficulty vs. Efficacy</p>
        <p>Cofactu</p>
        <p>M
U
Bi
Pr</p>
        <p>Pr
1</p>
        <p>2 3 4
Number of steere
lty vs. Efficacy</p>
        <p>Cofactual</p>
        <p>Random
features are requested as expected, bias-tuning remained rejection sampling on the unconditional model.
a viable steering approach even in the 6-feature setting,
with a 7.9% probability of achievement versus only 0.2% Listening study setup The listening test consists of
for the unconditioned model. This meant for the user, questions posed as pairwise comparisons of musical
exusing bias-tuning on average 13 samples is needed to ar- amples, where both examples started with the same prime
rive at one that exhibits their requested 6 features, while as the first half of the example, while one’s continuation
for the unconditioned model it would take 500 samples, was generated by the unconditional model, while another
making the latter an unfeasible approach. was generated with our bias-tuned model. The ordering</p>
        <p>Comparing the other methods, bias tuning to the Mae- within the pairs were randomized. Listeners were asked
stro dataset (without tuning for specific features) im- to rate which one they thought was more musical. To
proves over the unconditioned model slightly (with larger prepare the samples, we randomly picked 120 primes
gains in the 2 to 3 feature case in the cofactual setting), from the test set. For steering each prime, we randomly
and as expected underperforms bias tuning which is chose a three-feature set that was present in the dataset
tuned to also condition on specific features. Bias tuning as the conditioning features.
outperforms prefix tuning consistently in the
compositional setting.
6.2. Musical quality
The above experiments evaluate the efectiveness of
methods in achieving the requested features. To evaluate
the musical quality of these “successful” generations, we
conducted a listening study with musicians to compare
the steering of three features between our overall best
adaptation approach (i.e. bias tuning) and the baseline of
for the single-feature setting. This is due to diference in the
procedure we used in the two settings. In the compositional setting,
we sampled features according to how they were distributed in the
datasets, while in the single-feature setting we sampled features
uniformly at random. Hence, in the compositional setting, features that
appear often in the dataset (perhaps typical of the dataset’s musical
style) is sampled more often. The sampling eficacy is much higher
possibly because these feature are also easier to achieve.</p>
        <p>Results We asked eight musicians to each rate fifteen
pairs. The results show that our bias-tuning approach
was much preferred over the unconditional, and the
results were statistically significant (  &lt; 0.0003). Bias
tuning won 63 of the pairwise comparisons, tied for 26
pairs, and lost for 31 pairs. This shows that our
biastuning approach is not only more efective in steering
features, but also produces musically more compelling
results.</p>
        <p>Discussion It may seem surprising that bias-tuning
was able to produce samples that were perceived as more
musical than the original expressive unconditional model.</p>
        <p>We hypothesize that this is because we are essentially
sampling from the “tail” of the unconditioned model’s
distribution when using rejection sampling (i.e. only
accepting samples that exhibit the requested feature).</p>
        <p>Since the unconditional model is not trained to generate
specific features, we may have to disregard for example
99.8% of the samples generated before finding a satisfac- same prime but with a diferent “vibe” by invoking a
diftory sample (as seen in the 6-feature setting described ferent set of low-level features. 6 This allowed her to
in subsection 6.1). This resulting distribution is very dif- combine her knowledge of non-automated algorithmic
ferent from the distribution of the unconditional model composition to compose a “quasi-algorithmic" suite of
without rejection sampling. variations. Algorithmic composition involves using
com</p>
        <p>We further hypothesize that, via the proxy loss of in- putational logic to choose notes; instead, in this piece,
creasing the likelihood of a set of continuations with a she used computational logic to choose features.
given feature, a bias-tuned model learns more likely ways
to satisfy that given feature. Hence, its continuations are
not only more likely to be efective at satisfying that
feature, but also to do so in a musically likely manner, which
is often rated by listeners as more musical too.</p>
        <p>
          Composer’s reflection The composer’s findings
were that the tool enabled her on every end of the
humanmachine spectrum: as a composer who occasionally
wants to avoid using any generation tools in their final
output, as a composer who wants to co-create in order
7. Case study: Human-AI to minimize both manual coding and manual composing,
and as an algorithmic composer. As a composer trying
co-creation to improve her “manual" composition and analysis skills
within specific scenarios (e.g., fast-paced and tense music
In the previous sections, we evaluated our approach algo- with chromaticism but also a degree of stasis), and who
rithmically, and found that it is quite efective in steering is used to having to manually search for examples of
music transformers to generate continuations with multi- repertoire having such properties in order to learn from
ple requested features present. In this section, we wanted them, the ability to generate a piece of music tailored to
to understand how a generative model with such steer- a specific scenario forms a surprisingly powerful
pedaability can be useful in a human-AI co-creation setting. gogical tool which she found that she can leverage when
As a preliminary study, one asked one of our co-authors, composing by hand.
who was also a composer, to put our model to test by The workflow of co-creating music made her feel like
carrying out computer-assisted composition. she was the author (rather than the computer), but
allowed her to create piano music of a complexity and
User background and creative strategies Our com- virtuosity her piano skills do not currently aford. The
poser had a background in both purely human and algo- composer felt that this tool was even more powerful in
rithmic composition. She experimented with using the the context of algorithmic composition. Her typical
worktool in several settings. In some settings, her goal was to lfow in algorithmic composition involves using tools such
co-create music of a specific high-level “vibe", which she as SuperCollider [51] to adapt an existing algorithm to
accomplished through the inclusion of specific features generate low-level notes; here she could still use
algoand manually curating the generated samples. rithmic thinking, but at a higher level of abstraction. For
instance, to compose the suite of variations presented
• To achieve “pleasant but low energy”, she chose earlier, she wrote a python program to orchestrate which
features such as diatonicism and harmonic stasis, set of musical features are used for each variation. In
block chords, lack of extreme dynamic change, and contrast, while in the past she would invoke a Markov
a mostly low rhythmic density. Listen to this ex- chain to flesh out each variation, here she could invoke a
ample and other “vibes” online. 5 powerful generative model.
• Inspired by a given prime that had a “roller- In this case study, as the composer was also the creator
coaster"-like quality, she wanted to write a piece of this tool, she was in a unique position to explore the
with a cyclic shape which oscillated between full potential of this tool. To make the tool accessible
melodic and textural extremes. She accomplished to a broader audience, future work could include “Hello
this through the coupling and cycling of features AI”-like [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] exercises to introduce how to compose
algosuch as relative and absolute rhythmic density, rithmically with transformers, akin to the ones found in
pitch height, and existence of block chords. textbooks on SuperCollider and other algorithmic
composition environments.
        </p>
        <p>Seeing that it is possible to steer music transformer to
generate specific high-level “vibes” through low-level
features, our composer was inspired to compose a theme
and variation where each variation would continue the</p>
      </sec>
      <sec id="sec-3-3">
        <title>5Listen to diferent co-created “vibes” at https://storage.</title>
        <p>googleapis.com/composing-features/index.html#vibes.</p>
      </sec>
      <sec id="sec-3-4">
        <title>6Listen to a co-created theme and variation at https://storage.</title>
        <p>googleapis.com/composing-features/index.html.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>8. Related Literature</title>
      <p>language-model features, has shown significant
enhancements in few-shot learning, but is typically not performed
compositionally [53].</p>
      <p>Our work builds on language modeling, further exploring
ways to “tune” these models for user control, through
“fine-tuning” approaches such as prefix and bias tuning.</p>
      <p>In particular, we leverage contrastive learning and
compositionality to derive a lightweight augmentation for
steering large transformer-based language models. Our
approach enables human-AI co-creation, enabling users
to compose at a higher level of abstraction by specifying
features while music transformers fills in the notes. In
the following, we provide a brief overview for each of
the aforementioned related research areas.</p>
      <p>Contrastive learning Contrastive learning is used in
representation learning to train a network which maps
“similar” (positive) inputs to nearby representations and
“dissimilar” (negative) inputs far away from the positive
inputs. See [26] for a theoretical framework and overview.</p>
      <p>
        In generative modeling, contrastive divergence [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] was
proposed to train Restricted Boltzmann Machines [
        <xref ref-type="bibr" rid="ref30">43</xref>
        ],
image-to-image translation models [
        <xref ref-type="bibr" rid="ref1 ref22">1, 35, 30</xref>
        ], and
conditional [23] and unconditional [22] generative adverarial
networks [
        <xref ref-type="bibr" rid="ref12 ref38">12</xref>
        ]. Our contrastive formulation difers from
Language models as generative models This work previous work in two ways. First, rather than selecting
would be impossible without the wealth of research on positive and negative examples related to the
conditiontransformer models for sequence generation tasks. Start- ing signal (musical features in our case) and using the
ing with Vaswani et al. [
        <xref ref-type="bibr" rid="ref35">48</xref>
        ], researchers realized that this contrastive loss to predict which example “agrees” with
paradigm enables far more coherent and diverse genera- the signal, we select positive and negative conditioning
tion than the recurrent neural networks typically used signals (i.e. diferent musical features) and use the
conbefore [
        <xref ref-type="bibr" rid="ref27">40</xref>
        ]. Another milestone was the development of trastive loss to predict which conditioning signal explains
GPT-3, which showed that with suficient size/training the prime–continuation pair best. Second, we also treat
data such models could potentially perform few shot the absence of a conditioning signal (i.e. the original
genlearning [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Several subsequent papers, however, demon- erative model) as a negative conditioning signal, meaning
strated the inadequacy of few-shot learning for many that we want the model conditioned on the “correct”
mutasks [
        <xref ref-type="bibr" rid="ref24">37</xref>
        ]. A few alternatives to online few-shot learning sical features to explain the prime–continuation pair
bethave since emerged. ter than the unconditional model. Other work leveraging
language models for multiple tasks include CTRL [25]
“Finetuning” language models Prompt tuning was and Plug-and-Play models [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. However, CTRL has the
initially explored ad-hoc in the context of finding ways disadvantage that it requires knowing the tasks during
to produce interesting output by GPT-3, and was formal- the training of the large language model, while
Plugized by [27]. While originally prompts were designed as and-Play requires multiple passes through the language
tokens from the transformer’s vocabulary, subsequent model, which can be expensive for suficiently large
modstudies generalized them to any embeddings prepended els.
to the input [28]. We extend research into prefix
tuning by considering aggregation methods among com- Controllable generative models for music
Adpositional prefixes. Feature-wise transformations, such vances in sequence modeling [
        <xref ref-type="bibr" rid="ref33 ref34 ref7">46, 49, 47, 7</xref>
        ] has enabled
as elementwise scaling and/or biasing of features in a long-form music generation in both the symbolic
donetwork based on side-information (such as labels for a main [
        <xref ref-type="bibr" rid="ref16 ref20 ref23">33, 20, 36, 31, 16</xref>
        ] and the audio domain [46, 14, 10,
conditioned task), have been applied in a wide variety 9]. Similar to language, researchers in music generation
of problem settings—see [
        <xref ref-type="bibr" rid="ref11 ref37">11</xref>
        ] for an overview. We draw have been adapting these language models towards
condirect inspiration from BitFit’s bias-tuning approach [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] trollable generation, such as by conditioning on one part
and cast it as a feature-wise transformation approach. By of a musical piece to complete the rest, such as melody
factoring the additive perturbations in Figure 3 into their harmonization [
        <xref ref-type="bibr" rid="ref28 ref6">41, 29, 6</xref>
        ], or more generally arbitrary
parpreceding layers, our bias-tuning implementation can be tial score completion [
        <xref ref-type="bibr" rid="ref13 ref17">17, 13</xref>
        ]. Representational learning
described as a multi-task variant of BitFit where the pa- approaches such as autoencoders (AEs) and variational
rameterizations are tied across features. A key diference autoencoders (VAEs) have also been used for steering
is that our feature-specific adaptations are designed to interpolations or transformations along learned latent
be composable, which to our best knowledge has not yet dimensions, a low-level disentangled attribute-based
dibeen explored in the context of large language models— mension such as note density [
        <xref ref-type="bibr" rid="ref26">39, 24</xref>
        ], or a high-level
although compositional adaptations using feature-wise learned dimension such as energy level in mood that is
multiplicative interactions have been studied in the con- then realized through its mapping to low-level features
text of zero-shot image classification [
        <xref ref-type="bibr" rid="ref25">38</xref>
        ]. Similarly, side- such as note-density and rhythm [
        <xref ref-type="bibr" rid="ref32">45</xref>
        ], controlling chord
tuning, or summing task-specific features with general progressions and texture independently [
        <xref ref-type="bibr" rid="ref29">42, 50</xref>
        ], or
rearranging a piece to have increased polyphony or rhythmic
density [52].
      </p>
      <p>
        Human-AI co-creation in music Controllable
generative models typically does not ofer users full control (i.e.
only allows users to specify a small number of low-level
or high-level controls, or through an example), while
relying on its learned stylistic distribution and/or features
encoded from the user-specified template piece to fill in
the rest of the musical details [
        <xref ref-type="bibr" rid="ref19 ref31">44, 32, 19</xref>
        ]. In contrast,
traditional constraint satisfaction based music generation
systems do not have prior knowledge of the desired
stylistic distribution, instead rely on users to specify a large
number of musical constraints to guide its search (see [34]
for a survey). When using the former systems, users may
still feel a lack of agency, while the latter can impose a
laborious process. Our approach explores the space in
between, allowing users to compose multiple features along
diferent musical dimensions for short chunks (similar
to constraint specification), while leveraging pretrained
transformers‘ expressiveness to aid users in maintaining
coherence in virtuosic long-form composition.
      </p>
    </sec>
    <sec id="sec-5">
      <title>9. Conclusion</title>
      <p>We have shown that music transformers can be directed
towards a specific generative “task" using some of the
same methods as natural language transformers. In
addition, we have studied compositionality in this domain.
Compositionality (including relatively high levels of
compositionality) is critical in the music domain (and other
domains) if the user wants control over the output. We
establish that compositionality is a hard problem, and
propose adaptations of several solutions from the
literature (bias-tuning and prefix-tuning) to address these
challenges. We find success with bias-tuning, but not with
prefix-tuning. While our results are promising, there is
clearly significant room for improving on the eficacy of
steering a transformer compositionally.</p>
      <p>We have provided a preliminary demonstration of how
our approach, compositional steering, enables human-AI
co-creation, where musicians can compose on the level
of musical “features” as opposed to notes. Musicians
can prototype the high-level “shape” of the music by
specifying how various features change throughout the
piece, and in turn steer and curate music transformers to
creatively fill in the details. We envision future work to
enable end-user machine learning where users can define
their own features or provide their own musical examples,
and leverage our lightweight compositional bias-tuning
approach to learn new controls to steer expressive music
transformers compositionally.
1. “Significantly lower average pitch than prime"
2. “Significantly higher average pitch than prime"
3. “Significantly higher number of grouped attacks
than prime"
4. “Significantly lower number of grouped attacks
than prime"
5. “Significantly more notes per second than prime"
6. “Significantly fewer notes per second than prime"
7. “Could fit in the same key as the prime"
8. “Has 2 or more pitch classes (pitch mod 12) which
the prime doesn’t have"</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Kyungjune</given-names>
            <surname>Baek</surname>
          </string-name>
          , Yunjey Choi, Youngjung Uh, Jaejun Yoo, and
          <string-name>
            <given-names>Hyunjung</given-names>
            <surname>Shim</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>Rethinking the truly unsupervised image-to-image translation</article-title>
          .
          <source>In Proceedings of the IEEE/CVF International Conference on Computer Vision</source>
          . 14154-
          <fpage>14163</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Elad</given-names>
            <surname>Ben-Zaken</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Shauli</given-names>
            <surname>Ravfogel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Yoav</given-names>
            <surname>Goldberg</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>BitFit: Simple Parameter-eficient Finetuning for Transformer-based Masked Languagemodels</article-title>
          .
          <source>ArXiv abs/2106</source>
          .10199 (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          , Réjean Ducharme, Pascal Vincent, and
          <string-name>
            <given-names>Christian</given-names>
            <surname>Janvin</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>A neural probabilistic language model</article-title>
          .
          <source>The journal of machine learning research 3</source>
          (
          <year>2003</year>
          ),
          <fpage>1137</fpage>
          -
          <lpage>1155</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Tom</given-names>
            <surname>Brown</surname>
          </string-name>
          , Benjamin Mann, Nick Ryder, Melanie Subbiah,
          <string-name>
            <surname>Jared D Kaplan</surname>
            , Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jefrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler,
            <given-names>Mateusz</given-names>
          </string-name>
          <string-name>
            <surname>Litwin</surname>
          </string-name>
          , Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner,
          <string-name>
            <surname>Sam</surname>
            <given-names>McCandlish</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Alec</given-names>
            <surname>Radford</surname>
          </string-name>
          , Ilya Sutskever, and
          <string-name>
            <given-names>Dario</given-names>
            <surname>Amodei</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Language Models are Few-Shot Learners</article-title>
          .
          <source>Advances in Neural Information Processing Systems</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Carrie</surname>
            <given-names>J Cai</given-names>
          </string-name>
          , Samantha Winter, David Steiner,
          <string-name>
            <given-names>Lauren</given-names>
            <surname>Wilcox</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Michael</given-names>
            <surname>Terry</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>"Hello AI": Uncovering the onboarding needs of medical practitioners for human-AI collaborative decisionmaking</article-title>
          .
          <source>Proceedings of the ACM on Humancomputer Interaction 3</source>
          ,
          <string-name>
            <surname>CSCW</surname>
          </string-name>
          (
          <year>2019</year>
          ),
          <fpage>1</fpage>
          -
          <lpage>24</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Kristy</given-names>
            <surname>Choi</surname>
          </string-name>
          , Curtis Hawthorne, Ian Simon, Monica Dinculescu, and
          <string-name>
            <given-names>Jesse</given-names>
            <surname>Engel</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Encoding musical style with transformer autoencoders</article-title>
          .
          <source>In International Conference on Machine Learning. PMLR</source>
          ,
          <year>1899</year>
          -
          <fpage>1908</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Krzysztof</given-names>
            <surname>Choromanski</surname>
          </string-name>
          , Valerii Likhosherstov, David Dohan,
          <string-name>
            <given-names>Xingyou</given-names>
            <surname>Song</surname>
          </string-name>
          , Andreea Gane, Tamas Sarlos,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Hawkins</surname>
          </string-name>
          , Jared Davis, Afroz Mohiuddin,
          <string-name>
            <given-names>Lukasz</given-names>
            <surname>Kaiser</surname>
          </string-name>
          , et al.
          <year>2020</year>
          .
          <article-title>Rethinking attention with performers</article-title>
          .
          <source>ICLR</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Sumanth</given-names>
            <surname>Dathathri</surname>
          </string-name>
          , Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yosinski</surname>
          </string-name>
          , and Rosanne Liu.
          <year>2020</year>
          .
          <article-title>Plug and Play Language Models: A Simple Approach to Controlled Text Generation</article-title>
          . ArXiv abs/
          <year>1912</year>
          .02164 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Prafulla</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          , Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, and
          <string-name>
            <given-names>Ilya</given-names>
            <surname>Sutskever</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Jukebox: A Generative Model for Music</article-title>
          . arXiv preprint arXiv:
          <year>2005</year>
          .
          <volume>00341</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Sander</surname>
            <given-names>Dieleman</given-names>
          </string-name>
          , Aaron van den Oord, and
          <string-name>
            <given-names>Karen</given-names>
            <surname>Simonyan</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>The challenge of realistic music generation: modelling raw audio at scale</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          . [22]
          <string-name>
            <given-names>Alexia</given-names>
            <surname>Jolicoeur-Martineau</surname>
          </string-name>
          .
          <year>2018</year>
          . The relativistic
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Vincent</surname>
            <given-names>Dumoulin</given-names>
          </string-name>
          , Ethan Perez,
          <article-title>Nathan Schucher, discriminator: a key element missing from standard Florian Strub</article-title>
          , Harm de Vries, Aaron Courville, GAN. arXiv preprint arXiv:
          <year>1807</year>
          .
          <volume>00734</volume>
          (
          <year>2018</year>
          ). and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Feature-wise transfor-</article-title>
          [23]
          <string-name>
            <given-names>Minguk</given-names>
            <surname>Kang</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jaesik</given-names>
            <surname>Park</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Contragan: mations</article-title>
          .
          <source>Distill</source>
          (
          <year>2018</year>
          ). https://doi.org/10.23915/
          <article-title>Contrastive learning for conditional image generadistill</article-title>
          .
          <volume>00011</volume>
          https://distill.pub/2018/feature-wise- tion. arXiv preprint arXiv:
          <year>2006</year>
          .
          <volume>12681</volume>
          (
          <year>2020</year>
          ). transformations. [24]
          <string-name>
            <surname>Lisa</surname>
            <given-names>Kawai</given-names>
          </string-name>
          , Philippe Esling, and Tatsuya Harada.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Ian</surname>
            <given-names>Goodfellow</given-names>
          </string-name>
          , Jean Pouget-Abadie, Mehdi Mirza,
          <year>2020</year>
          .
          <article-title>Attributes-aware deep music transformation</article-title>
          .
          <source>Bing Xu</source>
          , David Warde-Farley, Sherjil Ozair,
          <source>Aaron In Proceedings of the 21st international society for Courville, and Yoshua Bengio</source>
          .
          <year>2014</year>
          .
          <article-title>Generative music information retrieval conference, ismir</article-title>
          . adversarial nets.
          <source>Advances in neural information [25] Nitish Shirish Keskar</source>
          ,
          <string-name>
            <surname>Bryan</surname>
            <given-names>McCann</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lav R</surname>
          </string-name>
          . Varshprocessing systems
          <volume>27</volume>
          (
          <year>2014</year>
          ). ney, Caiming Xiong, and Richard Socher.
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Gaëtan</surname>
            <given-names>Hadjeres</given-names>
          </string-name>
          , François Pachet, and
          <string-name>
            <surname>Frank</surname>
            <given-names>CTRL</given-names>
          </string-name>
          :
          <string-name>
            <given-names>A Conditional</given-names>
            <surname>Transformer Language Model Nielsen</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>DeepBach: a Steerable Model for for Controllable Generation</article-title>
          . CoRR abs/
          <year>1909</year>
          .05858
          <string-name>
            <given-names>Bach</given-names>
            <surname>Chorales</surname>
          </string-name>
          <article-title>Generation</article-title>
          . In International Confer- (
          <year>2019</year>
          ). arXiv:
          <year>1909</year>
          .05858 http://arxiv.org/abs/1909. ence on
          <source>Machine Learning</source>
          .
          <fpage>1362</fpage>
          -
          <lpage>1371</lpage>
          .
          <fpage>05858</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Curtis</surname>
            <given-names>Hawthorne</given-names>
          </string-name>
          , Andriy Stasyuk, Adam Roberts, [
          <volume>26</volume>
          ]
          <string-name>
            <surname>Phuc H Le-Khac</surname>
            ,
            <given-names>Graham</given-names>
          </string-name>
          <string-name>
            <surname>Healy</surname>
          </string-name>
          , and
          <string-name>
            <surname>Alan F Ian Simon</surname>
          </string-name>
          , Cheng-Zhi Anna Huang,
          <source>Sander Diele- Smeaton</source>
          .
          <year>2020</year>
          .
          <article-title>Contrastive representation learning: man</article-title>
          , Erich Elsen, Jesse Engel, and
          <string-name>
            <given-names>Douglas</given-names>
            <surname>Eck</surname>
          </string-name>
          .
          <article-title>A framework and review</article-title>
          .
          <source>IEEE Access</source>
          (
          <year>2020</year>
          ).
          <year>2019</year>
          . Enabling Factorized Piano Music Modeling [27]
          <string-name>
            <surname>Brian</surname>
            <given-names>Lester</given-names>
          </string-name>
          , Rami Al-Rfou, and
          <string-name>
            <given-names>Noah</given-names>
            <surname>Constant</surname>
          </string-name>
          .
          <article-title>and Generation with the MAESTRO Dataset</article-title>
          . In In- 2021.
          <article-title>The Power of Scale for Parameter-Eficient ternational Conference on Learning Representations. Prompt Tuning</article-title>
          .
          <source>In Proceedings of the Conference on https://openreview.net/forum?id=r1lYRjC9F7 Empirical Methods in Natural Language Processing.</source>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Geofrey</surname>
            <given-names>E</given-names>
          </string-name>
          <string-name>
            <surname>Hinton</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>Training products of ex- Association for Computational Linguistics. perts by minimizing contrastive divergence</article-title>
          .
          <source>Neural [28] Xiang Lisa Li and Percy Liang</source>
          .
          <year>2021</year>
          .
          <article-title>Prefix-Tuning: computation 14</article-title>
          <issue>, 8</issue>
          (
          <year>2002</year>
          ),
          <fpage>1771</fpage>
          -
          <lpage>1800</lpage>
          .
          <article-title>Optimizing Continuous Prompts for Generation.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Wen-Yi</surname>
            <given-names>Hsiao</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jen-Yu</surname>
            <given-names>Liu</given-names>
          </string-name>
          , Yin-Cheng Yeh, and Yi- CoRR
          <source>abs/2101</source>
          .00190 (
          <year>2021</year>
          ). arXiv:
          <volume>2101</volume>
          .00190
          <string-name>
            <given-names>Hsuan</given-names>
            <surname>Yang</surname>
          </string-name>
          .
          <year>2021</year>
          . Compound Word Transformer: https://arxiv.org/abs/2101.00190 Learning to Compose Full-Song Music over Dy- [29]
          <string-name>
            <given-names>Feynman</given-names>
            <surname>Liang</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>BachBot: Automatic componamic Directed Hypergraphs</article-title>
          .
          <source>AAAI</source>
          (
          <year>2021</year>
          ).
          <article-title>sition in the style of Bach chorales</article-title>
          .
          <source>Masters thesis,</source>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Cheng-Zhi Anna</surname>
            <given-names>Huang</given-names>
          </string-name>
          , Tim Cooijmans, Adam University of Cambridge (
          <year>2016</year>
          ). Roberts, Aaron Courville, and
          <string-name>
            <given-names>Doug</given-names>
            <surname>Eck</surname>
          </string-name>
          .
          <year>2017</year>
          . [30]
          <string-name>
            <surname>Rui</surname>
            <given-names>Liu</given-names>
          </string-name>
          , Yixiao Ge, Ching Lam Choi,
          <article-title>Xiaogang Counterpoint by Convolution</article-title>
          .
          <source>In Proceedings of the Wang, and Hongsheng Li</source>
          .
          <year>2021</year>
          . Divco: Diverse International Conference on Music Information Re-
          <article-title>conditional image synthesis via contrastive gentrieval. erative adversarial network</article-title>
          .
          <source>In Proceedings of the</source>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Cheng-Zhi Anna</surname>
            <given-names>Huang</given-names>
          </string-name>
          , Curtis Hawthorne,
          <string-name>
            <surname>Adam</surname>
            <given-names>IEEE</given-names>
          </string-name>
          /CVF Conference on Computer Vision and PatRoberts, Monica Dinculescu, James Wexler,
          <source>Leon tern Recognition</source>
          .
          <volume>16377</volume>
          -
          <fpage>16386</fpage>
          . Hong, and
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Howcroft</surname>
          </string-name>
          .
          <year>2019</year>
          . The Bach Doodle: [31]
          <string-name>
            <surname>Antoine</surname>
            <given-names>Liutkus</given-names>
          </string-name>
          , Ondřej Cıfka,
          <string-name>
            <surname>Shih-Lun</surname>
            <given-names>Wu</given-names>
          </string-name>
          ,
          <article-title>Umut Approachable music composition with machine Simsekli</article-title>
          ,
          <string-name>
            <surname>Yi-Hsuan Yang</surname>
          </string-name>
          , and Gael Richard.
          <year>2021</year>
          .
          <article-title>learning at scale</article-title>
          .
          <source>ISMIR</source>
          (
          <year>2019</year>
          ).
          <article-title>Relative positional encoding for transformers with</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Cheng-Zhi Anna</surname>
            <given-names>Huang</given-names>
          </string-name>
          ,
          <article-title>Hendrik Vincent Koops, linear complexity</article-title>
          . In International Conference on Ed Newton-Rex,
          <article-title>Monica Dinculescu, and Carrie J Machine Learning</article-title>
          . PMLR,
          <fpage>7067</fpage>
          -
          <lpage>7079</lpage>
          . Cai.
          <year>2020</year>
          .
          <article-title>AI Song Contest: Human-AI co-</article-title>
          creation [32]
          <string-name>
            <surname>Ryan</surname>
            <given-names>Louie</given-names>
          </string-name>
          , Andy Coenen, Cheng Zhi Huang, in songwriting.
          <source>ISMIR</source>
          (
          <year>2020</year>
          ). Michael Terry, and
          <string-name>
            <given-names>Carrie J.</given-names>
            <surname>Cai</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Novice-AI</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Cheng-Zhi Anna</surname>
            <given-names>Huang</given-names>
          </string-name>
          , Ashish Vaswani,
          <string-name>
            <surname>Jakob Music</surname>
          </string-name>
          Co-Creation via
          <article-title>AI-Steering Tools for Deep Uszkoreit</article-title>
          , Ian Simon, Curtis Hawthorne,
          <source>Noam Generative Models. Conference on Human Factors Shazeer</source>
          , Andrew M Dai,
          <string-name>
            <given-names>Matthew D</given-names>
            <surname>Hofman</surname>
          </string-name>
          , Mon- in
          <source>Computing Systems (CHI)</source>
          (
          <year>2020</year>
          ). ica Dinculescu, and
          <string-name>
            <given-names>Douglas</given-names>
            <surname>Eck</surname>
          </string-name>
          .
          <year>2019</year>
          . Music [33]
          <string-name>
            <surname>Sageev</surname>
            <given-names>Oore</given-names>
          </string-name>
          , Ian Simon, Sander Dieleman, DouTransformer. In International Conference on Learn- glas
          <string-name>
            <surname>Eck</surname>
            , and
            <given-names>Karen</given-names>
          </string-name>
          <string-name>
            <surname>Simonyan</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>This time ing Representations. with feeling: Learning expressive musical perfor-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Natasha</surname>
            <given-names>Jaques</given-names>
          </string-name>
          , Shixiang Gu, Richard E. Turner, mance.
          <source>Neural Computing and Applications</source>
          <volume>32</volume>
          ,
          <article-title>4</article-title>
          and
          <string-name>
            <given-names>Douglas</given-names>
            <surname>Eck</surname>
          </string-name>
          .
          <year>2016</year>
          . Tuning Recurrent Neu- (
          <year>2020</year>
          ),
          <fpage>955</fpage>
          -
          <lpage>967</lpage>
          .
          <article-title>ral Networks with Reinforcement Learning</article-title>
          .
          <source>CoRR [34] François Pachet and Pierre Roy</source>
          .
          <year>2001</year>
          . Musical harabs/1611.02796 (
          <year>2016</year>
          ). arXiv:
          <volume>1611</volume>
          .02796 http:/
          <article-title>/ monization with constraints: A survey. Constraints arxiv</article-title>
          .
          <source>org/abs/1611.02796 6</source>
          ,
          <issue>1</issue>
          (
          <year>2001</year>
          ),
          <fpage>7</fpage>
          -
          <lpage>19</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>Taesung</given-names>
            <surname>Park</surname>
          </string-name>
          , Alexei A Efros, Richard Zhang, and Uszkoreit, Llion Jones,
          <string-name>
            <given-names>Aidan N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , Lukasz
          <string-name>
            <surname>Jun-Yan Zhu</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Contrastive learning for un- Kaiser, and</article-title>
          <string-name>
            <given-names>Illia</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Attention Is All paired image-to-image translation</article-title>
          .
          <source>In European You Need. CoRR</source>
          (
          <year>2017</year>
          ).
          <source>Conference on Computer Vision</source>
          . Springer,
          <fpage>319</fpage>
          -
          <lpage>345</lpage>
          . [49]
          <string-name>
            <surname>Ashish</surname>
            <given-names>Vaswani</given-names>
          </string-name>
          , Noam Shazeer, Niki Parmar, Jakob
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>Christine</given-names>
            <surname>Payne</surname>
          </string-name>
          .
          <year>2019</year>
          . MuseNet. https://openai. Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz com/blog/musenet. Accessed:
          <fpage>2020</fpage>
          -05-
          <lpage>04</lpage>
          . Kaiser, and
          <string-name>
            <given-names>Illia</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          .
          <year>2017</year>
          . Attention is all
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [37]
          <string-name>
            <surname>Ethan</surname>
            <given-names>Perez</given-names>
          </string-name>
          , Douwe Kiela, and Kyunghyun Cho.
          <article-title>you need</article-title>
          .
          <source>In Advances in neural information pro2021</source>
          .
          <article-title>True Few-Shot Learning with Language Mod- cessing systems</article-title>
          .
          <volume>5998</volume>
          -
          <fpage>6008</fpage>
          . els.
          <source>CoRR abs/2105</source>
          .11447 (
          <year>2021</year>
          ). arXiv:
          <volume>2105</volume>
          .11447 [50]
          <string-name>
            <surname>Ziyu</surname>
            <given-names>Wang</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dingsu</surname>
            <given-names>Wang</given-names>
          </string-name>
          , Yixiao Zhang, and Gus https://arxiv.org/abs/2105.11447 Xia.
          <year>2020</year>
          .
          <article-title>Learning interpretable representation for</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [38]
          <string-name>
            <surname>Senthil</surname>
            <given-names>Purushwalkam</given-names>
          </string-name>
          , Maximilian Nickel,
          <article-title>Abhi- controllable polyphonic music generation</article-title>
          .
          <source>ISMIR nav Gupta, and Marc'Aurelio Ranzato</source>
          .
          <year>2019</year>
          . Task- (
          <year>2020</year>
          ).
          <article-title>driven modular networks for zero-shot composi-</article-title>
          [51]
          <string-name>
            <given-names>Scott</given-names>
            <surname>Wilson</surname>
          </string-name>
          , David Cottle, and Nick Collins.
          <year>2011</year>
          .
          <article-title>tional learning</article-title>
          .
          <source>In Proceedings of the IEEE/CVF In- The SuperCollider Book. The MIT Press. ternational Conference on Computer Vision</source>
          . 3593- [52]
          <string-name>
            <surname>Shih-Lun Wu</surname>
          </string-name>
          and
          <string-name>
            <surname>Yi-Hsuan Yang</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>MuseMor3602. phose: Full-Song and</article-title>
          <string-name>
            <surname>Fine-Grained Music</surname>
          </string-name>
          Style
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [39]
          <string-name>
            <surname>Adam</surname>
            <given-names>Roberts</given-names>
          </string-name>
          , Jesse Engel, Colin Rafel,
          <article-title>Curtis Transfer with Just One Transformer VAE</article-title>
          . arXiv Hawthorne, and
          <string-name>
            <given-names>Douglas</given-names>
            <surname>Eck</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>A hierarchical preprint</article-title>
          arXiv:
          <volume>2105</volume>
          .04090 (
          <year>2021</year>
          ).
          <article-title>latent vector model for learning long-term structure</article-title>
          [53]
          <string-name>
            <surname>Jefrey</surname>
            <given-names>O.</given-names>
          </string-name>
          <string-name>
            <surname>Zhang</surname>
          </string-name>
          , Alexander Sax,
          <article-title>Amir Roshan in music</article-title>
          .
          <source>ICML</source>
          (
          <year>2018</year>
          ). Zamir,
          <string-name>
            <given-names>Leonidas J.</given-names>
            <surname>Guibas</surname>
          </string-name>
          , and Jitendra Malik.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>Alex</given-names>
            <surname>Sherstinsky</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Fundamentals of Recurrent 2019</article-title>
          .
          <article-title>Side-Tuning: Network Adaptation via AdNeural Network (RNN) and Long Short-Term Mem- ditive Side Networks</article-title>
          . CoRR abs/
          <year>1912</year>
          .13503 (
          <year>2019</year>
          ).
          <article-title>ory (LSTM) Network</article-title>
          . CoRR abs/
          <year>1808</year>
          .03314 (
          <year>2018</year>
          ). arXiv:
          <year>1912</year>
          .13503 http://arxiv.org/abs/
          <year>1912</year>
          .13503 arXiv:
          <year>1808</year>
          .03314 http://arxiv.org/abs/
          <year>1808</year>
          .03314
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [41]
          <string-name>
            <surname>Ian</surname>
            <given-names>Simon</given-names>
          </string-name>
          , Dan Morris, and
          <string-name>
            <given-names>Sumit</given-names>
            <surname>Basu</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>MySong: automatic accompaniment generation for A. Musical Features vocal melodies</article-title>
          .
          <source>In Proceedings of the SIGCHI conference on human factors in computing systems. ACM</source>
          .
          <article-title>Note that while some features may appear to be opposites</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [42]
          <string-name>
            <surname>Ian</surname>
            <given-names>Simon</given-names>
          </string-name>
          , Adam Roberts, Colin Rafel, Jesse Engel, (e.g.,
          <article-title>“loud" vs “soft"), and while it is true that they are Curtis Hawthorne, and</article-title>
          <string-name>
            <given-names>Douglas</given-names>
            <surname>Eck</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Learn- mutually exclusive, in fact it is possible for a sequence to ing a latent space of multitrack measures. arXiv satisfy neither (e</article-title>
          .g.,
          <article-title>if it's in a middle dynamic level)</article-title>
          . preprint arXiv:
          <year>1806</year>
          .
          <volume>00195</volume>
          (
          <year>2018</year>
          ). Absolute features:
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>P.</given-names>
            <surname>Smolensky</surname>
          </string-name>
          .
          <source>1986. Information Processing in Dynamical Systems: Foundations of Harmony Theory</source>
          .
          <volume>1</volume>
          .
          <string-name>
            <surname>“Loud</surname>
          </string-name>
          "
          <article-title>- minimum velocity is greater than 60</article-title>
          MIT Press, Cambridge, MA, USA,
          <fpage>194</fpage>
          -
          <lpage>281</lpage>
          . 2.
          <string-name>
            <surname>“Soft</surname>
          </string-name>
          "
          <article-title>- maximum velocity is less than 60</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [44]
          <string-name>
            <surname>Bob</surname>
            <given-names>L Sturm</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oded</surname>
          </string-name>
          Ben-Tal,
          <source>Una Monaghan, Nick</source>
          <volume>3</volume>
          .
          <string-name>
            <surname>“Has Dynamic Contrast" - Extreme Dynamic</surname>
            <given-names>ConCollins</given-names>
          </string-name>
          , Dorien Herremans, Elaine Chew,
          <article-title>Gaëtan trast" - The sequence has two notes whose velocHadjeres, Emmanuel Deruty, and François Pachet. ities difer by more than 30 2019</article-title>
          .
          <article-title>Machine learning research that matters for 4. “Extreme Dynamic Contrast" - The sequence has music creation: A case study</article-title>
          .
          <source>Journal of New Music two notes whose velocities difer by more than Research</source>
          <volume>48</volume>
          ,
          <issue>1</issue>
          (
          <year>2019</year>
          ).
          <fpage>70</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [45]
          <string-name>
            <surname>Hao</surname>
            <given-names>Hao</given-names>
          </string-name>
          Tan and
          <string-name>
            <given-names>Dorien</given-names>
            <surname>Herremans</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Music 5. “All consonances" - the sequence has no dissofadernets: Controllable music generation based on nances (simultaneous notes with an absolute difhigh-level features via low-level feature modelling</article-title>
          .
          <source>ference modulo 12 of 1</source>
          ,
          <issue>2</issue>
          ,
          <issue>10</issue>
          , 11, or 6)
          <string-name>
            <surname>ISMIR</surname>
          </string-name>
          (
          <year>2020</year>
          ).
          <article-title>6. “Long sharp dissonance" - the sequence has</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [46]
          <string-name>
            <surname>Aaron</surname>
            <given-names>van den Oord</given-names>
          </string-name>
          , Sander Dieleman,
          <article-title>Heiga a sharp dissonance (simultaneous notes being Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, played with a diference of 1 or 11) that lasts for Nal Kalchbrenner, Andrew Senior, and Koray a significant amount of time Kavukcuoglu</article-title>
          .
          <year>2016</year>
          .
          <article-title>WaveNet: A Generative Model 7. “Only melody" - only a single note is playing at a for Raw Audio</article-title>
          .
          <source>arXiv preprint arXiv:1609.03499 time</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [47]
          <string-name>
            <surname>Aaron</surname>
            <given-names>van den Oord</given-names>
          </string-name>
          ,
          <source>Oriol Vinyals, and Koray 8. “Few onsets" - when grouping notes according to Kavukcuoglu</source>
          .
          <year>2017</year>
          .
          <article-title>Neural discrete representation attack time, there are few groups per second learning</article-title>
          .
          <source>NeurIPS</source>
          (
          <year>2017</year>
          ).
          <article-title>9. “Many onsets" - when grouping notes according</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [48]
          <string-name>
            <surname>Ashish</surname>
            <given-names>Vaswani</given-names>
          </string-name>
          , Noam Shazeer, Niki Parmar,
          <article-title>Jakob to attack time, there are many groups per second</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          10. “
          <article-title>Blocks of two" - there are groups of two notes being played simultaneously</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          11. “
          <article-title>Larger blocks" - there are blocks of three or more notes being played simultaneously</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          12. “
          <article-title>Within single key" - all notes fit within a single major scale</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>