<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Emanuele Cosenza</string-name>
          <email>e.cosenza3@studenti.unipi.it</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Valenti</string-name>
          <email>andrea.valenti@phd.unipi.it</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Davide Bacciu</string-name>
          <email>davide.bacciu@unipi.it</email>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <abstract>
<p>Graphs can be leveraged to model polyphonic multitrack symbolic music, where notes, chords and entire sections may be linked at different levels of the musical hierarchy by tonal and rhythmic relationships. Nonetheless, there is a lack of works that consider graph representations in the context of deep learning systems for music generation. This paper bridges this gap by introducing a novel graph representation for music and a deep Variational Autoencoder that generates the structure and the content of musical graphs separately, one after the other, with a hierarchical architecture that matches the structural priors of music. By separating the structure and content of musical graphs, it is possible to condition generation by specifying which instruments are played at certain times. This opens the door to a new form of human-computer interaction in the context of music co-creation. After training the model on existing MIDI datasets, the experiments show that the model is able to generate appealing short and long musical sequences and to realistically interpolate between them, producing music that is tonally and rhythmically consistent. Finally, the visualization of the embeddings shows that the model is able to organize its latent space in accordance with known musical concepts.</p>
      </abstract>
      <kwd-group>
        <kwd>Symbolic music generation</kwd>
        <kwd>Variational Autoencoders</kwd>
        <kwd>Deep Graph Networks</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The automatic generation of artistic artifacts is gathering increasing interest, also thanks to the
possibilities offered by modern deep generative models [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ]. Despite these achievements, a
closer inspection is often enough to reveal that a piece of art is the outcome of an automatic artificial
process. While artificial models are very good at approximating the external appearance of artworks,
they still lack a way to convey an artistic message through the overall experience. This
results in artworks that are convincing but soulless, lacking general coherence and deeper
meaning. This is particularly true in the case of music, where the artist needs to be keenly aware
of the emotions evoked by a particular sequence of notes in order to stimulate a specific mood
in the listener.
      </p>
      <p>A way to circumvent the above issues is to treat deep learning models as a support to the
human artist, instead of as a replacement. The models can thus be used to automate
the low-level, routine sub-tasks of the creative process, while leaving the artist free to concentrate
on the overall picture.</p>
      <p>In this paper, we introduce a new model for the automatic generation of symbolic sequences
of multitrack, polyphonic music. The generation process is carried out through a
novel graph-based internal representation, which explicitly models the different chords
in a song and the relations between them. This representation allows the human artist to
make targeted changes to the output of the neural network in order to steer specific
aspects of the artistic performance, while leaving the model free to generate the remaining parts
in a coherent way.</p>
      <p>
        The main contributions of this paper are the following. First, we propose a novel graph
representation of multitrack, polyphonic music, where nodes represent the chords played by
different instruments and edges model the relationships between them. Second, we introduce a
deep Variational Autoencoder [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] that generates musical graphs by separating their rhythmic
structure and tonal content. To the best of our knowledge, this is the first time in the literature that
Deep Graph Networks [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] are used to generate multitrack, polyphonic music. Finally, we show
a new generative scenario enabled by our approach in which the user can intuitively condition
generation by specifying which instruments have to be played at specific timesteps.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>
        In recent years, there have been many attempts at generating symbolic music with deep learning
architectures such as LSTMs [
        <xref ref-type="bibr" rid="ref6 ref7 ref8">6, 7, 8</xref>
        ], Transformers [
        <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
        ] and CNNs [
        <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
        ], using
Variational Autoencoders (VAE) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], Generative Adversarial Networks (GAN) [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and Adversarial
Autoencoders (AAE) [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] as the main generative frameworks.
      </p>
      <p>
        One of the challenges in devising symbolic generators is choosing an appropriate
representation for music data. Researchers have therefore started to experiment with graph-based
representations, where musical entities and their relationships are modeled, respectively, by
nodes and edges. Musical graphs have been built at the note level [
        <xref ref-type="bibr" rid="ref15 ref16 ref17">15, 16, 17, 18</xref>
        ], associating
nodes to notes and edges to temporal or tonal relationships, as well as at a higher level of the
hierarchy, using melodic segments [19] and bars [20, 21] as building blocks.
      </p>
      <p>In the literature, there is a substantial lack of studies that consider graph representations
in the context of deep learning for symbolic music. The VAE-based performance renderer in
[20] and the cadence detector in [22] are, to the best of our knowledge, the only systems that
use Deep Graph Networks to process musical graphs. As for generation, the only
attempts at using graphs with deep learning are represented by PopMNet [20] and MELONS
[21]. Both works use graphs to condition the generation of monophonic music, which is carried
out by recurrent networks. In contrast to these works, our approach uses graphs at a lower level,
leveraging Deep Graph Networks to learn tonal and rhythmic representations in the context of
polyphonic multitrack music generation.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Graph-based Music Generation</title>
      <p>The proposed model processes 4/4 polyphonic, multitrack, fixed-length music sequences. Input
songs are assumed to be available as an N_b × N_t × N_s × N_p multitrack pianoroll binary tensor, where
N_b is the number of bars, N_t the number of tracks, N_s the number of timesteps in a bar and N_p the
number of possible pitches. The number of timesteps in a bar, N_s, is fixed to 32. The division of
sequences into bars is crucial since the model treats different bars separately. An example of a
multitrack pianoroll is shown in Figure 1a.</p>
      <sec id="sec-3-1">
        <title>3.1. Graph-based Music Representation</title>
        <p>We propose to represent polyphonic multitrack music by a chord-level graph G = (V, ℰ, A, X),
where V is the set of nodes, ℰ is the set of (multi-type) edges, A the set of edge features and X
the set of node features. An example of a chord-level graph is shown in Figure 1c.</p>
        <p>The structure s of G is represented by the sets V, A and ℰ. Each node v ∈ V corresponds to
the activation of a chord in a specific track and timestep. Notice that we use the term “chord”
loosely here to indicate any non-empty sequence of MIDI notes. We identify three types of
edges (u, v) ∈ ℰ: track edges, onset edges and next edges. Track edges connect nodes that
represent consecutive activations of a single track. Onset edges connect nodes that represent
simultaneous activations of different tracks. Finally, next edges connect nodes that represent
consecutive activations of different tracks in different timesteps. In order to model different
tracks, a separate track edge type is instantiated for each track. Each edge feature a_{u,v} ∈ A
contains the type of the edge (u, v) as well as the distance in timesteps between the two nodes.</p>
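        <p>To make the edge construction concrete, the following minimal sketch (our own illustration, not the released implementation; in particular, interpreting next edges as pointing to the nearest later active timestep is our assumption) derives the three edge types from a binary structure tensor:</p>
        <preformat>
# Illustrative sketch (not the authors' code): derive nodes and the three edge
# types of the chord-level graph from a binary structure tensor S of shape
# (n_bars, n_tracks, n_timesteps_per_bar).
import numpy as np

def build_edges(S):
    n_bars, n_tracks, n_steps = S.shape
    # One node per chord activation, identified by (track, global_timestep).
    nodes = [(t, b * n_steps + s)
             for b in range(n_bars) for t in range(n_tracks)
             for s in range(n_steps) if S[b, t, s]]
    edges = []  # entries: (u, v, edge_type, distance_in_timesteps)
    active = sorted({ts for _, ts in nodes})  # timesteps with any activation
    for t, ts in nodes:
        # Track edges: next activation of the same track (one type per track).
        later_same = [ts2 for t2, ts2 in nodes if t2 == t and ts2 > ts]
        if later_same:
            nxt = min(later_same)
            edges.append(((t, ts), (t, nxt), f'track_{t}', nxt - ts))
        # Onset edges: simultaneous activations of different tracks
        # (each pair appears in both directions).
        for t2, ts2 in nodes:
            if ts2 == ts and t2 != t:
                edges.append(((t, ts), (t2, ts2), 'onset', 0))
        # Next edges: activations of different tracks at the next active timestep.
        later = [ts2 for ts2 in active if ts2 > ts]
        if later:
            nxt = min(later)
            for t2, ts2 in nodes:
                if ts2 == nxt and t2 != t:
                    edges.append(((t, ts), (t2, ts2), 'next', nxt - ts))
    return nodes, edges

S = np.zeros((2, 4, 32), dtype=np.int8)  # 2 bars, 4 tracks, 32 timesteps per bar
S[0, 0, 0] = S[0, 1, 0] = S[0, 1, 16] = 1
print(build_edges(S)[1])
        </preformat>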
        <p>The content c of G is represented by the set of node features X. Node features x_v ∈ X contain
the list of notes played in correspondence of node v. The maximum number of notes in a chord,
Σ, is fixed a priori. Each note is represented as a feature vector of dimension d_n. The vector
contains information about pitch and duration, stored as a one-hot token pair. The pitch token
can assume 131 different values, which correspond to the 128 MIDI pitches with the addition of
SOS, EOS and PAD tokens. Similarly, the duration token can assume 99 different values, which
correspond to 96 different durations (yielding a maximum duration of 3 bars) with the addition
of SOS, EOS and PAD tokens.</p>
        <p>The structure of G is encoded by the tensor S ∈ {0, 1}^{N_b × N_t × N_s}. S_{b,t,s} = 1 if and only if there is
an activation of at least one note in track t at timestep s of the b-th bar. Intuitively, S_{b,t,s}
indicates whether track t is active (not counting the sustain of notes) at timestep s in the b-th
bar. An example of a structure tensor is shown in Figure 1b. The content of a chord-level graph,
on the other hand, can be encoded through a tensor X ∈ ℝ^{|V| × Σ × d_n} after fixing an ordering of V.</p>
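        <p>The following snippet sketches how such a structure tensor can be derived from a binary pianoroll tensor of note activations; variable names are illustrative:</p>
        <preformat>
# Illustrative sketch: the structure tensor S marks chord activations (not
# sustain). Assuming a binary pianoroll tensor R of shape (N_b, N_t, N_s, N_p)
# whose entries flag note activations, S simply collapses the pitch axis.
import numpy as np

N_b, N_t, N_s, N_p = 2, 4, 32, 128
R = np.random.rand(N_b, N_t, N_s, N_p) < 0.01  # toy random pianoroll
S = R.any(axis=-1).astype(np.uint8)            # shape (N_b, N_t, N_s)
assert S[0, 0, 0] == R[0, 0, 0].any()
        </preformat>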
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Deep Graph Network for Music</title>
        <p>
          Our graph-based representation of music is processed by a deep VAE [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] that reconstructs
the structure s and the content c of a chord-level graph G = (s, c). Its encoder models
the encoding distribution q_φ(z | s, c), where z ∈ ℝ^d. The decoder network, on the other hand,
models p_θ(s, c | z). After introducing the latent variables z_s ∈ ℝ^d and z_c ∈ ℝ^d, the generative
process can be formalized as follows:
p_θ(s, c, z_s, z_c | z) = p_θ(z_s | z) p_θ(z_c | z) p_θ(s | z_s) p_θ(c | z_c, s).   (1)
        </p>
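        <p>Read operationally, Equation 1 corresponds to a two-stage sampling procedure, sketched below in PyTorch-flavoured pseudocode; the decoder submodules are placeholders of our own, not the actual API of the implementation, and binarizing the structure by thresholding is just one simple choice:</p>
        <preformat>
# Schematic rendering of the generative process in Equation 1 (module names
# such as decoder.split/structure/content are hypothetical placeholders).
import torch

def generate(decoder, d=512):
    z = torch.randn(1, d)             # z sampled from the prior N(0, I)
    z_s, z_c = decoder.split(z)       # linear layer: z -> (z_s, z_c)
    S_tilde = decoder.structure(z_s)  # p(s | z_s): probabilistic structure
    s = (S_tilde > 0.5).float()       # binarize the structure tensor
    c = decoder.content(z_c, s)       # p(c | z_c, s): notes for each node
    return s, c
        </preformat>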
        <p>
          A high-level representation of the model is shown in Figure 2. The encoder consists of
two separate submodules, namely a content encoder and a structure encoder which output,
respectively, the codes   and   . The two codes are finally combined into a graph code   with
a linear layer. The decoder, on the other hand, generates the structure  and the content  of
 one after the other. First, symmetrically to the encoder, it decomposes  into two separate
latent vectors   and   through a linear layer. Then, it generates  from   through a structure
decoder and the content  from  and   through a deep graph content decoder. The content
and the structure decoder model, respectively, the distributions   ( |  ) and   ( | ,  ).
Content Encoder. In the content encoder (Figure 3a) each note is first embedded into a
 -dimensional space with a linear note encoder. Next, a linear chord encoder processes the list
of notes associated to each node, producing  -dimensional chord representations. These chord
representations are processed by an encoder Graph Convolutional Network (GCN) [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] with 
layers. We refer the reader to the supplementary material1 for details about the implementation
1https://emanuelecosenza.github.io/polyphemus/assets/suppmaterials.pdf
of the GCN. After  graph convolutional layers, a soft attention readout layer similar to that
in [23] aggregates the information contained in each subgraph   of  related to the  -th bar
passed through a linear bar compressor to obtain the final content representation   .
of the musical sequence, producing bar embeddings  1 , … ,   . The bar embeddings are finally

Structure Encoder.
        </p>
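        <p>A minimal sketch of what such a soft attention readout can look like is given below; this is our own illustration in the spirit of [23], not the exact layer from the implementation:</p>
        <preformat>
# Sketch of a soft attention readout over the node states of a bar subgraph.
import torch
import torch.nn as nn

class AttentionReadout(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.gate = nn.Linear(d, 1)  # scores each node state
        self.proj = nn.Linear(d, d)

    def forward(self, h):            # h: (n_nodes_in_bar, d)
        alpha = torch.softmax(self.gate(h), dim=0)  # attention weights
        return (alpha * self.proj(h)).sum(dim=0)    # bar embedding of shape (d,)

readout = AttentionReadout(512)
bar_emb = readout(torch.randn(7, 512))  # e.g. a bar subgraph with 7 nodes
        </preformat>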
        <p>Structure Encoder. The structure encoder (Figure 3b) takes as input the structure tensor
S ∈ {0, 1}^{N_b × N_t × N_s} and computes the code z_s. This module first encodes each bar S_b ∈ {0, 1}^{N_t × N_s} into a latent
representation s_b ∈ ℝ^d through a CNN [24] made of two convolutional layers with ReLU
activations and Batch Normalization, interleaved by max pooling. The bar representations s_1, …, s_{N_b}
are then computed by passing the signal through two dense layers. These representations are
finally concatenated and passed through a linear layer to obtain z_s.</p>
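        <p>A plausible PyTorch rendering of the bar encoder is sketched below; kernel sizes and channel counts are our assumptions, chosen only to make the tensor shapes concrete:</p>
        <preformat>
# Sketch of the bar encoder: two conv layers with ReLU and Batch Normalization,
# interleaved by max pooling, followed by two dense layers.
import torch
import torch.nn as nn

bar_encoder = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.BatchNorm2d(32), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(32 * 2 * 16, 512), nn.ReLU(),  # first dense layer
    nn.Linear(512, 512),                     # second dense layer -> s_b
)

S_b = bar_encoder(torch.randn(1, 1, 4, 32))  # one bar: (batch, 1, N_t=4, N_s=32)
        </preformat>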
        <p>Structure Decoder. The structure decoder (see Figure 4b) is specular to the structure encoder.
It first decompresses z_s into N_b structure bar representations s_1, …, s_{N_b} and decodes each of them
with a bar decoder. The bar decoder mirrors the bar encoder,
with the difference that upsample layers are interleaved with convolutional layers to obtain the
original resolution of the pianoroll. Finally, a sigmoid layer produces probability values which
are stacked to form the probabilistic structure tensor S̃.</p>
        <p>Content Decoder. The content decoder reconstructs the content of G from z_c and s. It first
decompresses z_c into c_1, …, c_{N_b} ∈ ℝ^d. Each c_b is used to initialize the states of the nodes in
the subgraph G_b, which represents the connected component related to the b-th bar of the
structure s. From there, a GCN identical to the one employed in the encoder computes the
final node state h_v ∈ ℝ^d for each node v. At this point, a linear chord decoder transforms each
h_v into the corresponding Σ note representations of dimension d. Such note
representations are decoded and passed through a softmax layer, which outputs two separate
probability distributions over pitches and durations, yielding the probabilistic tensors P̃ and D̃,
which contain, respectively, pitch and duration probabilities.</p>
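        <p>The chord decoding head can be sketched as follows; layer names and the toy value of Σ are illustrative assumptions, not the released implementation:</p>
        <preformat>
# Sketch of the chord decoding head: each final node state is mapped to Sigma
# note slots, and each slot to pitch/duration distributions.
import torch
import torch.nn as nn

d, n_notes, n_pitch, n_dur = 512, 12, 131, 99  # Sigma = 12 is a toy choice

chord_decoder = nn.Linear(d, n_notes * d)  # h_v -> Sigma note representations
pitch_head = nn.Linear(d, n_pitch)         # logits over 131 pitch tokens
dur_head = nn.Linear(d, n_dur)             # logits over 99 duration tokens

h_v = torch.randn(1, d)
notes = chord_decoder(h_v).view(n_notes, d)
P_tilde = torch.softmax(pitch_head(notes), dim=-1)  # (Sigma, 131)
D_tilde = torch.softmax(dur_head(notes), dim=-1)    # (Sigma, 99)
        </preformat>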
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Training</title>
        <p>The model is trained to minimize the following loss:
ℒ(θ, φ) = 𝔼[−log p_θ(G | z)] + β KL(q_φ(z | G) || 𝒩(0, I)),   (2)
where KL(·||·) is the KL divergence and the expectation is taken with respect to z ∼ q_φ(z | G).
Following the β-VAE framework [25], the hyperparameter β controls the trade-off between
reconstruction accuracy and latent space regularization.</p>
        <p>Since the generative process is divided in two parts, the log-likelihood term in Equation 2
can be decomposed as follows:
log p_θ(G | z) = log ( p_θ(s | z_s) p_θ(c | z_c, s) ) = log p_θ(s | z_s) + log p_θ(c | z_c, s).   (3)
The first term in Equation 3 can be derived in the following way:
log p_θ(s | z_s) = ∑_{b,t,s} [ S_{b,t,s} log S̃_{b,t,s} + (1 − S_{b,t,s}) log(1 − S̃_{b,t,s}) ],   (4)
where independence is assumed between variables.</p>
        <p>Computing the content log-likelihood in Equation 3 is trickier, since the structure generated
by the structure decoder may be different from the real one. We circumvent this problem by
using a form of teacher forcing, where the content is obtained by filling the real structure in
place of the one generated by the structure decoder. In this way, the following likelihood can
always be computed:
log p_θ(c | z_c, s) = ∑_{v ∈ V} ∑_{n=1}^{Σ} [ P_{v,n} · log P̃_{v,n} + D_{v,n} · log D̃_{v,n} ],   (5)
where P and D are tensors containing, respectively, the real one-hot pitch and duration tokens,
while P̃ and D̃ represent their probabilistic reconstructions. Independence is assumed between
all pitch and duration variables.</p>
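        <p>Putting Equations 2-5 together, the training objective can be sketched as follows, with teacher forcing on the structure (the real S is what the content decoder sees); this is a minimal sketch under the stated independence assumptions, not the exact training code:</p>
        <preformat>
# Sketch of the training loss of Equations 2-5 for a single graph.
import torch
import torch.nn.functional as F

def loss(S, S_tilde, P, P_tilde, D, D_tilde, mu, logvar, beta=0.01):
    # Equation 4: Bernoulli log-likelihood of the structure tensor.
    struct_nll = F.binary_cross_entropy(S_tilde, S, reduction='sum')
    # Equation 5: categorical log-likelihoods of pitch and duration tokens
    # (a small epsilon guards the logarithm).
    content_nll = -(P * torch.log(P_tilde + 1e-9)).sum() \
                  - (D * torch.log(D_tilde + 1e-9)).sum()
    # KL term of Equation 2 for a diagonal Gaussian posterior q(z | G).
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum()
    return struct_nll + content_nll + beta * kl
        </preformat>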
        <sec id="sec-3-3-1">
          <title>Datasets</title>
        </sec>
        <sec id="sec-3-3-2">
          <title>2 Bars</title>
          <p>LMD-matched 6,813,946
MetaMIDI Dataset 11,076,635</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>
        Following [
        <xref ref-type="bibr" rid="ref10 ref8">8, 26, 10</xref>
        ], we experiment on short and long sequences of MIDI music. The
experiments probe the generative capabilities of the model comparing, whenever possible, to state of
the art approaches. We refer the reader to the source code2 and the additional material3, which
contains the audio samples produced in the experimental phase.
      </p>
      <sec id="sec-4-1">
        <title>4.1. Data and Experimental Setup</title>
        <p>
          We use the ‘LMD-matched’ version of the Lakh MIDI Dataset [27], which contains a total of
45,129 MIDI songs. We also consider the more challenging MetaMIDI Dataset (MMD) [28], an
unexplored large-scale MIDI collection totalling 436,631 songs. For each dataset, we obtain two
new datasets containing 2-bar and 16-bar sequences represented as chord-level graphs. The
preprocessing pipeline is similar to that in [
          <xref ref-type="bibr" rid="ref10 ref8">8, 26, 10</xref>
          ]. The details about preprocessing can be
found in the supplementary material. Each preprocessed sequence is composed of 4 tracks: a
drum track, a bass track, a guitar/piano track and a strings track. The sizes of the resulting
datasets are shown in Table 1.
        </p>
        <p>The experiments focus on two versions of the model, one for 2-bar sequences and one for
16-bar sequences. We use a 70/10/20 train/validation/test split for both. The number of layers K of the GCNs is
fixed to 8. The value of d is set to 512. Adam [29] is used as the optimizer for both models, setting
1e-4 and 5e-5 as initial learning rates for the 2-bar and the 16-bar models. The learning rates
are decayed exponentially after 8000 gradient updates with a decay factor of 1 − 5e-6. The
hyperparameter β is annealed from 0 to 0.01 during training.</p>
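        <p>A sketch of this optimization setup follows; it reflects our reading of the schedule, and model, loader and compute_loss are placeholders:</p>
        <preformat>
# Sketch of the optimizer and learning-rate schedule described above.
import torch

def train(model, loader, compute_loss):
    # Adam with the 2-bar learning rate; 5e-5 is used for the 16-bar model.
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    # Exponential decay with factor 1 - 5e-6 per update.
    sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=1 - 5e-6)
    for step, batch in enumerate(loader):
        opt.zero_grad()
        loss = compute_loss(batch)  # Equations 2-5, beta annealed 0 -> 0.01
        loss.backward()
        opt.step()
        if step >= 8000:            # decay starts after 8000 gradient updates
            sched.step()
        </preformat>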
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Generation</title>
        <p>
          The first set of experiments concerns the analysis of sequences generated from random codes
z. In the manual qualitative analysis4, both the 2-bar and the 16-bar models appear to be
particularly consistent, producing reasonable chord progressions, melodic segments and drum
patterns. To provide a more quantitative assessment, following previous works [
          <xref ref-type="bibr" rid="ref10">30, 10</xref>
          ], we
measure the generative ability of the trained models by computing the following metrics on
20,000 generated sequences:
        </p>
        <p>• EB (Empty Bars): ratio of empty bars (in %).
• UPC (Used Pitch Classes): number of used pitch classes (out of 12) per bar.
• DP (Drum Patterns): ratio of notes in 16-beat patterns, which are common in popular music (in %).
2https://github.com/EmanueleCosenza/polyphemus
3https://emanuelecosenza.github.io/polyphemus/
4Audio samples of generated 2-bar and 16-bar sequences can be found here: https://emanuelecosenza.github.io/polyphemus/generation.html</p>
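        <p>The first two metrics are straightforward to compute from binary pianorolls, as in the following hedged sketch (DP depends on the drum-pattern definition of [30] and is not reproduced here):</p>
        <preformat>
# Sketch of EB and UPC for a batch of binary pianorolls R of shape
# (n_seqs, n_bars, n_tracks, n_steps, 128).
import numpy as np

def empty_bars(R):
    # Ratio (in %) of track-bars with no activation at all.
    return (~R.any(axis=(3, 4))).mean() * 100

def used_pitch_classes(R):
    used = R.any(axis=3)                 # (seqs, bars, tracks, 128)
    pc = np.zeros(used.shape[:-1] + (12,), dtype=bool)
    for p in range(used.shape[-1]):
        pc[..., p % 12] |= used[..., p]  # fold MIDI pitch p into class p % 12
    return pc.sum(axis=-1).mean()        # average per bar and track

R = np.random.rand(10, 2, 4, 32, 128) < 0.005
print(empty_bars(R), used_pitch_classes(R))
        </preformat>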
        <p>Table 2. Generation metrics of the proposed model, Calliope and the jamming, composer and hybrid versions of
MuseGAN, for models trained on LMD-matched and the MetaMIDI Dataset (EB: empty bars (%), UPC: number of used pitch classes, DP: drum patterns (%); D: drums, B:
bass, G/P: guitar/piano, S: strings).</p>
        <p>
          Table 2 compares our models with Calliope [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] and the jamming, composer and hybrid versions of MuseGAN [30]. We also include metrics for the models trained on the
MetaMIDI Dataset with the goal of stimulating research on larger MIDI collections. The EB
values are never equal to zero, which indicates that there are no issues with holes in the latent
space and that the models do not ignore the latent codes during decoding. The UPC values
are consistently low, indicating that the models have learned to stick to specific tonalities in
the context of single bars. Additionally, the DP values for the proposed model are the highest,
confirming its consistency on the rhythmic level. These results further validate the proposed
methodology and confirm the rhythmic and tonal coherence of the model.
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Structure-conditioned Generation</title>
        <p>The separation of structure and content in our approach allows for the replacement of the
generated structure tensor S with a new tensor Ŝ during the decoding process. This new tensor
can be modified in a similar fashion to pianoroll editing in Digital Audio Workstations (DAW).
For instance, the user can specify that a certain instrument should only be played at a specific
time in the sequence by filling the desired positions in the binary activation grid. To show
this, we operate as follows, focusing on the 2-bar model trained on the Lakh MIDI Dataset.
We start by sampling a random latent code z, from which we obtain the two representations
z_s and z_c. We then let the structure decoder produce the corresponding structure tensor S
from z_s. At this point, we modify S to our liking, obtaining a new structure tensor Ŝ. This
corresponds to adding or removing nodes from the chord-level graph being generated. Finally,
we let the content decoder compute two separate content tensors X and X̂, corresponding to
two final music sequences. For our purposes, the content decoder should be robust to changes
in the structure, replicating the same musical content represented by z_c. When listening to the
audio samples generated in this way, the model appears to be able to preserve the rhythmic
and tonal features of the original sequence, rearranging the musical content while abiding by
the imposed structure. As an example, Figure 5a shows a generated structure tensor S. The
resulting sequence contains a recognizable I-IV progression in the key of B, supported by 8-beat
bass and drum patterns5. We edit the tensor by making the drums sparser, keeping only the
nodes at the start of each beat, and by making the strings more active, adding new nodes at the
start of beats. This yields a new structure tensor Ŝ, which is shown in Figure 5b. The resulting
music produced by the content decoder maintains the same harmonic progression of the original
sequence. The bass and guitar tracks remain essentially unaltered, with only very slight variations. Finally, the
strings play a new melodic line in the right key, while the drums play a steady 4-beat hi-hat
pattern. Overall, this shows that the content decoder can adapt to new structures specified by
the user, opening the door to a new form of human-computer music co-creation.</p>
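        <p>Schematically, the procedure can be summarized as follows; decoder stands for the trained decoder with the same placeholder submodules used in the earlier sketch, and the edit shown sparsifies the drum track to beat starts (8 timesteps per beat in a 32-timestep 4/4 bar):</p>
        <preformat>
# Sketch of structure-conditioned generation: edit the structure tensor and
# let the content decoder adapt to it.
import torch

def condition_on_structure(decoder, drums=0):
    z = torch.randn(1, 512)
    z_s, z_c = decoder.split(z)
    S = (decoder.structure(z_s) > 0.5).float()  # (1, N_b, N_t, N_s)
    S_hat = S.clone()
    S_hat[:, :, drums, :] = 0.0                 # clear the drum track...
    S_hat[:, :, drums, ::8] = 1.0               # ...keeping only beat starts
    X = decoder.content(z_c, S)                 # original content
    X_hat = decoder.content(z_c, S_hat)         # content adapted to the edit
    return X, X_hat
        </preformat>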
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Embedding Visualization</title>
        <p>Similarly to [31], we explore the pitch, duration and chord embeddings by visualizing their
principal components, focusing on the encoder network of the 2-bar model trained on the Lakh
MIDI Dataset. Figure 6a shows the PCA projection in 3D space of all the 128 pitch embeddings.
Pitch projections follow a circular path along the clockwise direction, suggesting that the
model has learned the tonal relationships between different pitches. Figure 6b shows a 3D
PCA projection of chord embeddings considering every major chord obtained by picking as
roots the notes between C1 and B8. Durations are fixed to 1 beat. Similarly to what happens
for pitches, chord embeddings follow a circular path in the space and form clusters related to
specific octaves.</p>
        <p>Figure 7 shows the PCA projections in 2D space of duration embeddings considering,
respectively, all the possible 96 durations (i.e. up to three bars) and the first 32 durations (i.e. up to a
bar). In the first case (Figure 7a), two distinct clusters contain, respectively, durations above
64 (i.e. above 2 bars) and durations below 64 (i.e. below 2 bars). In the second plot (Figure 7b),
three clusters can be identified with, respectively, durations below 16 (i.e. below 2 beats, left
of the plot), durations between 16 and 24 (i.e. between 2 and 3 beats, upper-right of the plot)
and durations above 24 (i.e. between 3 beats and a bar). The plots suggest that the model has
learned to organize its duration space in accordance with the rhythmic concepts of beats and bars.
5This and other examples related to conditioned generation can be found here: https://emanuelecosenza.github.io/polyphemus/conditioned-generation.html</p>
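        <p>The projections can be reproduced with standard PCA once the embedding matrices have been extracted from the trained encoder; a sketch with a random stand-in matrix:</p>
        <preformat>
# Sketch of the embedding inspection: project the 128 pitch embeddings with
# PCA (here a random matrix stands in for the trained embeddings).
import numpy as np
from sklearn.decomposition import PCA

def project_pitch_embeddings(pitch_embeddings):
    # pitch_embeddings: array of shape (128, d), one row per MIDI pitch.
    return PCA(n_components=3).fit_transform(pitch_embeddings)  # (128, 3)

coords = project_pitch_embeddings(np.random.randn(128, 512))
        </preformat>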
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>In this work, we introduced a new graph representation for polyphonic multitrack music and a
model that generates musical graphs by separating their structure and content. As seen in the
qualitative analysis and the comparison with the state of the art, our approach proves
beneficial with regard to the rhythmic and tonal consistency of the generated music.
Through manual experiments, we showed that our methodology enables a generative scenario
where users can specify the activity of particular instruments in a music sequence. Finally, we
further validated our work by visualizing the pitch, chord and duration embeddings learned by
the model. In each case, the embedding spaces are organized in accordance with known tonal
and rhythmic concepts. To conclude, we believe that the model has the potential to support
human-computer co-creation, and it will be interesting to find possible applications of our
methodology in modern software audio tools.
</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nichol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Hierarchical text-conditional image generation with clip latents</article-title>
          ,
          <source>arXiv preprint arXiv:2204.06125</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neelakantan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shyam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          , et al.,
          <article-title>Language models are few-shot learners</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Agostinelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. I.</given-names>
            <surname>Denk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Borsos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Engel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Verzetti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Caillon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jansen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tagliasacchi</surname>
          </string-name>
          , et al.,
          <article-title>Musiclm: Generating music from text</article-title>
          ,
          <source>arXiv preprint arXiv:2301.11325</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Welling</surname>
          </string-name>
          ,
          <article-title>Auto-encoding variational bayes</article-title>
          ,
          <source>arXiv preprint arXiv:1312.6114</source>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Bacciu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Errica</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Micheli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Podda</surname>
          </string-name>
          ,
          <article-title>A gentle introduction to deep learning for graphs</article-title>
          ,
          <source>Neural Networks</source>
          <volume>129</volume>
          (
          <year>2020</year>
          )
          <fpage>203</fpage>
          -
          <lpage>221</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>H.</given-names>
            <surname>Chu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Urtasun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Fidler</surname>
          </string-name>
          ,
          <article-title>Song from pi: A musically plausible network for pop music generation</article-title>
          ,
          <source>arXiv preprint arXiv:1611.03477</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>G.</given-names>
            <surname>Brunner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wattenhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wiesendanger</surname>
          </string-name>
          ,
          <article-title>Jambot: Music theory aware chord based generation of polyphonic music with lstms</article-title>
          ,
          <source>in: 2017 IEEE 29th International Conference on Tools with Artificial Intelligence (ICTAI)</source>
          , IEEE,
          <year>2017</year>
          , pp.
          <fpage>519</fpage>
          -
          <lpage>526</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Engel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Raffel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hawthorne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Eck</surname>
          </string-name>
          ,
          <article-title>A hierarchical latent vector model for learning long-term structure in music</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>4364</fpage>
          -
          <lpage>4373</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9] C.-Z. A. Huang, A. Vaswani, J. Uszkoreit, N. Shazeer, I. Simon, C. Hawthorne, A. M. Dai, M. D. Hoffman, M. Dinculescu, D. Eck, Music transformer, arXiv preprint arXiv:1809.04281 (2018).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Valenti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Berti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bacciu</surname>
          </string-name>
          ,
          <article-title>Calliope-a polyphonic music transformer</article-title>
          ,
          <source>arXiv preprint arXiv:2107.05546</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11] C.-H. Chuan, D. Herremans, Modeling temporal tonal relations in polyphonic music through deep networks with a novel image-based representation, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12] C.-Z. A. Huang, T. Cooijmans, A. Roberts, A. Courville, D. Eck, Counterpoint by convolution, arXiv preprint arXiv:1903.07227 (2019).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>I.</given-names>
            <surname>Goodfellow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pouget-Abadie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mirza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Warde-Farley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ozair</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Courville</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          , Generative adversarial nets,
          <source>Advances in neural information processing systems</source>
          <volume>27</volume>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Makhzani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shlens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Jaitly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Goodfellow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Frey</surname>
          </string-name>
          , Adversarial autoencoders,
          <source>arXiv preprint arXiv:1511.05644</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>X. F.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. T.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Small</surname>
          </string-name>
          ,
          <article-title>Complex network structure of musical compositions: Algorithmic generation of appealing music</article-title>
          ,
          <source>Physica A: Statistical Mechanics and its Applications</source>
          <volume>389</volume>
          (
          <year>2010</year>
          )
          <fpage>126</fpage>
          -
          <lpage>132</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ferretti</surname>
          </string-name>
          ,
          <article-title>On the complex network structure of musical pieces: analysis of some use cases from different music genres</article-title>
          ,
          <source>Multimedia Tools and Applications</source>
          <volume>77</volume>
          (
          <year>2018</year>
          )
          <fpage>16003</fpage>
          -
          <lpage>16029</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ferretti</surname>
          </string-name>
          ,
          <article-title>On the modeling of musical solos as complex networks</article-title>
          ,
          <source>Information Sciences 375</source>
          (
          <year>2017</year>
          )
          <fpage>271</fpage>
          -
          <lpage>295</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] R. Mellon, D. Spaeth, E. Theis, Genre classification using graph representations of music (2014).</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] F. Simonetta, F. Carnovalini, N. Orio, A. Rodà, Symbolic music similarity through a graph-based representation, in: Proceedings of the Audio Mostly 2018 on Sound in Immersion and Emotion, 2018, pp. 1-7.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] J. Wu, X. Liu, X. Hu, J. Zhu, PopMNet: Generating structured pop music melodies using neural networks, Artificial Intelligence 286 (2020) 103303.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] Y. Zou, P. Zou, Y. Zhao, K. Zhang, R. Zhang, X. Wang, MELONS: generating melody with long-term structure using transformers and structure graph, arXiv preprint arXiv:2110.05020 (2021).</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] E. Karystinaios, G. Widmer, Cadence detection in symbolic classical music using graph neural networks, 2022. arXiv:2208.14819.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] D. Jeong, T. Kwon, Y. Kim, J. Nam, Graph neural network for music score data and modeling expressive piano performance, in: International Conference on Machine Learning, PMLR, 2019, pp. 3060-3070.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, 2016. http://www.deeplearningbook.org.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, A. Lerchner, beta-VAE: Learning basic visual concepts with a constrained variational framework (2016).</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] A. Valenti, A. Carta, D. Bacciu, Learning style-aware symbolic music representations by adversarial autoencoders, arXiv preprint arXiv:2001.05494 (2020).</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>[27] C. Raffel, Learning-based methods for comparing sequences, with applications to audio-to-MIDI alignment and matching, Columbia University, 2016.</mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>[28] J. Ens, P. Pasquier, Building the MetaMIDI dataset: Linking symbolic and audio musical data, in: ISMIR, 2021, pp. 182-188.</mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>[29] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).</mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>[30] H.-W. Dong, W.-Y. Hsiao, L.-C. Yang, Y.-H. Yang, MuseGAN: Multi-track sequential generative adversarial networks for symbolic music generation and accompaniment, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.</mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>[31] Z. Wang, Y. Zhang, Y. Zhang, J. Jiang, R. Yang, J. Zhao, G. Xia, PianoTree VAE: Structured representation learning for polyphonic music, arXiv preprint arXiv:2008.07118 (2020).</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>