<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Emanuele Cosenza</string-name>
          <email>e.cosenza3@studenti.unipi.it</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Valenti</string-name>
          <email>andrea.valenti@phd.unipi.it</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Davide Bacciu</string-name>
          <email>davide.bacciu@unipi.it</email>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <abstract>
<p>Graphs can be leveraged to model polyphonic multitrack symbolic music, where notes, chords and entire sections may be linked at different levels of the musical hierarchy by tonal and rhythmic relationships. Nonetheless, there is a lack of works that consider graph representations in the context of deep learning systems for music generation. This paper bridges this gap by introducing a novel graph representation for music and a deep Variational Autoencoder that generates the structure and the content of musical graphs separately, one after the other, with a hierarchical architecture that matches the structural priors of music. By separating the structure and content of musical graphs, it is possible to condition generation by specifying which instruments are played at certain times. This opens the door to a new form of human-computer interaction in the context of music co-creation. After training the model on existing MIDI datasets, the experiments show that the model is able to generate appealing short and long musical sequences and to realistically interpolate between them, producing music that is tonally and rhythmically consistent. Finally, the visualization of the embeddings shows that the model is able to organize its latent space in accordance with known musical concepts.</p>
      </abstract>
      <kwd-group>
        <kwd>Symbolic music generation</kwd>
        <kwd>Variational Autoencoders</kwd>
        <kwd>Deep Graph Networks</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The automatic generation of artistic artifacts is gathering increasing interest, also thanks to the
possibilities offered by modern deep generative models [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ]. Despite these achievements, a
closer inspection is often enough to reveal that a piece of art is the outcome of an automatic artificial
process. While artificial models are very good at approximating the external appearance of artworks,
they still lack a way to convey an artistic message through the overall experience. This
results in artworks that are convincing but soulless, lacking general coherence and deeper
meaning. This is particularly true in the case of music, where the artist needs to be keenly aware
of the emotions evoked by a particular sequence of notes in order to stimulate a specific mood
in the listener.
      </p>
      <p>A way to circumvent the above issues is to treat deep learning models as a support to the
human artist, instead of as a replacement. The models can thus be used to automate
the low-level, routine sub-tasks of the creative process, while leaving the artist free to concentrate
on the overall picture.</p>
      <p>In this paper, we introduce a new model for the automatic generation of symbolic sequences
of multitrack, polyphonic music. The generation process is carried out through a
novel graph-based internal representation, which explicitly models the different chords
in a song and the relations between them. This representation allows the human artist to
make targeted changes to the output of the neural network in order to steer specific
aspects of the artistic performance, while leaving the model free to generate the remaining parts
in a coherent way.</p>
      <p>
        The main contributions of this paper are the following. First, we propose a novel graph
representation of multitrack, polyphonic music, where nodes represent the chords played by
different instruments and edges model the relationships between them. Second, we introduce a
deep Variational Autoencoder [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] that generates musical graphs by separating their rhythmic
structure and tonal content. To the best of our knowledge, this is the first time in the literature that
Deep Graph Networks [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] are used to generate multitrack, polyphonic music. Finally, we show
a new generative scenario enabled by our approach in which the user can intuitively condition
generation by specifying which instruments have to be played at specific timesteps.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>
        In recent years, there have been many attempts at generating symbolic music with deep learning
architectures such as LSTMs [
        <xref ref-type="bibr" rid="ref6 ref7 ref8">6, 7, 8</xref>
        ], Transformers [
        <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
        ] and CNNs [
        <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
        ], using
Variational Autoencoders (VAE) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], Generative Adversarial Networks (GAN) [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and Adversarial
Autoencoders (AAE) [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] as the main generative frameworks.
      </p>
      <p>
        One of the challenges in devising symbolic generators is choosing an appropriate
representation for music data. Researchers have therefore started to experiment with graph-based
representations, where musical entities and their relationships are modeled, respectively, by
nodes and edges. Musical graphs have been built at the note level [
        <xref ref-type="bibr" rid="ref15 ref16 ref17">15, 16, 17, 18</xref>
        ], associating
nodes to notes and edges to temporal or tonal relationships, as well as at a higher level of the
hierarchy, using melodic segments [19] and bars [20, 21] as building blocks.
      </p>
      <p>In the literature, there is a substantial lack of studies that consider graph representations
in the context of deep learning for symbolic music. The VAE-based performance renderer in
[20] and the cadence detector in [22] are, to the best of our knowledge, the only systems that
use Deep Graph Networks to process musical graphs. As for generation, the only
attempts at using graphs with deep learning are represented by PopMNet [20] and MELONS
[21]. Both works use graphs to condition the generation of monophonic music, which is carried
out by recurrent networks. In contrast to these works, our approach uses graphs at a lower level,
leveraging Deep Graph Networks to learn tonal and rhythmic representations in the context of
polyphonic multitrack music generation.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Graph-based Music Generation</title>
      <p>The proposed model processes 4/4 polyphonic, multitrack, fixed-length music sequences. Input
songs are assumed to be available as an N_b × N_t × N_s × N_p multitrack pianoroll binary tensor, where
N_b is the number of bars, N_t the number of tracks, N_s the number of timesteps in a bar and N_p the
number of possible pitches. The number of timesteps in a bar, N_s, is fixed to 32. The division of
sequences into bars is crucial since the model treats different bars separately. An example of a
multitrack pianoroll is shown in Figure 1a.</p>
      <sec id="sec-3-1">
        <title>3.1. Graph-based Music Representation</title>
        <p>We propose to represent polyphonic multitrack music by a chord-level graph G = (V, ℰ, A, X),
where V is the set of nodes, ℰ is the set of (multi-type) edges, A the set of edge features and X
the set of node features. An example of a chord-level graph is shown in Figure 1c.</p>
        <p>The structure s of G is represented by the sets V, A and ℰ. Each node v ∈ V corresponds to
the activation of a chord in a specific track and timestep. Notice that we use the term “chord”
loosely here to indicate any non-empty sequence of MIDI notes. We identify three types of
edges (u, v) ∈ ℰ: track edges, onset edges and next edges. Track edges connect nodes that
represent consecutive activations of a single track. Onset edges connect nodes that represent
simultaneous activations of different tracks. Finally, next edges connect nodes that represent
consecutive activations of different tracks in different timesteps. In order to model different
tracks, a separate track edge type is instantiated for each track. Each edge feature a_{u,v} ∈ A
contains the type of the edge (u, v) as well as the distance in timesteps between the two nodes.</p>
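        <p>To make the edge construction concrete, the following minimal sketch (our own illustration, not the released implementation; in particular, interpreting next edges as pointing to the nearest later active timestep is our assumption) derives the three edge types from a binary structure tensor:</p>
        <preformat>
# Illustrative sketch (not the authors' code): derive nodes and the three edge
# types of the chord-level graph from a binary structure tensor S of shape
# (n_bars, n_tracks, n_timesteps_per_bar).
import numpy as np

def build_edges(S):
    n_bars, n_tracks, n_steps = S.shape
    # One node per chord activation, identified by (track, global_timestep).
    nodes = [(t, b * n_steps + s)
             for b in range(n_bars) for t in range(n_tracks)
             for s in range(n_steps) if S[b, t, s]]
    edges = []  # entries: (u, v, edge_type, distance_in_timesteps)
    active = sorted({ts for _, ts in nodes})  # timesteps with any activation
    for t, ts in nodes:
        # Track edges: next activation of the same track (one type per track).
        later_same = [ts2 for t2, ts2 in nodes if t2 == t and ts2 > ts]
        if later_same:
            nxt = min(later_same)
            edges.append(((t, ts), (t, nxt), f'track_{t}', nxt - ts))
        # Onset edges: simultaneous activations of different tracks
        # (each pair appears in both directions).
        for t2, ts2 in nodes:
            if ts2 == ts and t2 != t:
                edges.append(((t, ts), (t2, ts2), 'onset', 0))
        # Next edges: activations of different tracks at the next active timestep.
        later = [ts2 for ts2 in active if ts2 > ts]
        if later:
            nxt = min(later)
            for t2, ts2 in nodes:
                if ts2 == nxt and t2 != t:
                    edges.append(((t, ts), (t2, ts2), 'next', nxt - ts))
    return nodes, edges

S = np.zeros((2, 4, 32), dtype=np.int8)  # 2 bars, 4 tracks, 32 timesteps per bar
S[0, 0, 0] = S[0, 1, 0] = S[0, 1, 16] = 1
print(build_edges(S)[1])
        </preformat>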
        <p>The content c of G is represented by the set of node features X. Node features x_v ∈ X contain
the list of notes played in correspondence of node v. The maximum number of notes in a chord,
Σ, is fixed a priori. Each note is represented as a feature vector of dimension d_n. The vector
contains information about pitch and duration, stored as a one-hot token pair. The pitch token
can assume 131 different values, which correspond to the 128 MIDI pitches with the addition of
SOS, EOS and PAD tokens. Similarly, the duration token can assume 99 different values, which
correspond to 96 different durations (yielding a maximum duration of 3 bars) with the addition
of SOS, EOS and PAD tokens.</p>
        <p>The structure of G is encoded by the tensor S ∈ {0, 1}^{N_b × N_t × N_s}. S_{b,t,s} = 1 if and only if there is
an activation of at least one note in track t at timestep s of the b-th bar. Intuitively, S_{b,t,s}
indicates whether track t is active (not counting the sustain of notes) at timestep s in the b-th
bar. An example of a structure tensor is shown in Figure 1b. The content of a chord-level graph,
on the other hand, can be encoded through a tensor X ∈ ℝ^{|V| × Σ × d_n} after fixing an ordering of V.</p>
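        <p>The following snippet sketches how such a structure tensor can be derived from a binary pianoroll tensor of note activations; variable names are illustrative:</p>
        <preformat>
# Illustrative sketch: the structure tensor S marks chord activations (not
# sustain). Assuming a binary pianoroll tensor R of shape (N_b, N_t, N_s, N_p)
# whose entries flag note activations, S simply collapses the pitch axis.
import numpy as np

N_b, N_t, N_s, N_p = 2, 4, 32, 128
R = np.random.rand(N_b, N_t, N_s, N_p) < 0.01  # toy random pianoroll
S = R.any(axis=-1).astype(np.uint8)            # shape (N_b, N_t, N_s)
assert S[0, 0, 0] == R[0, 0, 0].any()
        </preformat>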
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Deep Graph Network for Music</title>
        <p>
          Our graph-based representation of music is processed by a deep VAE [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] that reconstructs
the structure s and the content c of a chord-level graph G = (s, c). Its encoder models
the encoding distribution q_φ(z | s, c), where z ∈ ℝ^d. The decoder network, on the other hand,
models p_θ(s, c | z). After introducing the latent variables z_s ∈ ℝ^d and z_c ∈ ℝ^d, the generative
process can be formalized as follows:
p_θ(s, c, z_s, z_c | z) = p_θ(z_s | z) p_θ(z_c | z) p_θ(s | z_s) p_θ(c | z_c, s).   (1)
        </p>
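        <p>Read operationally, Equation 1 corresponds to a two-stage sampling procedure, sketched below in PyTorch-flavoured pseudocode; the decoder submodules are placeholders of our own, not the actual API of the implementation, and binarizing the structure by thresholding is just one simple choice:</p>
        <preformat>
# Schematic rendering of the generative process in Equation 1 (module names
# such as decoder.split/structure/content are hypothetical placeholders).
import torch

def generate(decoder, d=512):
    z = torch.randn(1, d)             # z sampled from the prior N(0, I)
    z_s, z_c = decoder.split(z)       # linear layer: z -> (z_s, z_c)
    S_tilde = decoder.structure(z_s)  # p(s | z_s): probabilistic structure
    s = (S_tilde > 0.5).float()       # binarize the structure tensor
    c = decoder.content(z_c, s)       # p(c | z_c, s): notes for each node
    return s, c
        </preformat>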
        <p>
          A high-level representation of the model is shown in Figure 2. The encoder consists of
two separate submodules, namely a content encoder and a structure encoder which output,
respectively, the codes   and   . The two codes are finally combined into a graph code   with
a linear layer. The decoder, on the other hand, generates the structure  and the content  of
 one after the other. First, symmetrically to the encoder, it decomposes  into two separate
latent vectors   and   through a linear layer. Then, it generates  from   through a structure
decoder and the content  from  and   through a deep graph content decoder. The content
and the structure decoder model, respectively, the distributions   ( |  ) and   ( | ,  ).
Content Encoder. In the content encoder (Figure 3a) each note is first embedded into a
 -dimensional space with a linear note encoder. Next, a linear chord encoder processes the list
of notes associated to each node, producing  -dimensional chord representations. These chord
representations are processed by an encoder Graph Convolutional Network (GCN) [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] with 
layers. We refer the reader to the supplementary material1 for details about the implementation
1https://emanuelecosenza.github.io/polyphemus/assets/suppmaterials.pdf
of the GCN. After  graph convolutional layers, a soft attention readout layer similar to that
in [23] aggregates the information contained in each subgraph   of  related to the  -th bar
passed through a linear bar compressor to obtain the final content representation   .
of the musical sequence, producing bar embeddings  1 , … ,   . The bar embeddings are finally

Structure Encoder.
        </p>
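        <p>A minimal sketch of what such a soft attention readout can look like is given below; this is our own illustration in the spirit of [23], not the exact layer from the implementation:</p>
        <preformat>
# Sketch of a soft attention readout over the node states of a bar subgraph.
import torch
import torch.nn as nn

class AttentionReadout(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.gate = nn.Linear(d, 1)  # scores each node state
        self.proj = nn.Linear(d, d)

    def forward(self, h):            # h: (n_nodes_in_bar, d)
        alpha = torch.softmax(self.gate(h), dim=0)  # attention weights
        return (alpha * self.proj(h)).sum(dim=0)    # bar embedding of shape (d,)

readout = AttentionReadout(512)
bar_emb = readout(torch.randn(7, 512))  # e.g. a bar subgraph with 7 nodes
        </preformat>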
        <p>Structure Encoder. The structure encoder (Figure 3b) takes as input the structure tensor
S ∈ {0, 1}^{N_b × N_t × N_s} and computes the code z_s. This module first encodes each bar S_b ∈ {0, 1}^{N_t × N_s} into a latent
representation s_b ∈ ℝ^d through a CNN [24] made of two convolutional layers with ReLU
activations and Batch Normalization, interleaved by max pooling. The bar representations s_1, …, s_{N_b}
are then computed by passing the signal through two dense layers. These representations are
finally concatenated and passed through a linear layer to obtain z_s.</p>
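        <p>A plausible PyTorch rendering of the bar encoder is sketched below; kernel sizes and channel counts are our assumptions, chosen only to make the tensor shapes concrete:</p>
        <preformat>
# Sketch of the bar encoder: two conv layers with ReLU and Batch Normalization,
# interleaved by max pooling, followed by two dense layers.
import torch
import torch.nn as nn

bar_encoder = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.BatchNorm2d(32), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(32 * 2 * 16, 512), nn.ReLU(),  # first dense layer
    nn.Linear(512, 512),                     # second dense layer -> s_b
)

S_b = bar_encoder(torch.randn(1, 1, 4, 32))  # one bar: (batch, 1, N_t=4, N_s=32)
        </preformat>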
        <p>Structure Decoder. The structure decoder (see Figure 4b) is specular to the structure encoder.
It first decompresses z_s into N_b structure bar representations s_1, …, s_{N_b} and decodes each of them
with a bar decoder. The bar decoder mirrors the bar encoder,
with the difference that upsample layers are interleaved with convolutional layers to obtain the
original resolution of the pianoroll. Finally, a sigmoid layer produces probability values which
are stacked to form the probabilistic structure tensor S̃.</p>
        <p>Content Decoder. The content decoder reconstructs the content of G from z_c and s. It first
decompresses z_c into c_1, …, c_{N_b} ∈ ℝ^d. Each c_b is used to initialize the states of the nodes in
the subgraph G_b, which represents the connected component related to the b-th bar of the
structure s. From there, a GCN identical to the one employed in the encoder computes the
final node state h_v ∈ ℝ^d for each node v. At this point, a linear chord decoder transforms each
h_v into the corresponding Σ note representations of dimension d. Such note
representations are decoded and passed through a softmax layer, which outputs two separate
probability distributions over pitches and durations, yielding the probabilistic tensors P̃ and D̃,
which contain, respectively, pitch and duration probabilities.</p>
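        <p>The chord decoding head can be sketched as follows; layer names and the toy value of Σ are illustrative assumptions, not the released implementation:</p>
        <preformat>
# Sketch of the chord decoding head: each final node state is mapped to Sigma
# note slots, and each slot to pitch/duration distributions.
import torch
import torch.nn as nn

d, n_notes, n_pitch, n_dur = 512, 12, 131, 99  # Sigma = 12 is a toy choice

chord_decoder = nn.Linear(d, n_notes * d)  # h_v -> Sigma note representations
pitch_head = nn.Linear(d, n_pitch)         # logits over 131 pitch tokens
dur_head = nn.Linear(d, n_dur)             # logits over 99 duration tokens

h_v = torch.randn(1, d)
notes = chord_decoder(h_v).view(n_notes, d)
P_tilde = torch.softmax(pitch_head(notes), dim=-1)  # (Sigma, 131)
D_tilde = torch.softmax(dur_head(notes), dim=-1)    # (Sigma, 99)
        </preformat>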
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Training</title>
        <p>The model is trained to minimize the following loss:
ℒ(θ, φ) = 𝔼[−log p_θ(G | z)] + β KL(q_φ(z | G) || 𝒩(0, I)),   (2)
where KL(·||·) is the KL divergence and the expectation is taken with respect to z ∼ q_φ(z | G).
Following the β-VAE framework [25], the hyperparameter β controls the trade-off between
reconstruction accuracy and latent space regularization.</p>
        <p>Since the generative process is divided in two parts, the log-likelihood term in Equation 2
can be decomposed as follows:
log p_θ(G | z) = log ( p_θ(s | z_s) p_θ(c | z_c, s) ) = log p_θ(s | z_s) + log p_θ(c | z_c, s).   (3)
The first term in Equation 3 can be derived in the following way:
log p_θ(s | z_s) = ∑_{b,t,s} [ S_{b,t,s} log S̃_{b,t,s} + (1 − S_{b,t,s}) log(1 − S̃_{b,t,s}) ],   (4)
where independence is assumed between variables.</p>
        <p>Computing the content log-likelihood in Equation 3 is trickier, since the structure generated
by the structure decoder may be different from the real one. We circumvent this problem by
using a form of teacher forcing, where the content is obtained by filling the real structure in
place of the one generated by the structure decoder. In this way, the following likelihood can
always be computed:
log p_θ(c | z_c, s) = ∑_{v ∈ V} ∑_{n=1}^{Σ} [ P_{v,n} · log P̃_{v,n} + D_{v,n} · log D̃_{v,n} ],   (5)
where P and D are tensors containing, respectively, the real one-hot pitch and duration tokens,
while P̃ and D̃ represent their probabilistic reconstructions. Independence is assumed between
all pitch and duration variables.</p>
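        <p>Putting Equations 2-5 together, the training objective can be sketched as follows, with teacher forcing on the structure (the real S is what the content decoder sees); this is a minimal sketch under the stated independence assumptions, not the exact training code:</p>
        <preformat>
# Sketch of the training loss of Equations 2-5 for a single graph.
import torch
import torch.nn.functional as F

def loss(S, S_tilde, P, P_tilde, D, D_tilde, mu, logvar, beta=0.01):
    # Equation 4: Bernoulli log-likelihood of the structure tensor.
    struct_nll = F.binary_cross_entropy(S_tilde, S, reduction='sum')
    # Equation 5: categorical log-likelihoods of pitch and duration tokens
    # (a small epsilon guards the logarithm).
    content_nll = -(P * torch.log(P_tilde + 1e-9)).sum() \
                  - (D * torch.log(D_tilde + 1e-9)).sum()
    # KL term of Equation 2 for a diagonal Gaussian posterior q(z | G).
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum()
    return struct_nll + content_nll + beta * kl
        </preformat>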
        <sec id="sec-3-3-1">
          <title>Datasets</title>
        </sec>
        <sec id="sec-3-3-2">
          <title>2 Bars</title>
          <p>LMD-matched 6,813,946
MetaMIDI Dataset 11,076,635</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>
        Following [
        <xref ref-type="bibr" rid="ref10 ref8">8, 26, 10</xref>
        ], we experiment on short and long sequences of MIDI music. The
experiments probe the generative capabilities of the model comparing, whenever possible, to state of
the art approaches. We refer the reader to the source code2 and the additional material3, which
contains the audio samples produced in the experimental phase.
      </p>
      <sec id="sec-4-1">
        <title>4.1. Data and Experimental Setup</title>
        <p>
          We use the ‘LMD-matched’ version of the Lakh MIDI Dataset [27], which contains a total of
45,129 MIDI songs. We also consider the more challenging MetaMIDI Dataset (MMD) [28], an
unexplored large-scale MIDI collection totalling 436,631 songs. For each dataset, we obtain two
new datasets containing 2-bar and 16-bar sequences represented as chord-level graphs. The
preprocessing pipeline is similar to that in [
          <xref ref-type="bibr" rid="ref10 ref8">8, 26, 10</xref>
          ]. The details about preprocessing can be
found in the supplementary material. Each preprocessed sequence is composed of 4 tracks: a
drum track, a bass track, a guitar/piano track and a strings track. The sizes of the resulting
datasets are shown in Table 1.
        </p>
        <p>The experiments focus on two versions of the model, one for 2-bar sequences and one for
16-bar sequences. We use a 70/10/20 train/validation/test split for both. The number of layers K of the GCNs is
fixed to 8. The value of d is set to 512. Adam [29] is used as the optimizer for both models, setting
1e-4 and 5e-5 as initial learning rates for the 2-bar and the 16-bar models. The learning rates
are decayed exponentially after 8000 gradient updates with a decay factor of 1 − 5e-6. The
hyperparameter β is annealed from 0 to 0.01 during training.</p>
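        <p>A sketch of this optimization setup follows; it reflects our reading of the schedule, and model, loader and compute_loss are placeholders:</p>
        <preformat>
# Sketch of the optimizer and learning-rate schedule described above.
import torch

def train(model, loader, compute_loss):
    # Adam with the 2-bar learning rate; 5e-5 is used for the 16-bar model.
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    # Exponential decay with factor 1 - 5e-6 per update.
    sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=1 - 5e-6)
    for step, batch in enumerate(loader):
        opt.zero_grad()
        loss = compute_loss(batch)  # Equations 2-5, beta annealed 0 -> 0.01
        loss.backward()
        opt.step()
        if step >= 8000:            # decay starts after 8000 gradient updates
            sched.step()
        </preformat>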
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Generation</title>
        <p>
          The first set of experiments concerns the analysis of sequences generated from random codes
z. In the manual qualitative analysis4, both the 2-bar and the 16-bar models appear to be
particularly consistent, producing reasonable chord progressions, melodic segments and drum
patterns. To provide a more quantitative assessment, following previous works [
          <xref ref-type="bibr" rid="ref10">30, 10</xref>
          ], we
measure the generative ability of the trained models by computing the following metrics on
20,000 generated sequences:
        </p>
        <p>• EB (Empty Bars): ratio of empty bars (in %).
• UPC (Used Pitch Classes): number of used pitch classes (out of 12) per bar.
• DP (Drum Patterns): ratio of notes in 16-beat patterns, which are common in popular music (in %).
2https://github.com/EmanueleCosenza/polyphemus
3https://emanuelecosenza.github.io/polyphemus/
4Audio samples of generated 2-bar and 16-bar sequences can be found here: https://emanuelecosenza.github.io/polyphemus/generation.html</p>
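        <p>The first two metrics are straightforward to compute from binary pianorolls, as in the following hedged sketch (DP depends on the drum-pattern definition of [30] and is not reproduced here):</p>
        <preformat>
# Sketch of EB and UPC for a batch of binary pianorolls R of shape
# (n_seqs, n_bars, n_tracks, n_steps, 128).
import numpy as np

def empty_bars(R):
    # Ratio (in %) of track-bars with no activation at all.
    return (~R.any(axis=(3, 4))).mean() * 100

def used_pitch_classes(R):
    used = R.any(axis=3)                 # (seqs, bars, tracks, 128)
    pc = np.zeros(used.shape[:-1] + (12,), dtype=bool)
    for p in range(used.shape[-1]):
        pc[..., p % 12] |= used[..., p]  # fold MIDI pitch p into class p % 12
    return pc.sum(axis=-1).mean()        # average per bar and track

R = np.random.rand(10, 2, 4, 32, 128) < 0.005
print(empty_bars(R), used_pitch_classes(R))
        </preformat>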
        <p>Table 2. Generation metrics of the proposed model, Calliope and the jamming, composer and hybrid versions of
MuseGAN, for models trained on LMD-matched and the MetaMIDI Dataset (EB: empty bars (%), UPC: number of used pitch classes, DP: drum patterns (%); D: drums, B:
bass, G/P: guitar/piano, S: strings).</p>
        <p>
          Table 2 compares our models with Calliope [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] and the jamming, composer and hybrid versions of MuseGAN [30]. We also include metrics for the models trained on the
MetaMIDI Dataset with the goal of stimulating research on larger MIDI collections. The EB
values are never equal to zero, which indicates that there are no issues with holes in the latent
space and that the models do not ignore the latent codes during decoding. The UPC values
are consistently low, indicating that the models have learned to stick to specific tonalities in
the context of single bars. Additionally, the DP values for the proposed model are the highest,
confirming its consistency on the rhythmic level. These results further validate the proposed
methodology and confirm the rhythmic and tonal coherence of the model.
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Structure-conditioned Generation</title>
        <p>The separation of structure and content in our approach allows for the replacement of the
generated structure tensor S with a new tensor Ŝ during the decoding process. This new tensor
can be modified in a similar fashion to pianoroll editing in Digital Audio Workstations (DAW).
For instance, the user can specify that a certain instrument should only be played at a specific
time in the sequence by filling the desired positions in the binary activation grid. To show
this, we operate as follows, focusing on the 2-bar model trained on the Lakh MIDI Dataset.
We start by sampling a random latent code z, from which we obtain the two representations
z_s and z_c. We then let the structure decoder produce the corresponding structure tensor S
from z_s. At this point, we modify S to our liking, obtaining a new structure tensor Ŝ. This
corresponds to adding or removing nodes from the chord-level graph being generated. Finally,
we let the content decoder compute two separate content tensors X and X̂, corresponding to
two final music sequences. For our purposes, the content decoder should be robust to changes
in the structure, replicating the same musical content represented by z_c. When listening to the
audio samples generated in this way, the model appears to be able to preserve the rhythmic
and tonal features of the original sequence, rearranging the musical content while abiding by
the imposed structure. As an example, Figure 5a shows a generated structure tensor S. The
resulting sequence contains a recognizable I-IV progression in the key of B, supported by 8-beat
bass and drum patterns5. We edit the tensor by making the drums sparser, keeping only the
nodes at the start of each beat, and by making the strings more active, adding new nodes at the
start of beats. This yields a new structure tensor Ŝ, which is shown in Figure 5b. The resulting
music produced by the content decoder maintains the same harmonic progression of the original
sequence. The bass and guitar tracks remain essentially unaltered, with only very slight variations. Finally, the
strings play a new melodic line in the right key, while the drums play a steady 4-beat hi-hat
pattern. Overall, this shows that the content decoder can adapt to new structures specified by
the user, opening the door to a new form of human-computer music co-creation.</p>
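        <p>Schematically, the procedure can be summarized as follows; decoder stands for the trained decoder with the same placeholder submodules used in the earlier sketch, and the edit shown sparsifies the drum track to beat starts (8 timesteps per beat in a 32-timestep 4/4 bar):</p>
        <preformat>
# Sketch of structure-conditioned generation: edit the structure tensor and
# let the content decoder adapt to it.
import torch

def condition_on_structure(decoder, drums=0):
    z = torch.randn(1, 512)
    z_s, z_c = decoder.split(z)
    S = (decoder.structure(z_s) > 0.5).float()  # (1, N_b, N_t, N_s)
    S_hat = S.clone()
    S_hat[:, :, drums, :] = 0.0                 # clear the drum track...
    S_hat[:, :, drums, ::8] = 1.0               # ...keeping only beat starts
    X = decoder.content(z_c, S)                 # original content
    X_hat = decoder.content(z_c, S_hat)         # content adapted to the edit
    return X, X_hat
        </preformat>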
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Embedding Visualization</title>
        <p>Similarly to [31], we explore the pitch, duration and chord embeddings by visualizing their
principal components, focusing on the encoder network of the 2-bar model trained on the Lakh
MIDI Dataset. Figure 6a shows the PCA projection in 3D space of all the 128 pitch embeddings.
Pitch projections follow a circular path along the clockwise direction, suggesting that the
model has learned the tonal relationships between different pitches. Figure 6b shows a 3D
PCA projection of chord embeddings considering every major chord obtained by picking as
roots the notes between C1 and B8. Durations are fixed to 1 beat. Similarly to what happens
for pitches, chord embeddings follow a circular path in the space and form clusters related to
specific octaves.</p>
        <p>Figure 7 shows the PCA projections in 2D space of duration embeddings considering,
respectively, all the possible 96 durations (i.e. up to three bars) and the first 32 durations (i.e. up to a
bar). In the first case (Figure 7a), two distinct clusters contain, respectively, durations above
64 (i.e. above 2 bars) and durations below 64 (i.e. below 2 bars). In the second plot (Figure 7b),
three clusters can be identified with, respectively, durations below 16 (i.e. below 2 beats, left
of the plot), durations between 16 and 24 (i.e. between 2 and 3 beats, upper-right of the plot)
and durations above 24 (i.e. between 3 beats and a bar). The plots suggest that the model has
learned to organize its duration space in accordance with the rhythmic concepts of beats and bars.
5This and other examples related to conditioned generation can be found here: https://emanuelecosenza.github.io/polyphemus/conditioned-generation.html</p>
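        <p>The projections can be reproduced with standard PCA once the embedding matrices have been extracted from the trained encoder; a sketch with a random stand-in matrix:</p>
        <preformat>
# Sketch of the embedding inspection: project the 128 pitch embeddings with
# PCA (here a random matrix stands in for the trained embeddings).
import numpy as np
from sklearn.decomposition import PCA

def project_pitch_embeddings(pitch_embeddings):
    # pitch_embeddings: array of shape (128, d), one row per MIDI pitch.
    return PCA(n_components=3).fit_transform(pitch_embeddings)  # (128, 3)

coords = project_pitch_embeddings(np.random.randn(128, 512))
        </preformat>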
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>In this work, we introduced a new graph representation for polyphonic multitrack music and a
model that generates musical graphs by separating their structure and content. As seen in the
qualitative analysis and the comparison with the state of the art, our approach proves
beneficial with regard to the rhythmic and tonal consistency of the generated music.
Through manual experiments, we showed that our methodology enables a generative scenario
where users can specify the activity of particular instruments in a music sequence. Finally, we
further validated our work by visualizing the pitch, chord and duration embeddings learned by
the model. In each case, the embedding spaces are organized in accordance with known tonal
and rhythmic concepts. To conclude, we believe that the model has the potential to support
human-computer co-creation, and it will be interesting to find possible applications of our
methodology in modern software audio tools.
</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nichol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Hierarchical text-conditional image generation with clip latents</article-title>
          ,
          <source>arXiv preprint arXiv:2204.06125</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neelakantan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shyam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          , et al.,
          <article-title>Language models are few-shot learners</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Agostinelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. I.</given-names>
            <surname>Denk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Borsos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Engel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Verzetti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Caillon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jansen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tagliasacchi</surname>
          </string-name>
          , et al.,
          <article-title>Musiclm: Generating music from text</article-title>
          ,
          <source>arXiv preprint arXiv:2301.11325</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Welling</surname>
          </string-name>
          ,
          <article-title>Auto-encoding variational bayes</article-title>
          ,
          <source>arXiv preprint arXiv:1312.6114</source>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Bacciu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Errica</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Micheli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Podda</surname>
          </string-name>
          ,
          <article-title>A gentle introduction to deep learning for graphs</article-title>
          ,
          <source>Neural Networks</source>
          <volume>129</volume>
          (
          <year>2020</year>
          )
          <fpage>203</fpage>
          -
          <lpage>221</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>H.</given-names>
            <surname>Chu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Urtasun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Fidler</surname>
          </string-name>
          ,
          <article-title>Song from pi: A musically plausible network for pop music generation</article-title>
          ,
          <source>arXiv preprint arXiv:1611.03477</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>G.</given-names>
            <surname>Brunner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wattenhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wiesendanger</surname>
          </string-name>
          ,
          <article-title>Jambot: Music theory aware chord based generation of polyphonic music with lstms</article-title>
          ,
          <source>in: 2017 IEEE 29th International Conference on Tools with Artificial Intelligence (ICTAI)</source>
          , IEEE,
          <year>2017</year>
          , pp.
          <fpage>519</fpage>
          -
          <lpage>526</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Engel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Raffel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hawthorne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Eck</surname>
          </string-name>
          ,
          <article-title>A hierarchical latent vector model for learning long-term structure in music</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>4364</fpage>
          -
          <lpage>4373</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9] C.-Z. A. Huang, A. Vaswani, J. Uszkoreit, N. Shazeer, I. Simon, C. Hawthorne, A. M. Dai, M. D. Hoffman, M. Dinculescu, D. Eck, Music transformer, arXiv preprint arXiv:1809.04281 (2018).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Valenti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Berti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bacciu</surname>
          </string-name>
          ,
          <article-title>Calliope-a polyphonic music transformer</article-title>
          ,
          <source>arXiv preprint arXiv:2107.05546</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11] C.-H. Chuan, D. Herremans, Modeling temporal tonal relations in polyphonic music through deep networks with a novel image-based representation, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12] C.-Z. A. Huang, T. Cooijmans, A. Roberts, A. Courville, D. Eck, Counterpoint by convolution, arXiv preprint arXiv:1903.07227 (2019).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>I.</given-names>
            <surname>Goodfellow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pouget-Abadie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mirza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Warde-Farley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ozair</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Courville</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          , Generative adversarial nets,
          <source>Advances in neural information processing systems</source>
          <volume>27</volume>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Makhzani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shlens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Jaitly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Goodfellow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Frey</surname>
          </string-name>
          , Adversarial autoencoders,
          <source>arXiv preprint arXiv:1511.05644</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>X. F.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. T.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Small</surname>
          </string-name>
          ,
          <article-title>Complex network structure of musical compositions: Algorithmic generation of appealing music</article-title>
          ,
          <source>Physica A: Statistical Mechanics and its Applications</source>
          <volume>389</volume>
          (
          <year>2010</year>
          )
          <fpage>126</fpage>
          -
          <lpage>132</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ferretti</surname>
          </string-name>
          ,
          <article-title>On the complex network structure of musical pieces: analysis of some use cases from different music genres</article-title>
          ,
          <source>Multimedia Tools and Applications</source>
          <volume>77</volume>
          (
          <year>2018</year>
          )
          <fpage>16003</fpage>
          -
          <lpage>16029</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ferretti</surname>
          </string-name>
          ,
          <article-title>On the modeling of musical solos as complex networks</article-title>
          ,
          <source>Information Sciences 375</source>
          (
          <year>2017</year>
          )
          <fpage>271</fpage>
          -
          <lpage>295</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] R. Mellon, D. Spaeth, E. Theis, Genre classification using graph representations of music (2014).</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] F. Simonetta, F. Carnovalini, N. Orio, A. Rodà, Symbolic music similarity through a graph-based representation, in: Proceedings of the Audio Mostly 2018 on Sound in Immersion and Emotion, 2018, pp. 1-7.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] J. Wu, X. Liu, X. Hu, J. Zhu, PopMNet: Generating structured pop music melodies using neural networks, Artificial Intelligence 286 (2020) 103303.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] Y. Zou, P. Zou, Y. Zhao, K. Zhang, R. Zhang, X. Wang, MELONS: generating melody with long-term structure using transformers and structure graph, arXiv preprint arXiv:2110.05020 (2021).</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] E. Karystinaios, G. Widmer, Cadence detection in symbolic classical music using graph neural networks, 2022. arXiv:2208.14819.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] D. Jeong, T. Kwon, Y. Kim, J. Nam, Graph neural network for music score data and modeling expressive piano performance, in: International Conference on Machine Learning, PMLR, 2019, pp. 3060-3070.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, 2016. http://www.deeplearningbook.org.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, A. Lerchner, beta-VAE: Learning basic visual concepts with a constrained variational framework (2016).</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] A. Valenti, A. Carta, D. Bacciu, Learning style-aware symbolic music representations by adversarial autoencoders, arXiv preprint arXiv:2001.05494 (2020).</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>[27] C. Raffel, Learning-based methods for comparing sequences, with applications to audio-to-MIDI alignment and matching, Columbia University, 2016.</mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>[28] J. Ens, P. Pasquier, Building the MetaMIDI dataset: Linking symbolic and audio musical data, in: ISMIR, 2021, pp. 182-188.</mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>[29] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).</mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>[30] H.-W. Dong, W.-Y. Hsiao, L.-C. Yang, Y.-H. Yang, MuseGAN: Multi-track sequential generative adversarial networks for symbolic music generation and accompaniment, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.</mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>[31] Z. Wang, Y. Zhang, Y. Zhang, J. Jiang, R. Yang, J. Zhao, G. Xia, PianoTree VAE: Structured representation learning for polyphonic music, arXiv preprint arXiv:2008.07118 (2020).</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>