Latent Chords: Generative Piano Chord Synthesis with Variational Autoencoders

Agustín Macaya, Rodrigo F. Cádiz, Manuel Cartagena, Denis Parra∗
{aamacaya,rcadiz,micartagena}@uc.cl, dparra@ing.puc.cl
Pontificia Universidad Católica de Chile
Santiago, Chile

ABSTRACT
Advances in recent years in neural generative models such as GANs and VAEs have unveiled great potential for creative applications supported by artificial intelligence methods. The best-known applications have occurred in areas such as image synthesis for face generation, as well as in natural language generation. In terms of tools for music composition, several systems have been released in recent years, but there is still room to improve the possibilities of music co-creation with neural generative tools. In this context, we introduce Latent Chords, a system based on a Variational Autoencoder architecture which learns a latent space by reconstructing piano chords. We provide details of the neural architecture and the training process, and we show how Latent Chords can be used for a controllable exploration of chord sounds, as well as to generate new chords by manipulating the latent representation. We make our training dataset, code and sound examples open and available at https://github.com/CreativAI-UC/TimbreNet

CCS CONCEPTS
• Applied computing → Sound and music computing; • Computing methodologies → Neural networks.

KEYWORDS
Generative Models, Variational Autoencoders, Piano Chords, Music Composition
ACM Reference Format:
Agustín Macaya, Rodrigo F. Cádiz, Manuel Cartagena, Denis Parra. 2020. Latent Chords: Generative Piano Chord Synthesis with Variational Autoencoders. In IUI ’20 Workshops, March 17, 2020, Cagliari, Italy. ACM, New York, NY, USA, 4 pages.

∗ Also with IMFD.
¹ https://magenta.tensorflow.org/

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 INTRODUCTION
The promise of Deep Learning (DL) is to discover rich and hierarchical models that represent probability distributions over data encountered in artificial intelligence applications, such as natural images or audio [6]. This potential of DL, when carefully analyzed, makes music an ideal application domain, being in essence very rich, structured and hierarchical information, encoded either in a symbolic score format or as audio waveforms.

It is no surprise, then, that the spectacular growth of DL has also greatly impacted the world of the arts. The classical tasks addressed through DL have to do with classification and the estimation of numerical quantities, but perhaps one of the most interesting things these networks can now do is generate content. In particular, there are network architectures capable of generating images, text or artistic content such as paintings or music [2]. Different authors have designed and studied networks capable of classifying music, recommending new music, or learning the style of a visual work, among other things. Perhaps one of the most relevant and recognized efforts at present is the Magenta project¹, carried out by Google Brain, one of the branches of the company in charge of using AI in its processes. According to their website, the goal of Magenta is to explore the role of machine learning as a tool in the creative process.

DL models have proven useful even in very difficult computational tasks, such as solving reconstruction, deconvolution and inverse problems with increasing accuracy over time [6, 12]. However, this great capacity of neural networks for classification and regression is not what interests us the most. It has been shown that deep learning models can now generate very realistic visual or audible content, fooling even the most expert humans. In particular, variational autoencoders (VAEs) and generative adversarial networks (GANs) have produced striking results in the last couple of years, as we discuss below.

One of the most important motivations for using DL to generate musical content is its generality. As [2] emphasize: “As opposed to handcrafted models, such as grammar-based or rule-based music generation systems, a machine learning-based generation system can be agnostic, as it learns a model from an arbitrary corpus of music. As a result, the same system may be used for various musical genres. Therefore, as more large scale musical datasets are made available, a machine learning-based generation system will be able to automatically learn a musical style from a corpus and to generate new musical content”. In summary, as opposed to structured representations like rules and grammars, DL excels at processing raw unstructured data, from which its hierarchy of layers extracts higher-level representations adapted to the task. We believe that these capacities make DL a very interesting technique to explore for the generation of novel musical content. Among all the potential tasks in music generation and composition which can be supported by DL models, in this work we focus on chord synthesis. In particular, we leverage Variational Autoencoders in order to learn a compressed latent space which allows controlled exploration of piano chords as well as the generation of new chords unobserved in the training dataset.


The contributions of this work are the following. First, we constructed a dataset of 450 chords recorded on the piano at different levels of dynamics and pitch ranges (octaves). Second, we designed a VAE which is very similar in architecture to the one described in GanSynth [5], the difference being that they use a GAN while we implemented a VAE. We chose a VAE architecture to decrease the chance of problems such as training convergence and mode collapse present in GANs [11, 13]. Third, we trained our model in such a way as to obtain a two-dimensional latent space that could adequately represent all the information contained in the dataset. Fourth, we explored this latent space in order to study how the different families of chords were represented and how both dynamic and pitch content operate on this space. Finally, we explored the generation of both new chords and harmonic trajectories by sampling points in this latent space.

2 RELATED WORK
Generative models have been extensively used for musical analysis and retrieval. We now discuss a few of the most relevant works with generative models for music from the last couple of years, to give an idea of the variety of applications that these techniques offer.

In terms of content generation, there are many interesting recent works. One of them is DeepBach [7], a neural network that is capable of harmonizing Bach-style chorales in a very convincing way. MidiNet [21] is a convolutional generative adversarial network that generates melodies in symbolic format (MIDI) from white noise. MuseGAN [4] is a generative adversarial network for symbolic music and accompaniment, specifically targeted at the rock genre. Wavenet [14] is a network that renders audio waveforms directly, without going through any kind of musical representation; it has been tested on the human voice and speech. NSynth [5] is a kind of timbre interpolation system that can create new types of very convincing and expressive sounds by morphing between different sound sources. In [19], the authors introduced a DL technique to autonomously generate novel melodies that are variations of an arbitrary base melody. They designed a neural network model that ensures, with high probability, that the melodic and rhythmic structure of the new melody is consistent with a given set of sample songs. One important aspect of this work is that they propose to use Perlin noise instead of the widely used white noise in VAEs. [20] proposed a DL architecture called Variational Recurrent Autoencoder Supported by History (VRASH), which uses previous outputs as additional inputs. The authors claim that this model listens to the notes that it has composed already and uses them as additional “historic” input. In [16] the authors applied VAE techniques to the generation of musical sequences at various measure scales. In a further development of this work, the authors created MusicVAE [17], a network with an autoencoding structure that is capable of generating latent spaces through which it is possible to generate audio and music content via interpolation.

Generative models have also been used for music transcription problems. In [18], the authors designed generative long short-term memory (LSTM) networks for music transcription modelling and composition. Their aim is to develop transcription models of music that can be of help in musical composition situations. For the specific case of chords, there is quite a large body of research devoted to chord recognition (some notable examples are [3, 9, 12, 22]), but much less work has been devoted to chord generation. Our work is based on GanSynth [5], a GAN model that can generate an entire audio clip from a single latent vector, allowing for smooth control of features such as pitch and timbre. Our model, as we specify below, works in a similar fashion, but it was customized for the specific case of chord sequences.

Figure 1: Architecture of our VAE model for chord synthesis.

3 NETWORK ARCHITECTURE
The network architecture is presented in Figure 1. Our design goal was not only content generation and latent space exploration, but also to produce a tool useful for musical composition. A VAE-based model has the advantage over a GAN model of having an encoder network that can accept inputs from the user and a decoder network that can generate new outputs. Although it is possible to replicate these features with a conditional GAN, we prefer using a VAE, since GANs have known problems of training convergence and mode collapse [11, 13] that we prefer to avoid at this early stage of our project. Still, we based the encoder architecture on the discriminator of GanSynth [5] and the decoder architecture on the generator of GanSynth.

The encoder takes a (128,1024,2) MFCC (Mel Frequency Cepstral Coefficients) image and passes it through one conv2D layer with 32 filters, generating a (128,1024,32) output. This then passes through a series of two conv2D layers with “same” padding and a Leaky ReLU non-linear activation function, followed by 2x2 downsampling layers. This process keeps halving the image size and doubling the number of channels until a (2,16,256) layer is obtained. Then, a fully connected layer outputs a (4,1) vector that contains the two means and the two standard deviations for later sampling.

The sampling process takes a (2,1) mean vector and a (2,2) diagonal standard deviation matrix, and using those parameters we sample a (2,1) latent vector z from a normal distribution.

The decoding process takes the (2,1) latent vector z and passes it through a fully connected layer that generates a (2,16,256) output. This is followed by a series of two transposed conv2D layers followed by a 2x2 upsampling layer, which keeps doubling the size of the image and halving the number of channels until a (128,1024,32) output is obtained. This output passes through a last convolutional layer that outputs the (128,1024,2) MFCC spectral representation of the generated audio. Inverse MFCC and STFT are then used to reconstruct a 4-second audio signal.
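To make these dimensions concrete, the following is a minimal TensorFlow 2 (Keras) sketch of an encoder, sampling step and decoder consistent with the shapes described above. This is our illustrative reconstruction rather than the authors' released code (which is available in the project repository); in particular, the 3x3 kernels, the channel schedule capped at 256 filters, and parameterizing the Gaussian with log-variances instead of raw standard deviations are assumptions on our part.

    import tensorflow as tf
    from tensorflow.keras import layers

    LATENT_DIM = 2  # two-dimensional latent space, as in the paper

    def build_encoder():
        """(128, 1024, 2) spectral image -> (mean, log-variance), each of size 2."""
        inp = tf.keras.Input(shape=(128, 1024, 2))
        x = layers.Conv2D(32, 3, padding="same")(inp)       # -> (128, 1024, 32)
        for f in [64, 128, 256, 256, 256, 256]:             # assumed channel schedule
            x = layers.Conv2D(f, 3, padding="same")(x)
            x = layers.LeakyReLU(0.2)(x)
            x = layers.AveragePooling2D(2)(x)               # 2x2 downsampling
        # After six halvings the feature map is (2, 16, 256).
        x = layers.Flatten()(x)
        stats = layers.Dense(2 * LATENT_DIM)(x)             # 2 means + 2 log-variances
        mean, logvar = tf.split(stats, 2, axis=-1)
        return tf.keras.Model(inp, [mean, logvar], name="encoder")

    def sample_z(mean, logvar):
        """Reparameterization trick: z = mean + sigma * eps, with eps ~ N(0, I)."""
        eps = tf.random.normal(tf.shape(mean))
        return mean + tf.exp(0.5 * logvar) * eps

    def build_decoder():
        """2-d latent vector -> (128, 1024, 2) spectral image."""
        z = tf.keras.Input(shape=(LATENT_DIM,))
        x = layers.Dense(2 * 16 * 256)(z)
        x = layers.Reshape((2, 16, 256))(x)
        for f in [256, 256, 256, 128, 64, 32]:              # mirror of the encoder
            x = layers.UpSampling2D(2)(x)                   # 2x2 upsampling
            x = layers.Conv2DTranspose(f, 3, padding="same")(x)
            x = layers.LeakyReLU(0.2)(x)
        out = layers.Conv2D(2, 3, padding="same")(x)        # -> (128, 1024, 2)
        return tf.keras.Model(z, out, name="decoder")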
Figure 2: MFCC representation of a forte chord.

Figure 3: MFCC representation of the forte chord generated by the network.

4 DATASET AND MODEL TRAINING
Our dataset consists of 450 recordings of 15 piano chords played at different keys, dynamics and octaves, performed by the main author. Each recording has a duration of 4 seconds and was recorded at a sampling rate of 16 kHz in Ableton Live, in the wav audio format. Piano keys were pressed for three seconds and released during the last second. The format of the dataset is the same as used in [5].

The chords that we included in the dataset were: C2, Dm2, Em2, F2, G2, Am2, Bdim2, C3, Dm3, Em3, F3, G3, Am3, Bdim3 and C4. We used three levels of dynamics: f (forte), mf (mezzoforte) and p (piano). For each combination, we produced 10 different recordings, for a total of 450 data examples. This dataset can be downloaded from the github repository of the project².

² https://github.com/CreativAI-UC/TimbreNet/tree/master/datasets/

Input: MFCC representation. Instead of using the raw audio samples as input to the network, we decided to use an MFCC representation, which has proven to be very useful for convolutional networks designed for audio content generation [5]. In consequence, the input to the network is a spectral representation of a 4-second window of an audio signal, by means of the MFCC transform. The calculation of the MFCC is done by computing a short-time Fourier Transform (STFT) of each audio window, using a 512 stride and a 2048 window size, obtaining an image of size (128,1024,2). Magnitude and unwrapped phase are coded in different channels of the image.
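As an illustration of this preprocessing step, the sketch below computes such an input image with librosa. The paper does not specify which frequency bin is discarded to reach 1024 bins, nor how the time axis is fitted to 128 frames, so dropping the Nyquist bin and zero-padding the time axis are our assumptions.

    import librosa
    import numpy as np

    def chord_to_input(path, sr=16000, n_fft=2048, hop=512):
        """Load a 4-second chord recording and build a (128, 1024, 2) image."""
        audio, _ = librosa.load(path, sr=sr, duration=4.0)
        stft = librosa.stft(audio, n_fft=n_fft, hop_length=hop)  # (1025, frames)
        mag = np.abs(stft)
        phase = np.unwrap(np.angle(stft), axis=1)   # unwrap phase along time
        img = np.stack([mag, phase], axis=-1)       # (1025, frames, 2)
        img = img[:1024]                            # drop the Nyquist bin (assumed)
        img = np.transpose(img, (1, 0, 2))          # -> (frames, 1024, 2)
        if img.shape[0] < 128:                      # pad the time axis to 128 frames
            img = np.pad(img, ((0, 128 - img.shape[0]), (0, 0), (0, 0)))
        return img[:128].astype("float32")          # (128, 1024, 2)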
Figure 2 displays the MFCC transform of a 4-second audio recording of a piano chord performed forte. Magnitude is shown on top, while unwrapped phase is displayed at the bottom. The network outputs an MFCC audio representation as well. Figure 3 displays the MFCC representation of a 4-second audio recording of the same forte chord of Figure 2, but in this case the chord was generated by the network by sampling the same position in the latent space where the original chord lies.

Model training. We used tensorflow 2.0 to implement our model. For training, we split our dataset leaving 400 examples for training and validation, and 50 examples for testing. We used an Adam optimizer with default parameters and a learning rate of 3 × 10⁻⁵. We chose a batch size of 5, and training was performed for 500 epochs; the full training took about 6 hours using one GPU, an Nvidia GTX 1080Ti. We used the standard cost function for VAE networks, which has one term corresponding to the reconstruction loss and a second term corresponding to the KL divergence loss; in practice, the model was trained to maximize the ELBO (Evidence Lower BOund) [10, 15]. We tested different β weights for the KL term to find out how it affects the clustering of the latent space [8]. The best results were obtained with β = 1.
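The objective just described can be sketched as the following training step, reusing sample_z and the models from the Section 3 sketch. Minimizing this quantity is equivalent to maximizing the β-weighted ELBO; the squared-error reconstruction term is our assumption, since the paper does not specify the exact reconstruction loss.

    import tensorflow as tf

    optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)  # as reported above
    BETA = 1.0                                                # best results: beta = 1

    def vae_loss(x, x_recon, mean, logvar, beta=BETA):
        """Negative beta-ELBO: reconstruction loss + beta * KL(q(z|x) || N(0, I))."""
        recon = tf.reduce_mean(
            tf.reduce_sum(tf.square(x - x_recon), axis=[1, 2, 3]))
        # Closed-form KL divergence of a diagonal Gaussian from the standard normal.
        kl = -0.5 * tf.reduce_mean(
            tf.reduce_sum(1.0 + logvar - tf.square(mean) - tf.exp(logvar), axis=-1))
        return recon + beta * kl

    @tf.function
    def train_step(x, encoder, decoder):
        with tf.GradientTape() as tape:
            mean, logvar = encoder(x, training=True)
            x_recon = decoder(sample_z(mean, logvar), training=True)
            loss = vae_loss(x, x_recon, mean, logvar)
        variables = encoder.trainable_variables + decoder.trainable_variables
        optimizer.apply_gradients(zip(tape.gradient(loss, variables), variables))
        return loss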
Figure 4: Two dimensional latent space representation of the dataset. Chords are arranged in a spiral pattern, from a forte to a piano dynamic.

5 USE CASES
Latent space exploration. Figure 4 displays the two-dimensional latent space generated by the network. Chords are arranged in a spiral pattern following dynamics and octave position. Louder chords are positioned in the outer tail of the spiral, while softer sounds are in close proximity to the center. Chords are also arranged by octave: lower octaves are towards the outer tail, while higher octaves tend to be closer to the center. In this two-dimensional space, the x coordinate seems to be related mainly to chroma, i.e. different chords, while the y coordinate is dominated by octave, from lower to higher, and by dynamics, from louder to softer. A remarkable property of this latent space is that different chords are arranged by thirds, following the pattern C, E, G, B, D, F, A. This means that neighboring chords share the largest number of common pitches. In general, this latent space is able to separate types of chords.
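A plot along the lines of Figure 4 can be reproduced by projecting chords through the encoder and scattering their two-dimensional posterior means. The helper below is a hypothetical sketch that assumes the encoder from the Section 3 sketch and per-chord annotations (chord name, dynamic or octave) used only for coloring.

    import matplotlib.pyplot as plt

    def plot_latent_space(encoder, inputs, labels):
        """Scatter the 2-d posterior means of a batch of chords (cf. Figure 4)."""
        mean, _ = encoder.predict(inputs)            # inputs: (N, 128, 1024, 2)
        for lab in sorted(set(labels)):
            idx = [i for i, l in enumerate(labels) if l == lab]
            plt.scatter(mean[idx, 0], mean[idx, 1], label=lab, s=12)
        plt.xlabel("z[0] (mainly chroma)")
        plt.ylabel("z[1] (mainly octave / dynamics)")
        plt.legend(fontsize=6)
        plt.show()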
Chord generation. One of the nice properties of latent spaces is the ability to generate new chords by selecting positions in the plane that the network was not trained on. In Figure 5 we show the MFCC coefficients of a completely new chord generated by the network.

Figure 5: MFCC of a new chord generated by the network.
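This generation step can be sketched as follows, under the same representational assumptions as the preprocessing sketch in Section 4: the decoded magnitude and phase channels are recombined into a complex spectrogram and inverted with an inverse STFT. The latent coordinates in the usage comment are hypothetical.

    import numpy as np
    import librosa

    def generate_chord(decoder, z, hop=512, sr=16000):
        """Decode a 2-d latent point into a 4-second waveform."""
        img = decoder.predict(np.asarray([z], dtype="float32"))[0]  # (128, 1024, 2)
        mag, phase = img[:, :, 0].T, img[:, :, 1].T                 # (1024, 128) each
        # Restore the frequency bin dropped at analysis time (assumed zero).
        mag = np.vstack([mag, np.zeros((1, mag.shape[1]))])
        phase = np.vstack([phase, np.zeros((1, phase.shape[1]))])
        return librosa.istft(mag * np.exp(1j * phase),
                             hop_length=hop, length=4 * sr)

    # e.g. a point of the plane not seen during training (coordinates hypothetical):
    # audio = generate_chord(decoder, z=[0.3, -1.2])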
Chord sequencer. Another creative feature of our network is the exploration of the latent space with predefined trajectories, which allows for the generation of sequences of chords, resulting in a certain harmonic space. These trajectories encompass not only different chord chromas, but different dynamics and octaves as well. In Figure 6, one possible trajectory is shown. In this case, we can navigate from piano to forte, and from the third octave to the first, and at the same time we can produce different chords, following a desired pattern.
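One way to realize such a trajectory is to decode points sampled along a line between two latent coordinates, reusing generate_chord from the sketch above; the start and end coordinates here are hypothetical.

    import numpy as np

    def chord_trajectory(decoder, start, end, steps=8):
        """Decode chords along a straight line in the latent plane (cf. Figure 6)."""
        zs = np.linspace(np.asarray(start, dtype="float32"),
                         np.asarray(end, dtype="float32"), steps)
        return [generate_chord(decoder, z) for z in zs]

    # e.g. from a loud chord in the outer spiral to a soft one near the center:
    # sequence = chord_trajectory(decoder, start=[2.0, 1.5], end=[0.2, 0.1])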

6 CONCLUSIONS AND FUTURE WORK
We have constructed Latent Chords, a VAE that generates chords and chord sequences performed at different levels of dynamics and in different octaves. We were able to represent the dataset in a very compact two-dimensional latent space where chords are clearly clustered based on chroma, and where the axes correlate with octave and dynamic level. Contrary to many previous works reported in the literature, we used audio recordings of piano chords with musically meaningful variations such as dynamic level and octave positioning. We presented two use cases, and we have shared our dataset, sound examples and network architecture with the community.

We would like to extend our work to a larger dataset, including new chord chromas, more levels of dynamics, more octave variation and different articulations. We would also like to explore the design of another neural network devoted to exploring the latent space in musically meaningful ways. This would allow us to generate a richer variety of chord music and to customize trajectories according to the desires and goals of each composer. We will also attempt to build an interactive tool such as Moodplay [1] to allow users exploratory search over a latent music space, but with added generative functionality.

Figure 6: One possible trajectory for an exploration of the latent space. Trajectories consist of different chords, but also of different octaves and dynamics.

ACKNOWLEDGMENTS
This research was funded by the Dirección de Artes y Cultura, Vicerrectoría de Investigación from the Pontificia Universidad Católica de Chile. This work is also partially funded by Fondecyt grants #1161328 and #1191791, ANID, Government of Chile, as well as by the Millennium Institute Foundational Research on Data (IMFD).

REFERENCES
[1] Ivana Andjelkovic, Denis Parra, and John O’Donovan. 2016. Moodplay: Interactive mood-based music discovery and recommendation. In Proceedings of the 2016 Conference on User Modeling Adaptation and Personalization. ACM, 275–279.
[2] Jean-Pierre Briot, Gaëtan Hadjeres, and François-David Pachet. 2019. Deep learning techniques for music generation. Sorbonne Université, UPMC Univ Paris 6 (2019).
[3] Jun-qi Deng and Yu-Kwong Kwok. 2016. A Hybrid Gaussian-HMM-Deep Learning Approach for Automatic Chord Estimation with Very Large Vocabulary. In ISMIR. 812–818.
[4] Hao-Wen Dong, Wen-Yi Hsiao, Li-Chia Yang, and Yi-Hsuan Yang. 2018. MuseGAN: Multi-track sequential generative adversarial networks for symbolic music generation and accompaniment. In Thirty-Second AAAI Conference on Artificial Intelligence.
[5] Jesse Engel, Cinjon Resnick, Adam Roberts, Sander Dieleman, Mohammad Norouzi, Douglas Eck, and Karen Simonyan. 2017. Neural audio synthesis of musical notes with wavenet autoencoders. In Proceedings of the 34th International Conference on Machine Learning, Volume 70. JMLR.org, 1068–1077.
[6] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep learning. MIT Press.
[7] Gaëtan Hadjeres, François Pachet, and Frank Nielsen. 2017. DeepBach: a steerable model for Bach chorales generation. In Proceedings of the 34th International Conference on Machine Learning, Volume 70. JMLR.org, 1362–1371.
[8] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. 2017. beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. ICLR 2, 5 (2017), 6.
[9] Eric J Humphrey, Taemin Cho, and Juan P Bello. 2012. Learning a robust tonnetz-space transform for automatic chord recognition. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 453–456.
[10] Diederik P Kingma and Max Welling. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013).
[11] Naveen Kodali, Jacob Abernethy, James Hays, and Zsolt Kira. 2017. On convergence and stability of GANs. arXiv preprint arXiv:1705.07215 (2017).
[12] Filip Korzeniowski and Gerhard Widmer. 2016. Feature learning for chord recognition: The deep chroma extractor. arXiv preprint arXiv:1612.05065 (2016).
[13] Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. 2018. Which training methods for GANs do actually converge? arXiv preprint arXiv:1801.04406 (2018).
[14] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. 2016. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499 (2016).
[15] Rajesh Ranganath, Sean Gerrish, and David Blei. 2014. Black box variational inference. In Artificial Intelligence and Statistics. 814–822.
[16] Adam Roberts, Jesse Engel, and Douglas Eck. 2017. Hierarchical variational autoencoders for music. In NIPS Workshop on Machine Learning for Creativity and Design.
[17] Adam Roberts, Jesse Engel, Colin Raffel, Curtis Hawthorne, and Douglas Eck. 2018. A hierarchical latent vector model for learning long-term structure in music. arXiv preprint arXiv:1803.05428 (2018).
[18] Bob L Sturm, Joao Felipe Santos, Oded Ben-Tal, and Iryna Korshunova. 2016. Music transcription modelling and composition using deep learning. arXiv preprint arXiv:1604.08723 (2016).
[19] Aline Weber, Lucas Nunes Alegre, Jim Torresen, and Bruno C. da Silva. 2019. Parameterized Melody Generation with Autoencoders and Temporally-Consistent Noise. In Proceedings of the International Conference on New Interfaces for Musical Expression, Marcelo Queiroz and Anna Xambó Sedó (Eds.). UFRGS, Porto Alegre, Brazil, 174–179.
[20] Ivan P Yamshchikov and Alexey Tikhonov. 2017. Music generation with variational recurrent autoencoder supported by history. arXiv preprint arXiv:1705.05458 (2017).
[21] Li-Chia Yang, Szu-Yu Chou, and Yi-Hsuan Yang. 2017. MidiNet: A convolutional generative adversarial network for symbolic-domain music generation. arXiv preprint arXiv:1703.10847 (2017).
[22] Xinquan Zhou and Alexander Lerch. 2015. Chord detection using deep learning. In Proceedings of the 16th ISMIR Conference, Vol. 53.