=Paper= {{Paper |id=Vol-3611/paper16 |storemode=property |title=Machine learning methods for content-based music recommendation systems |pdfUrl=https://ceur-ws.org/Vol-3611/paper16.pdf |volume=Vol-3611 |authors=Milita Songailaitė,Tomas Krilavičius |dblpUrl=https://dblp.org/rec/conf/ivus/SongailaiteK22 }} ==Machine learning methods for content-based music recommendation systems== https://ceur-ws.org/Vol-3611/paper16.pdf
Machine learning methods for content-based music recommendation systems

Milita Songailaitė¹, Tomas Krilavičius¹

¹ Vytautas Magnus University, Faculty of Informatics, Vileikos street 8, LT-44404 Kaunas, Lithuania



Abstract

With the increased popularity of online music streaming, people find themselves spending more and more time choosing the content they like. This creates a need for fast and accurate music recommendation methods that let users skip large quantities of unwanted music and find precisely what they like. This work presents methods to compare music based entirely on its audio signal properties. We used three different approaches (Gaussian mixture models, dynamic time warping and autoencoders) to calculate the similarity between the given signals. All three experiments were performed on a database consisting of the 2511 most popular songs from 10 different genres. The methods were evaluated by comparing the algorithms' results with the music similarity judgments given by the experts. The Gaussian mixture model was the best-evaluated method, while the worst was the autoencoder method.

Keywords

Machine learning, Gaussian mixture models, dynamic time warping, autoencoder


IVUS 2022: 27th International Conference on Information Technology, May 12, 2022, Kaunas, Lithuania
milita.songailaite@stud.vdu.lt (M. Songailaitė); tomas.krilavicius@vdu.lt (T. Krilavičius)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073

1. Introduction

The music industry has moved rapidly toward online streaming services during the last decade. For example, in 2011, more than 80% of music sales revenues came from physical and digital records [1]; however, by 2021, the majority of revenues came from music streaming services alone [2]. Moreover, the amount of music that people can find online is also increasing. For instance, Spotify, one of the largest music broadcasters, offers customers a selection of more than 70 million songs, supplementing this selection with 60,000 new songs each day [3]. In addition, another popular music broadcaster, SoundCloud, receives nearly 12 hours of new music content each hour [4].

In order to deal with such vast amounts of data, companies use music recommendation algorithms. These algorithms can be grouped into two main categories: content-based and context-based recommendations [5]. Context-based recommendation systems use users' data (such as liked-songs history or the list of favorite genres) as well as similar customers' choices. This may lead to the system recommending only the songs that other users find similar. In other words, the algorithm will start recommending the most popular songs among similar users and leave less popular songs behind. Content-based recommendation systems avoid this problem because the recommendations are based purely on the signal's audio properties and nothing else.

In this work, we focus only on content-based recommendation systems. A feature profile was generated to characterize each song; it consisted of various audio signal properties created by several different methods. The generated feature profiles were used as song vectors in the created content-based recommendation systems.

The rest of the paper is organized as follows. Related work is reviewed in Section 2. The recommendation systems are discussed in Section 3. The evaluations of the proposed systems are provided in Section 4. The conclusions are given in Section 5.

2. Related work

The analyzed literature covers two main topics: content-based music recommendation systems and the use of deep learning for audio signal feature generation.

Most of the papers ([6], [7], [8], [9]) on content-based systems used Mel-frequency cepstral coefficients (MFCCs) as the audio profile for the songs. The profiles were compared using K-means, Gaussian Mixture Model and Kullback-Leibler divergence measures, which distinguished the most similar profiles and recommended them as the most similar songs to the given queries. The system described in [10] simply used the Short-time Fourier transform to capture the features of the signals. In order to compare the gathered time series, the authors used the Locality-sensitive hashing method.




The method worked especially well when detecting song covers. A slightly different approach was introduced in paper [11]. The authors used feature vectors consisting of 18 distinct features provided by The Echo Nest API. The feature profiles were clustered using the K-means algorithm to gather clusters of similar songs.
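Several of the papers above compare MFCC-based profiles with the Kullback-Leibler divergence. For two univariate Gaussians this divergence has a closed form, which the following sketch illustrates (the means and variances here are made-up values for illustration, not taken from any of the cited systems):

```python
import math

def kl_gaussian(mu_p, sigma_p, mu_q, sigma_q):
    """Closed-form KL divergence D_KL(P || Q) between two univariate
    Gaussians P = N(mu_p, sigma_p^2) and Q = N(mu_q, sigma_q^2)."""
    return (math.log(sigma_q / sigma_p)
            + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2 * sigma_q ** 2)
            - 0.5)

# Identical distributions have zero divergence ...
print(kl_gaussian(0.0, 1.0, 0.0, 1.0))   # 0.0
# ... and the measure is asymmetric in general:
print(kl_gaussian(0.0, 1.0, 0.0, 2.0))   # D(P||Q)
print(kl_gaussian(0.0, 2.0, 0.0, 1.0))   # D(Q||P), a different value
```

The asymmetry is why systems that use KL as a "distance" often symmetrize it, e.g. by averaging D(P||Q) and D(Q||P).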
The second part of the literature review focused specifically on the use of deep learning in music recommendation systems. Some of the most popular approaches used deep learning to generate new sound representation vectors for the signals. Papers [12], [13] describe how the autoencoder architecture can be used to encode musical style, which is later used to generate music. The authors of [14] tested two different architectures (OpenL3 and VGGish) for encoding audio signals. These encodings helped the authors distinguish the emotions of the songs and classify them. Finally, in [15] the authors compared variational autoencoders with LSTM and recurrent networks. The research showed that almost all deep learning networks outperformed baseline principal component methods for feature vector generation.

Figure 1: Workflow of the proposed system

Table 1
The distribution of the number of songs in each genre

    Genre        Number of songs
    Pop               275
    Rock              246
    Metal             236
    EDM               257
    Kpop              265
    Country           236
    Classical         203
    Jazz              252
    Blues             253
    Rap               288

3. Recommendation system overview
After analyzing the literature and testing the most
popular algorithms, we selected three main directions
in which our research was carried out:
   1. a Gaussian mixture model for feature space modelling;
   2. vector similarity using the dynamic time warping distance;
   3. an autoencoder network for feature encoding.

The workflow of the experiments is depicted in Figure 1. First, we produced the Mel-frequency cepstral coefficients (MFCCs), the baseline of the songs' feature profiles. For each song, we generated the first 8 MFCCs at a sample rate of 22 050 Hz. After that, in order to create the songs' audio profiles, the existing MFCCs were transformed using the three mentioned methods. In the end, the created recommendation systems were evaluated using two methods:

   1. finding the average number of same-genre, same-artist and same-album songs among the generated list of recommended songs;
   2. comparing the music sorting results of the algorithms with the human ability to sort by similarity.

3.1. Dataset

A dataset consisting of 2511 popular songs was used to evaluate the developed music recommendation systems. The database consisted of songs created by 135 artists from 10 different genres. The distribution of songs across genres is presented in Table 1. All songs were converted and stored in the uncompressed .wav format. This form of the signal was the input for all models. In addition to the audio signals, we also stored a unique id, the song's artist, album, and genre for each song. However, this metadata was only used to test the recommendation systems; the signals' audio profiles consisted purely of the songs' audio data.

3.2. Methods

3.2.1. Gaussian mixture model

The first music modelling algorithm used was the Gaussian Mixture Model (GMM). This method was initially mentioned in paper [8]. First, all 30 s audio signals are clustered into 5 clusters using a Gaussian mixture classifier. The original paper uses only 3 clusters, but testing has shown that a slightly larger number of clusters works better in our system. The fitted Gaussian mixture models defined the characteristics of each audio signal with 5 Gaussian distributions. Next, these distributions were compared using the Kullback-Leibler (KL) divergence. This divergence measure shows how much one distribution differs from the other. It can be calculated using formula (1), where p(x) and q(x) are the probability distributions [16]. Since the Gaussian mixture model describes each audio profile as a combination of Gaussian distributions, we can compare Gaussian models based on this measure.

    D_KL(P, Q) = ∫_{ℝ^d} ln( p(x) / q(x) ) p(x) dx        (1)

The proposed model describes the songs by 5 clusters, each characterized by 8 Gaussian distributions. Each distribution in the first cluster of the query is compared to each distribution in the first cluster of the song from the database. These calculations are repeated for each cluster, and the resulting eight estimates are averaged. Finally, we obtain a vector of 5 values, where each value reflects the similarity between the two compared clusters of the audio signals. These values are also averaged. Therefore, we get one value, which reflects the similarity of the Gaussian Mixture Models.

3.2.2. Dynamic time warping method

Dynamic Time Warping (DTW) was the second method tested to model the songs' audio profiles. Unlike the Gaussian mixture method, the main point of this method was to compare MFCCs directly, without further modification. Thus, the vectors of the generated MFCC matrices were compared by calculating the Dynamic Time Warping distance measure (2), where δ(w_k) is usually the Euclidean distance between the two time series at warping-path step w_k [17]. This distance measure was chosen since it lets us find similarities between audio signals of different durations and gives us more freedom when capturing musical motifs across the audio signal.

    DTW(S, T) = min [ Σ_{k=1}^{p} δ(w_k) ]        (2)

Since audio signals are multidimensional time series, the Dynamic Time Warping distance was modified to work with multidimensional data. Two modifications were selected: the dependent DTW (DTWd) and the independent DTW (DTWi), both described in [18]. The result is an array of songs sorted in ascending order according to the calculated distances. Finally, the N most similar songs are selected and considered a recommendation for the given query.

3.2.3. Autoencoder model

The final tested model was the Autoencoder (AE) network. An autoencoder is a deep learning method which learns a representation by attempting to encode the input and then validates that representation by regenerating the original input from it [19]. Similar to the Gaussian Mixture Model, the main point of this algorithm was to compress the audio signals so that the created features would contain only the most essential information. In this case, both the database songs' and the queries' MFCCs are encoded with the created Autoencoder model. This way, we get new attribute vectors for each song and the query, and thus we can calculate the distance between the newly generated features. For this step, we chose the Euclidean distance. Finally, these distances are sorted in ascending order, and the top N songs closest to the query are selected as a recommendation for the query.

The Autoencoder has an encoding and a decoding part. The encoding part consists of four fully connected layers. These layers compress the flattened MFCC matrices into an array of 64 values. After that, the decoder part, which also has four fully connected layers, decodes the array back into its initial dimensions. We used the ReLU activation function in the hidden layers and the linear activation function in the output layer. The Adam optimizer was chosen to optimize the network. The network was trained for 100 epochs using a batch size of 100 songs.
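The two multidimensional DTW variants of Section 3.2.2 can be sketched in plain Python as follows. This is a minimal dynamic-programming version of Eq. (2) with a Euclidean point distance; it is an illustration of the dependent/independent distinction, not the authors' implementation, and the toy sequences stand in for much longer MFCC matrices:

```python
import math

def dtw(s, t, dist):
    """Classic DTW: minimal cumulative distance over all warping paths."""
    n, m = len(s), len(t)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(s[i - 1], t[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_dependent(S, T):
    """DTWd: one warping path over whole feature vectors (frames)."""
    return dtw(S, T, euclidean)

def dtw_independent(S, T):
    """DTWi: a separate DTW per feature dimension, summed over dimensions."""
    dims = len(S[0])
    return sum(
        dtw([f[d] for f in S], [f[d] for f in T], lambda a, b: abs(a - b))
        for d in range(dims)
    )

# Two toy 2-dimensional "MFCC" sequences of different lengths.
S = [(0.0, 1.0), (1.0, 1.0), (2.0, 0.0)]
T = [(0.0, 1.0), (2.0, 0.0)]
print(dtw_dependent(S, S))    # 0.0 -- identical sequences
print(dtw_independent(S, S))  # 0.0
print(dtw_dependent(S, T))    # > 0 for non-identical sequences
```

Note that DTW accepts sequences of different lengths, which is exactly the property the paper relies on when comparing excerpts of varying duration.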
Table 2
Metainformation of the selected query

    Name of the song     Kamikaze
    Artist               Eminem
    Album                Kamikaze
    Genre                Rap
    Start time           1:42
    End time             2:12
    Duration, s          30
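The evaluation in Section 4.2 aggregates volunteer votes with Eq. (3) and scores the algorithms with the MSE of Eq. (4). A small sketch with invented vote percentages (not the actual survey data) shows both steps:

```python
def rank_score(r1, r2, r3):
    """Eq. (3): expected rank of a clip, given the fractions of volunteers
    who placed it 1st, 2nd, and 3rd (r1 + r2 + r3 = 1)."""
    return 1 * r1 + 2 * r2 + 3 * r3

def mse(y, y_hat):
    """Eq. (4): mean square error between survey ranks and algorithm ranks."""
    n = len(y)
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / n

# Invented example: for one query, 70% of volunteers ranked clip A first,
# 20% second and 10% third, and so on for clips B and C.
survey = [rank_score(0.7, 0.2, 0.1),   # clip A -> 1.4
          rank_score(0.2, 0.6, 0.2),   # clip B -> 2.0
          rank_score(0.1, 0.2, 0.7)]   # clip C -> 2.6
algorithm = [1, 2, 3]                  # an algorithm's ordering of A, B, C
print(mse(survey, algorithm))
```

An algorithm whose integer ordering matches the crowd's soft ranking closely gets an MSE near zero; a reversed ordering is penalized heavily.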



4.1. Number of same-artist, same-album and same-genre songs in the recommendations list

This test assumes that songs from the same artist, album, and genre are similar. Thus, we can say that similarity detection is good if the algorithm detects as many same-artist, same-album, and same-genre songs as possible. Such a test was performed for each created system. First, we selected a query to test the systems (the query is described in Table 2). Then we analyzed the list of the top 50 songs recommended by the algorithms.

Figure 2: The number of songs by the same artist, album, and genre in the recommendations generated by each algorithm. The X-axis depicts the three performed tests and the Y-axis shows the number of similar songs found. GMM_kl - Gaussian Mixture Model method; DTWi and DTWd - independent and dependent Dynamic Time Warping methods; AE - Autoencoder method

The results of the first test are given in Figure 2. It can be observed that the best algorithm in all three categories was the Gaussian Mixture model. No other algorithm had songs from the same album in its recommendation list. However, this was expected, since there are only 11 songs in the selected album, and they are not necessarily similar to the query. The remaining algorithms performed almost equally poorly, although the Autoencoder model found the smallest number of similar songs.

4.2. Music sorting results in comparison with the human ability to sort

In the second test, the created algorithms were compared to the human ability to rank songs by their similarity. The experiment involved 23 volunteers who were given ten different query songs. The volunteers had to sort three given music clips from the most similar (first place) to the least similar (third place) for each query. As a result, relative positions of the lined-up songs were provided for each query. Generally, these calculations can be written with formula (3), where r1, r2, and r3 are the percentages of volunteers who voted for the first, second, and third places, respectively.

    rank = 1 ⋅ r1 + 2 ⋅ r2 + 3 ⋅ r3        (3)

The algorithms were given the same task: they had to sort the given music clips by their similarity to the query. The positions of the songs sorted by the algorithms were denoted by integers (1, 2, and 3). In order to estimate the accuracy of the methods, the Mean Square Error (4) was used, where Y_i are the results of the surveys and Ŷ_i are the rankings of the algorithms.

    MSE = (1/n) Σ_{i=1}^{n} (Y_i − Ŷ_i)²        (4)

Figure 3 shows the accuracy of the methods for each query individually. Table 3 shows the total MSE estimates for all of the methods.

Figure 3: The Mean Square Error of the algorithms' sorting results compared to the volunteers' sorting. The X-axis depicts the ten performed tests and the Y-axis shows the MSE estimates. GMM_kl - Gaussian Mixture Model method; DTWi and DTWd - independent and dependent Dynamic Time Warping methods; AE - Autoencoder method

Table 3
Overall estimation of the Mean Square Error for each method

    Method     MSE
    GMM_KL    0.7345
    DTWi      0.8452
    DTWd      0.7632
    AE        0.9605

Similarly to the first test, the accuracy of the Gaussian mixture method was the highest. However, the two Dynamic Time Warping methods behaved slightly differently in this test: while the independent DTW method had a higher score in the first test, the dependent DTW method generated better recommendations in the second test.

The Pearson Correlation Coefficient was used to compare how similar the created models are. The correlation was calculated between the error functions for each tested query using formula (5), where cov(X, Y) is the covariance of the two features, and σ_X, σ_Y are the standard deviations of these features. The results are presented in a correlation matrix (see Table 4 and Figure 4).

    ρ_{X,Y} = cov(X, Y) / (σ_X σ_Y)        (5)

Table 4
Methods correlation matrix values

              GMM_KL    DTWi     DTWd      AE
    GMM_KL     1.000   -0.319   -0.281   -0.469
    DTWi      -0.319    1.000    0.959    0.881
    DTWd      -0.281    0.959    1.000    0.795
    AE        -0.469    0.881    0.795    1.000

Figure 4: Methods correlation matrix heatmap. GMM_kl - Gaussian Mixture Model method; DTWi and DTWd - independent and dependent Dynamic Time Warping methods; AE - Autoencoder method

The methods correlation matrix showed that the DTWi, DTWd, and AE methods had a weak negative correlation with the GMM_KL method. This was due to several query tests (e.g., Q1 or Q3 in Figure 3), as the rankings of the algorithms in these tests differed significantly. Another finding is that all poorly rated algorithms (DTWi, DTWd, and AE) behave reasonably similarly. The structure of the survey itself determined this phenomenon: because for each query the users and the algorithms had only three songs available for sorting, there were few possible lineup options. Therefore, even if the algorithms misidentified the similarity, there was still a high chance that they would make similar recommendations to each other.

Finally, after the second test, it can be stated that the most accurate method for similar music recommendations was the Gaussian Mixture Model. The remaining algorithms, the same as in the first test, gave inferior recommendations.

5. Conclusions and recommendations

5.1. Results

In this work, an analysis of existing content-based music recommendation systems was performed. Different systems for recommending similar songs were developed using the best-found algorithms (Gaussian Mixture Models, the Dynamic Time Warping distance, and Autoencoder networks). Two types of accuracy evaluations were performed for the developed systems: finding the number of same-artist, same-album, and same-genre songs; and comparing the algorithms' music sorting with human-ranked results.

5.2. Conclusions

The following conclusions were formulated based on the obtained results:

   1. The analysis of the algorithms' recommendations showed that the best performing algorithm was the Gaussian mixture model. The GMM was able to identify similar songs in terms of the audio signal, and the recommendations generated
      by the algorithm were in line with the majority             the exploration of excerpts from songs of differ-
      of survey respondents.                                      ent lengths (both longer and shorter), thus cre-
   2. Dynamic Time Warping methods performed                      ating a universal model for identifying similar
      slightly worse than the Gaussian Mixture                    music.
      Method. Both methods (dependent and inde-
      pendent) were able to find at least several songs
      from the same artist or genre. The accuracy           References
      of the independent method in the first test was
                                                            [1] U.S. Sales Database. en-US. Available at https :
      higher due to the higher number of songs in the
                                                                //www.riaa.com/u- s- sales- database/. (Visited
      same genre in the recommendations list. In the
                                                                on 12/10/2021).
      second test, the dependent method was more
      accurate. The Dynamic Time Warping method             [2] MID-YEAR 2021 RIAA REVENUE STATISTICS.
      provided almost as accurate recommendations               en-US. Available at https://www.riaa.com/wp-
      as the Gaussian mixture method in this test.              content / uploads / 2021 / 09 / Mid - Year - 2021 -
   3. The worst-performing method was modeling                  RIAA- Music- Revenue- Report.pdf. url: https:
      the musical features using Autoencoder net-               / / www. riaa . com / wp - content / uploads / 2021 /
      works. This method had the lowest accuracy of             09 / Mid - Year - 2021 - RIAA - Music - Revenue -
      all the other algorithms tested in both tests. This       Report.pdf (visited on 12/10/2021).
      result suggests that the Autoencoder network is       [3] Spotify Revenue and Usage Statistics (2021). en.
      too simple to model complex musical features.             https://www.businessofapps.com/data/spotify-
                                                                statistics/. July 2021. url: https : / / www .
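Conclusion 2 contrasts the dependent and independent variants of multivariate Dynamic Time Warping. As a minimal illustration only (this is not the paper's implementation: feature extraction, step patterns, and windowing constraints are omitted, and the function names are ours), the two variants over frame-by-feature matrices can be sketched as:

```python
import numpy as np

def dtw(a, b, dist):
    """Classic DTW: dynamic programme over the pairwise cost matrix."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
            D[i, j] = dist(a[i - 1], b[j - 1]) + step
    return D[n, m]

def dtw_dependent(X, Y):
    """Dependent DTW: one warping path over whole feature vectors."""
    return dtw(X, Y, lambda x, y: np.linalg.norm(x - y))

def dtw_independent(X, Y):
    """Independent DTW: warp each feature dimension separately, then sum."""
    return sum(dtw(X[:, d], Y[:, d], lambda x, y: abs(x - y))
               for d in range(X.shape[1]))
```

Here `X` and `Y` would be per-song feature sequences (e.g., MFCC frames); the dependent variant shares a single alignment across all features, while the independent variant lets each feature warp on its own.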
5.3. Recommendations

The following recommendations outline future work that
would let us improve the developed algorithms:

   1. Extend the collected database, both in the number
      of music signals and in the number of variables
      describing the characteristics of the collected
      signals.
   2. Improve the testing methodology. So far, the
      survey is relatively small (10 queries, three
      songs for each); we plan to increase both the
      scope of this test and the number of people
      participating in the study.
   3. Since deep neural network methods are often used
      in the literature for music modelling, we plan to
      extend their usage for music recommendation. The
      tested Autoencoder could be transformed into a
      more complex neural network structure. Since
      modelling music with Gaussian distributions has
      proven the most successful approach in this
      paper, the simple Autoencoder could be changed to
      a variational Autoencoder, in which each music
      feature is described by a Gaussian distribution
      instead of being encoded into a plain vector.
      Another potentially suitable variant is the
      recurrent neural network, which would allow
      modelling the audio signal as a sequence; the
      recommendations generated could then draw on
      excerpts from songs of different lengths (both
      longer and shorter), creating a universal model
      for identifying similar music.
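Recommendation 3 proposes a variational Autoencoder in which each latent music feature is described by a Gaussian distribution; the training objective of such a model includes a KL-divergence term between Gaussians, and the same closed form can also serve as a similarity score between Gaussian models of two songs. A hedged numpy sketch (function names are ours; the paper does not specify this exact metric):

```python
import numpy as np

def kl_diag_gaussians(mu1, var1, mu2, var2):
    """Closed-form KL( N(mu1, diag(var1)) || N(mu2, diag(var2)) )."""
    mu1, var1 = np.asarray(mu1, float), np.asarray(var1, float)
    mu2, var2 = np.asarray(mu2, float), np.asarray(var2, float)
    return 0.5 * np.sum(np.log(var2 / var1)
                        + (var1 + (mu1 - mu2) ** 2) / var2
                        - 1.0)

def symmetric_kl(mu1, var1, mu2, var2):
    """Symmetrised KL, a common choice when comparing two Gaussian models."""
    return (kl_diag_gaussians(mu1, var1, mu2, var2)
            + kl_diag_gaussians(mu2, var2, mu1, var1))
```

Note that KL divergence is asymmetric, hence the symmetrised form when it is used as a distance-like score between songs.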