<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Machine learning methods for content-based music recommendation systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Milita Songailaitė</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tomas Krilavičius</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Vytautas Magnus University, Faculty of Informatics</institution>
          ,
          <addr-line>Vileikos street 8, LT-44404 Kaunas</addr-line>
          ,
          <country country="LT">Lithuania</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>With the increased popularity of online music streaming, people find themselves spending more and more time choosing the content they like. This creates the need for a fast and accurate music recommendation method, which would let users ignore the large quantities of unwanted music and choose precisely what they like. This work presents methods to compare music based entirely on its audio signal properties. For this, we used three different approaches (Gaussian mixture models, dynamic time warping and autoencoders) to calculate the similarity between the given signals. All three experiments were performed on a database consisting of the 2511 most popular songs from 10 different genres. The methods were evaluated by comparing the algorithms' results with the music similarity results given by experts. The Gaussian mixture model was the best-evaluated method, while the worst was the autoencoder method.</p>
      </abstract>
      <kwd-group>
        <kwd>Machine learning</kwd>
        <kwd>Gaussian mixture models</kwd>
        <kwd>dynamic time warping</kwd>
        <kwd>autoencoder</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The music industry has shifted rapidly towards online streaming services during the last decade. For example, in 2011, more than 80% of music sales revenues consisted of physical and digital records [1]; however, by the year 2021, the majority of the revenues came from music streaming services alone [2]. Moreover, the amount of music that people can find online is also increasing. For instance, Spotify, one of the largest music broadcasters, offers customers a selection of more than 70 million songs, supplementing this selection with 60,000 new songs each day [<xref ref-type="bibr" rid="ref4">3</xref>]. In addition, another popular music broadcaster, SoundCloud, uploads nearly 12 hours of music content online each hour [<xref ref-type="bibr" rid="ref5">4</xref>].</p>
      <p>In order to deal with such vast amounts of data, companies use music recommendation algorithms. These algorithms can be grouped into two main categories: content-based and context-based recommendations [<xref ref-type="bibr" rid="ref6">5</xref>]. Context-based recommendation systems use users' data (such as liked-songs history or the list of favorite genres) as well as similar customers' choices. This may lead to the system recommending only the songs that other users find similar. In other words, the algorithm will start recommending the most popular songs among similar users and leave less popular songs behind. Content-based recommendation systems avoid this problem because the recommendations are based purely on the signal's audio properties and nothing else.</p>
      <p>In this work, we focus only on content-based recommendation systems. A features profile was generated to characterize each song, which consisted of various audio signal properties created by several different methods. The generated feature profiles were used as song vectors in the created content-based recommendation systems.</p>
      <p>The rest of the paper is organized as follows. Related work is provided in Section 2. The recommendation systems are discussed in Section 3. The evaluations of the proposed systems are provided in Section 4. The conclusions are given in Section 5.</p>
      <p>KIVUS 2022: 27th International Conference on Information Technology, May 12, 2022, Kaunas, Lithuania. milita.songailaite@stud.vdu.lt (M. Songailaitė); tomas.krilavicius@vdu.lt (T. Krilavičius). © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.</p>
    </sec>
    <sec id="sec-1b">
      <title>2. Related work</title>
      <p>The analyzed literature covers two main topics: content-based music recommendation systems and the usage of deep learning for audio signal feature generation.</p>
      <p>Most of the papers ([<xref ref-type="bibr" rid="ref7">6</xref>], [<xref ref-type="bibr" rid="ref8">7</xref>], [<xref ref-type="bibr" rid="ref9">8</xref>], [<xref ref-type="bibr" rid="ref10">9</xref>]) on content-based systems used Mel-frequency cepstral coefficients (MFCCs) as the audio profile for the songs. The profiles were compared using means, Gaussian mixture models and Kullback-Leibler divergence measures, which distinguished the most similar profiles and recommended them as the most similar songs to the given queries. The system described in [<xref ref-type="bibr" rid="ref12">10</xref>] simply used the Short-time Fourier transform to capture the features of the signals. In order to compare the gathered time series, the authors used the Locality-sensitive hashing method.</p>
      <p>The method worked especially well when detecting song covers. A slightly different approach was introduced in [<xref ref-type="bibr" rid="ref15">11</xref>]. The authors used feature vectors consisting of 18 distinct features provided by The Echo Nest API. The feature profiles were clustered using the K-means algorithm to gather clusters of similar songs.</p>
      <p>The second part of the literature review focused specifically on the use of deep learning in music recommendation systems. Some of the most popular approaches used deep learning to generate new sound representation vectors for the signals. Papers [<xref ref-type="bibr" rid="ref16">12</xref>], [<xref ref-type="bibr" rid="ref17">13</xref>] describe how the autoencoder architecture can be used to encode musical style, which is later used to generate music. The authors of [<xref ref-type="bibr" rid="ref18">14</xref>] tested two different architectures (OpenL3 and VGGish) for encoding audio signals. These encodings helped the authors distinguish the emotions of the songs and classify them.</p>
      <p>Figure 1: Workflow of the proposed system</p>
      <p>Finally, in [<xref ref-type="bibr" rid="ref19">15</xref>] the authors compared variational autoencoders with LSTM and recurrent networks. The research showed that almost all deep learning networks outperformed baseline principal components methods for feature vector generation.</p>
    </sec>
    <sec id="sec-2">
      <title>3. Recommendation system overview</title>
      <p>After analyzing the literature and testing the most popular algorithms, we selected three main directions in which our research was carried out:
1. a Gaussian mixture model for feature space modelling;
2. vector similarity using dynamic time warping distance;
3. an autoencoder network for feature encoding.</p>
      <p>The workflow of the experiments is depicted in Figure 1. First, we produced the Mel-frequency scale cepstral coefficients, the baseline of the songs' features profile. For each song, we generated the first 8 MFCCs at a sample rate of 22 050 Hz. After that, in order to create the songs' audio profiles, the existing MFCCs were transformed using the three mentioned methods. In the end, the created recommendation systems were evaluated using two methods:
1. finding the average number of same genre, same artist and same album songs among the generated list of recommended songs;
2. comparing the music sorting results of the algorithms with the human ability to sort by similarity.</p>
      <sec id="sec-2-1">
        <title>3.1. Dataset</title>
        <p>A dataset consisting of 2511 popular songs was used to evaluate the developed music recommendation systems. The database consisted of songs created by 135 artists from 10 different genres. The distribution of songs across genres is presented in Table 1. All songs were converted and stored in the uncompressed .wav format. This form of the signal was the input for all models. In addition to the audio signals, we also stored a unique id, the song's artist, album, and genre for each song. However, the metadata was only used to test the recommendation systems. Therefore, the signals' audio profiles consisted purely of the songs' audio data.</p>
        <table-wrap id="tbl1">
          <label>Table 1</label>
          <caption>
            <p>Distribution of songs across genres</p>
          </caption>
          <table>
            <thead>
              <tr><th>Genre</th><th>Number of songs</th></tr>
            </thead>
            <tbody>
              <tr><td>Pop</td><td>275</td></tr>
              <tr><td>Rock</td><td>246</td></tr>
              <tr><td>Metal</td><td>236</td></tr>
              <tr><td>EDM</td><td>257</td></tr>
              <tr><td>Kpop</td><td>265</td></tr>
              <tr><td>Country</td><td>236</td></tr>
              <tr><td>Classical</td><td>203</td></tr>
              <tr><td>Jazz</td><td>252</td></tr>
              <tr><td>Blues</td><td>253</td></tr>
              <tr><td>Rap</td><td>288</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <sec id="sec-2-1-1">
          <title>3.2.1. Gaussian mixture model</title>
          <p>The first music modelling algorithm used was the Gaussian Mixture Model (GMM). This method was initially mentioned in [<xref ref-type="bibr" rid="ref9">8</xref>]. First, all 30 s audio signals are clustered into 5 clusters using a Gaussian mixture classifier. The original paper uses only 3 clusters, but testing has shown that a slightly larger number of clusters works better in the system. The created Gaussian mixture models defined the characteristics of each audio signal in 5 Gaussian distributions. Next, these distributions were compared using Kullback-Leibler (KL) divergence. This divergence measure shows how one distribution differs from the other. It can be calculated using formula (1), where p(x) and q(x) are the probability distributions [<xref ref-type="bibr" rid="ref20">16</xref>]. Since the Gaussian mixture model describes each audio profile by Gaussian distributions, we can compare Gaussian models based on this measure.</p>
          <p>KL(p, q) = ∫ℝ p(x) ln(p(x) / q(x)) dx (1)</p>
          <p>The proposed model describes each song by 5 clusters, each characterized by 8 Gaussian distributions. Each distribution in the first cluster of the query is compared to the corresponding distribution in the first cluster of the song from the database. These calculations are repeated for each cluster, and the resulting eight estimates are averaged. We thus obtain a vector of 5 values, where each value reflects the similarity between the two clusters of audio signals being compared. These values are also averaged. Therefore, we get one value, which reflects the similarity of the Gaussian Mixture Models.</p>
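          <p>As an illustration of the comparison scheme above, the following sketch (not the paper's implementation; the function names and the profile layout are assumptions, and the closed-form univariate KL expression replaces the integral in (1)) averages per-distribution KL estimates first within and then across clusters:</p>

```python
import numpy as np

def kl_gaussian(mu_p, var_p, mu_q, var_q):
    # Closed-form KL divergence between two univariate Gaussians;
    # this is formula (1) evaluated analytically for Gaussian p and q.
    return 0.5 * (np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)

def song_similarity(profile_a, profile_b):
    # profile_*: shape (5, 8, 2) -- 5 clusters, 8 Gaussians per cluster,
    # last axis = (mean, variance). Lower value = more similar songs.
    per_cluster = []
    for cluster_a, cluster_b in zip(profile_a, profile_b):
        # compare the k-th distribution of one cluster with the k-th of the other
        kls = [kl_gaussian(a[0], a[1], b[0], b[1]) for a, b in zip(cluster_a, cluster_b)]
        per_cluster.append(np.mean(kls))   # average the 8 estimates
    return float(np.mean(per_cluster))     # average the 5 cluster values
```

          <p>Identical profiles yield a value of 0, and the value grows as the distributions drift apart.</p>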
        </sec>
        <sec id="sec-2-1-2">
          <title>3.2.2. Dynamic time warping method</title>
          <p>Dynamic Time Warping (DTW) was the second method tested to model the songs' audio profiles. Unlike the Gaussian mixture method, this approach compares the MFCC sequences directly. The Dynamic Time Warping distance measure was chosen since it lets us find the distance between two time series [17]. This gives us more freedom when capturing similarities between audio signals of two different durations and lets us follow musical motifs across the audio signal. The distance is computed with the recurrence measure (2), where d(qi, cj) is usually the Euclidean distance; following [18], both the dependent and the independent multi-dimensional DTW variants were applied.</p>
          <p>D(i, j) = d(qi, cj) + min(D(i − 1, j), D(i, j − 1), D(i − 1, j − 1)) (2)</p>
          <p>The result is an array of songs sorted in ascending order according to the calculated distances. Finally, the n most similar songs are selected and considered a recommendation for a given query.</p>
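          <p>The recurrence above can be sketched as follows (a minimal reference implementation of standard DTW with a Euclidean local cost; dtw_distance is an illustrative name, and a production system would use an optimized library):</p>

```python
import numpy as np

def dtw_distance(x, y):
    # Standard DTW dynamic program: D[i, j] = d(x_i, y_j) +
    # min(D[i-1, j], D[i, j-1], D[i-1, j-1]), with Euclidean local cost d.
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(np.atleast_1d(x[i - 1]) - np.atleast_1d(y[j - 1]))
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])
```

          <p>Because the warping path may stretch either sequence, a song and a slowed-down copy of it still receive a distance near zero, which is what lets the method follow motifs across signals of different durations.</p>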
          <sec id="sec-2-1-3">
            <title>3.2.3. Autoencoder model</title>
            <p>The final tested model was the Autoencoder (AE) network. An autoencoder is a deep learning method which learns a representation by attempting to encode the input and then validates that representation by regenerating the original input from it [19]. Similar to the Gaussian Mixture Model, the main point of this algorithm was to compress the audio signals so that the created features would contain only the most essential information. In this case, both the MFCCs of the songs in the database and those of the queries are encoded with the created Autoencoder model. This way, we get new attribute vectors for each song and the query, and thus we can calculate the distance between the newly generated features. For this step, we chose the Euclidean distance. Finally, these distances are sorted in ascending order, and the n songs closest to the query are selected to be a recommendation for the query.</p>
            <p>The Autoencoder has encoding and decoding parts. The encoding part consists of four fully connected layers. These layers compress the flattened MFCC matrices into an array of 64 values. After that, the decoder part, which also has four fully connected layers, decodes the array back into its initial dimensions. We used the ReLU activation function in the hidden layers and the linear activation function in the output layer. The Adam optimizer was chosen to optimize the network. The network was trained with 100 epochs using a batch size of 100 songs.</p>
          </sec>
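          <p>The retrieval step can be sketched as follows (a toy illustration, not the trained network: the four-layer encoder is stood in for by a single random ReLU projection to 64 values, and all names and sizes here are assumptions):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(mfcc_flat, W):
    # Stand-in for the trained encoder: one ReLU layer projecting the
    # flattened MFCC matrix down to 64 values (the paper's encoder uses
    # four fully connected layers; a trained model would replace this).
    return np.maximum(mfcc_flat @ W, 0.0)

def recommend(query_code, db_codes, top_n=5):
    # Euclidean distance in the encoded space, sorted ascending;
    # the top_n closest songs form the recommendation list.
    dists = np.linalg.norm(db_codes - query_code, axis=1)
    return np.argsort(dists)[:top_n]

# toy database: 20 "songs" with flattened MFCC profiles of length 256
db = rng.normal(size=(20, 256))
W = rng.normal(size=(256, 64)) / 16.0          # illustrative projection
codes = np.array([encode(s, W) for s in db])
query = db[7] + 0.01 * rng.normal(size=256)    # near-duplicate of song 7
top = recommend(encode(query, W), codes, top_n=3)
```

          <p>Here the near-duplicate of song 7 is retrieved first, since its encoding stays close to song 7's.</p>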
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Evaluation</title>
      <p>The evaluation of music recommendations is a rather subjective matter. Many people perceive music similarity differently; therefore, the constructed tests must stay unbiased. In this work, we used two ways to test the recommendation systems: counting the number of same artist, same album and same genre songs in the recommendation lists, and comparing the algorithms' music sorting results with human rankings.</p>
      <sec id="sec-3-1">
        <title>4.1. Number of same artists, albums and genre songs in the recommendations list</title>
        <p>This test assumes that the songs from the same artist,
album, and genre are similar. Thus, we can say that
similarity detection is good if the algorithm detects as
many same artists, albums, and genre songs as
possible. Such a test was performed for each created system.
First, we selected a query to test the systems (the query
is described in Table 2). Then we analyzed the list of
top 50 songs recommended by the algorithms.</p>
        <p>The results of the first test are given in Figure 2. It
can be observed that the best algorithm in all three
categories was the Gaussian Mixture model. No other
algorithm had songs from the same album in their
recommendation lists. However, this was expected since
there are only 11 songs in the selected album, which
are not necessarily similar to the query. The other
remaining algorithms performed almost equally poorly,
although the Autoencoder model found the smallest
number of similar songs.</p>
      </sec>
      <sec id="sec-3-2">
        <title>4.2. Music sorting results in comparison with the human ability to sort</title>
        <p>In the second test, the created algorithms were compared to the human ability to rank songs by their similarity. The experiment involved 23 volunteers who were given ten different query songs. For each query, the volunteers had to sort three given music clips from the most similar (first place) to the least similar (third place). As a result, relative positions of the lined-up songs were provided for each query. Generally, these calculations can be written with formula (3), where r1, r2, and r3 are the percentages of volunteers who voted for the first, second, and third places, respectively.</p>
        <p>rank = 1 ⋅ r1 + 2 ⋅ r2 + 3 ⋅ r3 (3)</p>
        <p>Figure 2: Number of same artist, album and genre songs in the recommendations generated by each algorithm. The X-axis depicts the three performed tests and the Y-axis shows the number of similar songs found. GMM_kl – Gaussian Mixture Model method; DTWi and DTWd – independent and dependent Dynamic Time Warping methods; AE – Autoencoder method.</p>
        <p>The algorithms were given the same task: they had to sort the given music clips by their similarity to the query. To compare the rankings, the Mean Square Error (4) was used, where y is the results of the surveys and ŷ is the ranking of the algorithms.</p>
        <p>MSE = (1/n) ∑ᵢ₌₁ⁿ (yᵢ − ŷᵢ)² (4)</p>
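        <p>Formulas (3) and (4) can be checked with a few lines (a sketch; the function and variable names are illustrative):</p>

```python
import numpy as np

def human_rank(r1, r2, r3):
    # formula (3): weighted average position; r1..r3 are the shares of
    # volunteers who put the clip in first, second and third place.
    return 1 * r1 + 2 * r2 + 3 * r3

def mse(y, y_hat):
    # formula (4): mean squared error between the survey ranks y and
    # the ranks y_hat assigned by an algorithm.
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return float(np.mean((y - y_hat) ** 2))

# e.g. 70% of volunteers ranked a clip first, 20% second, 10% third:
score = human_rank(0.7, 0.2, 0.1)   # 1*0.7 + 2*0.2 + 3*0.1 = 1.4
```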
        <p>Figure 3 shows the accuracy of the methods for each query individually. Table 3 shows the total MSE estimates for all of the methods.</p>
        <p>Similarly to the first test, the accuracy of the Gaussian mixture method was the highest. However, the accuracies of the two Dynamic Time Warping methods behaved slightly differently in this test. While the independent DTW method had a higher score in the first test, the dependent DTW method generated better recommendations in the second test.</p>
        <p>The Pearson Correlation Coefficient (5) was used to compare the rankings of the methods.</p>
        <p>Table 4: Methods correlation matrix values</p>
        <p>Since only three songs were available for sorting, there were few possible lineup options. Therefore, even if the algorithms misidentified the similarity, there was still a high chance that they would make similar recommendations to each other.</p>
        <p>Finally, after the second test, it can be stated that the
most accurate method for similar music
recommendations was the Gaussian Mixture Model. The remaining
algorithms, same as in the first test, gave inferior
recommendations.</p>
        <p>,   are the standard deviations of these
fea(see Table 4 and Figure 4).</p>
        <p>,
=
cov( ,  )
   
(5)</p>
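        <p>Formula (5) is available directly in numpy; with two illustrative ranking series (assumed values, not the paper's data), an anti-correlated pair yields a negative coefficient:</p>

```python
import numpy as np

# hypothetical per-query rank values for two methods (illustrative only)
gmm_ranks = np.array([1.2, 1.5, 2.8, 1.1])
ae_ranks  = np.array([2.6, 2.4, 1.3, 2.9])

# np.corrcoef implements cov(X, Y) / (sigma_X * sigma_Y)
rho = np.corrcoef(gmm_ranks, ae_ranks)[0, 1]
```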
        <p>The methods correlation matrix showed that the DTWi, DTWd, and AE methods had a weak negative correlation with the GMM_KL method. This was due to several query tests (e.g., Q1 or Q3 in Figure 3), where the rankings produced by the algorithms diverged.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusions and recommendations</title>
      <sec id="sec-4-1">
        <title>5.1. Results</title>
        <p>In this work, an analysis of existing content-based music recommendation systems was performed. Different recommendation systems have been developed using the best-found algorithms (Gaussian Mixture Models, Dynamic Time Warping distance, and Autoencoder networks). Two types of accuracy evaluations were performed for the developed systems: finding the number of same artist, same album, and same genre songs; and comparing the algorithms' music sorting with the human-ranked results.</p>
      </sec>
      <sec id="sec-4-2">
        <title>5.2. Conclusions</title>
        <p>The following conclusions were formulated based on
the obtained results:
1. The analysis of the algorithms’
recommendations showed that the best performing algorithm
was the Gaussian mixture model. The GMM was
able to identify similar songs in terms of an
audio signal, and the recommendations generated
by the algorithm were in line with the majority
of survey respondents.
2. Dynamic Time Warping methods performed
slightly worse than the Gaussian Mixture
Method. Both methods (dependent and
independent) were able to find at least several songs
from the same artist or genre. The accuracy
of the independent method in the first test was
higher due to the higher number of songs in the
same genre in the recommendations list. In the
second test, the dependent method was more
accurate. The Dynamic Time Warping method
provided almost as accurate recommendations
as the Gaussian mixture method in this test.
3. The worst-performing method was modelling
the musical features using Autoencoder
networks. This method had the lowest accuracy of
all the tested algorithms in both tests. This
result suggests that the used Autoencoder network is
too simple to model complex musical features.</p>
      </sec>
      <sec id="sec-4-3">
        <title>5.3. Recommendations</title>
        <p>The following are some recommendations for future
plans that will allow us to improve the developed
algorithms in the future:
1. Increase the collected database by the number
of music signals in it and by the number of
variables describing the characteristics of the
collected signals.
2. Improve algorithms’ testing methods. So far,
the survey is relatively small (10 queries, three
songs for each). However, it is planned to
increase the scope of this test and increase the
number of people participating in the study.
3. Since deep neural network methods are often
used in the literature to solve the problem of
music modelling, it is planned to extend the
usage of these methods for music recommendation
in the future. The tested Autoencoder method
could be transformed into a more complex
neural network structure. Since the modelling of
music with Gaussian distributions has proven to
be the most successful in this paper, a simple
Autoencoder could be changed to a variational
Autoencoder. In the variational Autoencoder, each
music feature is described by a Gaussian
distribution instead of encoding the data into simple
vectors. Another potentially suitable variant of
the neural network is the recurrent neural
network. A recurrent neural network would allow
the exploration of excerpts from songs of
different lengths (both longer and shorter), thus
creating a universal model for identifying similar
music.</p>
        <p>[17] Donald J. Berndt and James Clifford. “Using Dynamic Time Warping to Find Patterns in Time Series”. In: Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining (1994), p. 12.</p>
        <p>[18] Mohammad Shokoohi-Yekta et al. “Generalizing DTW to the multi-dimensional case requires an adaptive approach”. In: Data Mining and Knowledge Discovery 31.1 (Jan. 2017), pp. 1–31. issn: 1384-5810. doi: 10.1007/s10618-016-0455-0. url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5668684/ (visited on 08/21/2021).</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] U.S. Sales Database. Available at https://www.riaa.com/u-s-sales-database/ (visited on 12/10/2021).</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] MID-YEAR 2021 RIAA REVENUE STATISTICS. Available at https://www.riaa.com/wp-content/uploads/2021/09/Mid-Year-2021-RIAA-Music-Revenue-Report.pdf (visited on 12/10/2021).</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[3] Spotify Revenue and Usage Statistics (2021). Business of Apps, July 2021. Available at https://www.businessofapps.com/data/spotify-statistics/ (visited on 08/16/2021).</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[4] SoundCloud Revenue and Usage Statistics (2021). Business of Apps, Sept. 2020. Available at https://www.businessofapps.com/data/soundcloud-statistics/ (visited on 12/11/2021).</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[5] In: International Journal of Emerging Technologies in Learning (iJET) 16 (Feb. 12, 2021). doi: 10.3991/ijet.v16i03.18851.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[6] Terence Magno and Carl Sable. “A comparison of signal based music recommendation to genre labels, collaborative filtering, musicological analysis, human recommendation and random baseline”. In: ISMIR 2008 – Session 2a – Music Recommendation and Organization (2008), p. 6.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[7] Beth Logan and Ariel Salomon. “A Content-Based Music Similarity Function”. In: Cambridge Research Laboratory Technical Report Series (Dec. 2002).</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[8] Jean-Julien Aucouturier and Francois Pachet. “Finding songs that sound the same”. In: Proc. of IEEE Benelux Workshop on Model based Processing and Coding of Audio. 2002, pp. 1–8.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[9] Aäron van den Oord, Sander Dieleman, and Benjamin Schrauwen. “Deep content-based music recommendation”. In: Advances in Neural Information Processing Systems 26. Neural Information Processing Systems Foundation (NIPS), 2013. isbn: 978-1-63266-024-4. url: http://hdl.handle.net/1854/LU-4324554 (visited on 07/14/2021).</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[10] Cheng Yang. “MACS: music audio characteristic sequence indexing for similarity retrieval”. In: Proceedings of the 2001 IEEE Workshop on the Applications of Signal Processing to Audio and Acoustics (Cat. No.01TH8575). New Platz, NY, USA: IEEE, 2001, pp. 123–126. isbn: 978-0-7803-7126-2. doi: 10.1109/ASPAA.2001.969558. url: http://ieeexplore.ieee.org/document/969558/ (visited on 07/15/2021).</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[19] “Autoencoders”. In: arXiv:2003.05991 [cs, stat] (Apr. 2021). arXiv: 2003.05991. url: http://arxiv.org/abs/2003.05991 (visited on 08/21/2021).</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[11] Malcolm Slaney, Kilian Weinberger, and William White. “Learning a Metric for Music Similarity”. In: ISMIR 2008 – Session 3a – Content-Based Retrieval, Categorization and Similarity 1 (2008), p. 6.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[12] Kristy Choi et al. “Encoding Musical Style with Transformer Autoencoders”. In: arXiv:1912.05537 [cs, eess, stat] (June 2020). arXiv: 1912.05537. url: http://arxiv.org/abs/1912.05537 (visited on 08/23/2021).</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[13] Cheng-Zhi Anna Huang et al. “Music Transformer: Generating Music with Long-Term Structure”. In: International Conference on Learning Representations (ICLR) 2019 (2019), p. 15.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[14] Eunjeong Koh and Shlomo Dubnov. “Comparison and Analysis of Deep Audio Embeddings for Music Emotion Recognition”. In: arXiv:2104.06517 [cs, eess] (Apr. 2021). arXiv: 2104.06517. url: http://arxiv.org/abs/2104.06517 (visited on 08/23/2021).</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[15] Fanny Roche et al. “Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models”. In: IEEE SMC 2019 (2019), p. 8.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[16] Yufeng Zhang et al. “On the Properties of Kullback-Leibler Divergence Between Gaussians”. In: arXiv:2102.05485 [cs, math] (May 27, 2021). arXiv: 2102.05485. url: http://arxiv.org/abs/2102.05485 (visited on 03/31/2022).</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>