<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Machine learning methods for content-based music recommendation systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Milita Songailaitė</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tomas Krilavičius</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Vytautas Magnus University, Faculty of Informatics</institution>
          ,
          <addr-line>Vileikos street 8, LT-44404 Kaunas</addr-line>
          ,
          <country country="LT">Lithuania</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>With the increased popularity of online music streaming, people find themselves spending more and more time choosing the content they like. This creates the need for a fast and accurate music recommendation method, which would let users ignore the large quantities of unwanted music and choose precisely what they like. This work presents methods to compare music based entirely on its audio signal properties. For this, we used three different approaches (Gaussian mixture models, dynamic time warping and autoencoders) to calculate the similarity between the given signals. All three experiments were performed on a database consisting of the 2511 most popular songs from 10 different genres. The methods were evaluated by comparing the algorithms' results with the music similarity results given by experts. The Gaussian mixture model was the best-evaluated method, while the worst was the autoencoder method.</p>
      </abstract>
      <kwd-group>
        <kwd>Machine learning</kwd>
        <kwd>Gaussian mixture models</kwd>
        <kwd>dynamic time warping</kwd>
        <kwd>autoencoder</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The music industry has shifted rapidly towards online streaming services during the last decade. For example, in 2011, more than 80% of music sales revenues consisted of physical and digital records [1]; however, by the year 2021, the majority of the revenues came from music streaming services alone [2]. Moreover, the amount of music that people can find online is also increasing. For instance, Spotify, one of the largest music broadcasters, offers customers a selection of more than 70 million songs, supplementing this selection with 60,000 new songs each day [<xref ref-type="bibr" rid="ref4">3</xref>]. In addition, another popular music broadcaster, SoundCloud, uploads nearly 12 hours of music content online each hour [<xref ref-type="bibr" rid="ref5">4</xref>].</p>
      <p>In order to deal with such vast amounts of data, companies use music recommendation algorithms. These algorithms can be grouped into two main categories: content-based and context-based recommendations [<xref ref-type="bibr" rid="ref6">5</xref>]. Context-based recommendation systems use users' data (such as liked-songs history or the list of favorite genres) as well as similar customers' choices. This may lead to the system recommending only the songs that other users find similar. In other words, the algorithm will start recommending the most popular songs among similar users and leave less popular songs behind. Content-based recommendation systems avoid this problem because the recommendations are based purely on the signal's audio properties and nothing else.</p>
      <p>In this work, we focus only on content-based recommendation systems. A features profile was generated to characterize each song, which consisted of various audio signal properties created by several different methods. The generated feature profiles were used as song vectors in the created content-based recommendation systems.</p>
      <p>The rest of the paper is organized as follows. Related work is provided in Section 2. The recommendation systems are discussed in Section 3. The evaluations of the proposed systems are provided in Section 4. The conclusions are given in Section 5.</p>
      <p>KIVUS 2022: 27th International Conference on Information Technology, May 12, 2022, Kaunas, Lithuania. milita.songailaite@stud.vdu.lt (M. Songailaitė); tomas.krilavicius@vdu.lt (T. Krilavičius). © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.</p>
    </sec>
    <sec id="sec-1b">
      <title>2. Related work</title>
      <p>The analyzed literature covers two main topics: content-based music recommendation systems and the usage of deep learning for audio signal feature generation.</p>
      <p>Most of the papers ([<xref ref-type="bibr" rid="ref7">6</xref>], [<xref ref-type="bibr" rid="ref8">7</xref>], [<xref ref-type="bibr" rid="ref9">8</xref>], [<xref ref-type="bibr" rid="ref10">9</xref>]) on content-based systems used Mel-frequency cepstral coefficients (MFCCs) as the audio profile for the songs. The profiles were compared using means, Gaussian mixture models and Kullback-Leibler divergence measures, which distinguished the most similar profiles and recommended them as the most similar songs to the given queries. The system described in [<xref ref-type="bibr" rid="ref12">10</xref>] simply used the Short-time Fourier transform to capture the features of the signals. In order to compare the gathered time series, the authors used the Locality-sensitive hashing method.</p>
      <p>The method worked especially well when detecting song covers. A slightly different approach was introduced in [<xref ref-type="bibr" rid="ref15">11</xref>]. The authors used feature vectors consisting of 18 distinct features provided by The Echo Nest API. The feature profiles were clustered using the K-means algorithm to gather clusters of similar songs.</p>
      <p>The second part of the literature review focused specifically on the use of deep learning in music recommendation systems. Some of the most popular approaches used deep learning to generate new sound representation vectors for the signals. Papers [<xref ref-type="bibr" rid="ref16">12</xref>], [<xref ref-type="bibr" rid="ref17">13</xref>] describe how the autoencoder architecture can be used to encode musical style, which is later used to generate music. The authors of [<xref ref-type="bibr" rid="ref18">14</xref>] tested two different architectures (OpenL3 and VGGish) for encoding audio signals. These encodings helped the authors distinguish the emotions of the songs and classify them.</p>
      <p>Figure 1: Workflow of the proposed system</p>
      <p>Finally, in [<xref ref-type="bibr" rid="ref19">15</xref>] the authors compared variational autoencoders with LSTM and recurrent networks. The research showed that almost all deep learning networks outperformed baseline principal components methods for feature vector generation.</p>
    </sec>
    <sec id="sec-2">
      <title>3. Recommendation system overview</title>
      <p>After analyzing the literature and testing the most popular algorithms, we selected three main directions in which our research was carried out:
1. a Gaussian mixture model for feature space modelling;
2. vector similarity using dynamic time warping distance;
3. an autoencoder network for feature encoding.</p>
      <p>The workflow of the experiments is depicted in Figure 1. First, we produced the Mel-frequency scale cepstral coefficients, the baseline of the songs' features profile. For each song, we generated the first 8 MFCCs at a sample rate of 22 050 Hz. After that, in order to create the songs' audio profiles, the existing MFCCs were transformed using the three mentioned methods. In the end, the created recommendation systems were evaluated using two methods:
1. finding the average number of same genre, same artist and same album songs among the generated list of recommended songs;
2. comparing the music sorting results of the algorithms with the human ability to sort by similarity.</p>
      <sec id="sec-2-1">
        <title>3.1. Dataset</title>
        <p>A dataset consisting of 2511 popular songs was used to evaluate the developed music recommendation systems. The database consisted of songs created by 135 artists from 10 different genres. The distribution of songs across genres is presented in Table 1. All songs were converted and stored in the uncompressed .wav format. This form of the signal was the input for all models. In addition to the audio signals, we also stored a unique id, the song's artist, album, and genre for each song. However, the metadata was only used to test the recommendation systems. Therefore, the signals' audio profiles consisted purely of the songs' audio data.</p>
        <table-wrap id="tbl1">
          <label>Table 1</label>
          <caption>
            <p>Distribution of songs across genres</p>
          </caption>
          <table>
            <thead>
              <tr><th>Genre</th><th>Number of songs</th></tr>
            </thead>
            <tbody>
              <tr><td>Pop</td><td>275</td></tr>
              <tr><td>Rock</td><td>246</td></tr>
              <tr><td>Metal</td><td>236</td></tr>
              <tr><td>EDM</td><td>257</td></tr>
              <tr><td>Kpop</td><td>265</td></tr>
              <tr><td>Country</td><td>236</td></tr>
              <tr><td>Classical</td><td>203</td></tr>
              <tr><td>Jazz</td><td>252</td></tr>
              <tr><td>Blues</td><td>253</td></tr>
              <tr><td>Rap</td><td>288</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <sec id="sec-2-1-1">
          <title>3.2.1. Gaussian mixture model</title>
          <p>The first music modelling algorithm used was the Gaussian Mixture Model (GMM). This method was initially mentioned in [<xref ref-type="bibr" rid="ref9">8</xref>]. First, all 30 s audio signals are clustered into 5 clusters using a Gaussian mixture classifier. The original paper uses only 3 clusters, but testing has shown that a slightly larger number of clusters works better in the system. The created Gaussian mixture models defined the characteristics of each audio signal in 5 Gaussian distributions. Next, these distributions were compared using Kullback-Leibler (KL) divergence. This divergence measure shows how one distribution differs from the other. It can be calculated using formula (1), where p(x) and q(x) are the probability distributions [<xref ref-type="bibr" rid="ref20">16</xref>]. Since the Gaussian mixture model describes each audio profile by Gaussian distributions, we can compare Gaussian models based on this measure.</p>
          <p>KL(p, q) = ∫ℝ p(x) ln(p(x) / q(x)) dx (1)</p>
          <p>The proposed model describes each song by 5 clusters, each characterized by 8 Gaussian distributions. Each distribution in the first cluster of the query is compared to the corresponding distribution in the first cluster of the song from the database. These calculations are repeated for each cluster, and the resulting eight estimates are averaged. We thus obtain a vector of 5 values, where each value reflects the similarity between the two clusters of audio signals being compared. These values are also averaged. Therefore, we get one value, which reflects the similarity of the Gaussian Mixture Models.</p>
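          <p>As an illustration of the comparison scheme above, the following sketch (not the paper's implementation; the function names and the profile layout are assumptions, and the closed-form univariate KL expression replaces the integral in (1)) averages per-distribution KL estimates first within and then across clusters:</p>

```python
import numpy as np

def kl_gaussian(mu_p, var_p, mu_q, var_q):
    # Closed-form KL divergence between two univariate Gaussians;
    # this is formula (1) evaluated analytically for Gaussian p and q.
    return 0.5 * (np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)

def song_similarity(profile_a, profile_b):
    # profile_*: shape (5, 8, 2) -- 5 clusters, 8 Gaussians per cluster,
    # last axis = (mean, variance). Lower value = more similar songs.
    per_cluster = []
    for cluster_a, cluster_b in zip(profile_a, profile_b):
        # compare the k-th distribution of one cluster with the k-th of the other
        kls = [kl_gaussian(a[0], a[1], b[0], b[1]) for a, b in zip(cluster_a, cluster_b)]
        per_cluster.append(np.mean(kls))   # average the 8 estimates
    return float(np.mean(per_cluster))     # average the 5 cluster values
```

          <p>Identical profiles yield a value of 0, and the value grows as the distributions drift apart.</p>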
        </sec>
        <sec id="sec-2-1-2">
          <title>3.2.2. Dynamic time warping method</title>
          <p>Dynamic Time Warping (DTW) was the second method tested to model the songs' audio profiles. Unlike the Gaussian mixture method, this approach compares the MFCC sequences directly. The Dynamic Time Warping distance measure was chosen since it lets us find the distance between two time series [17]. This gives us more freedom when capturing similarities between audio signals of two different durations and lets us follow musical motifs across the audio signal. The distance is computed with the recurrence measure (2), where d(qi, cj) is usually the Euclidean distance; following [18], both the dependent and the independent multi-dimensional DTW variants were applied.</p>
          <p>D(i, j) = d(qi, cj) + min(D(i − 1, j), D(i, j − 1), D(i − 1, j − 1)) (2)</p>
          <p>The result is an array of songs sorted in ascending order according to the calculated distances. Finally, the n most similar songs are selected and considered a recommendation for a given query.</p>
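          <p>The recurrence above can be sketched as follows (a minimal reference implementation of standard DTW with a Euclidean local cost; dtw_distance is an illustrative name, and a production system would use an optimized library):</p>

```python
import numpy as np

def dtw_distance(x, y):
    # Standard DTW dynamic program: D[i, j] = d(x_i, y_j) +
    # min(D[i-1, j], D[i, j-1], D[i-1, j-1]), with Euclidean local cost d.
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(np.atleast_1d(x[i - 1]) - np.atleast_1d(y[j - 1]))
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])
```

          <p>Because the warping path may stretch either sequence, a song and a slowed-down copy of it still receive a distance near zero, which is what lets the method follow motifs across signals of different durations.</p>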
          <sec id="sec-2-1-3">
            <title>3.2.3. Autoencoder model</title>
            <p>The final tested model was the Autoencoder (AE) network. An autoencoder is a deep learning method which learns a representation by attempting to encode the input and then validates that representation by regenerating the original input from it [19]. Similar to the Gaussian Mixture Model, the main point of this algorithm was to compress the audio signals so that the created features would contain only the most essential information. In this case, both the MFCCs of the songs in the database and those of the queries are encoded with the created Autoencoder model. This way, we get new attribute vectors for each song and the query, and thus we can calculate the distance between the newly generated features. For this step, we chose the Euclidean distance. Finally, these distances are sorted in ascending order, and the n songs closest to the query are selected to be a recommendation for the query.</p>
            <p>The Autoencoder has encoding and decoding parts. The encoding part consists of four fully connected layers. These layers compress the flattened MFCC matrices into an array of 64 values. After that, the decoder part, which also has four fully connected layers, decodes the array back into its initial dimensions. We used the ReLU activation function in the hidden layers and the linear activation function in the output layer. The Adam optimizer was chosen to optimize the network. The network was trained with 100 epochs using a batch size of 100 songs.</p>
          </sec>
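          <p>The retrieval step can be sketched as follows (a toy illustration, not the trained network: the four-layer encoder is stood in for by a single random ReLU projection to 64 values, and all names and sizes here are assumptions):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(mfcc_flat, W):
    # Stand-in for the trained encoder: one ReLU layer projecting the
    # flattened MFCC matrix down to 64 values (the paper's encoder uses
    # four fully connected layers; a trained model would replace this).
    return np.maximum(mfcc_flat @ W, 0.0)

def recommend(query_code, db_codes, top_n=5):
    # Euclidean distance in the encoded space, sorted ascending;
    # the top_n closest songs form the recommendation list.
    dists = np.linalg.norm(db_codes - query_code, axis=1)
    return np.argsort(dists)[:top_n]

# toy database: 20 "songs" with flattened MFCC profiles of length 256
db = rng.normal(size=(20, 256))
W = rng.normal(size=(256, 64)) / 16.0          # illustrative projection
codes = np.array([encode(s, W) for s in db])
query = db[7] + 0.01 * rng.normal(size=256)    # near-duplicate of song 7
top = recommend(encode(query, W), codes, top_n=3)
```

          <p>Here the near-duplicate of song 7 is retrieved first, since its encoding stays close to song 7's.</p>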
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Evaluation</title>
      <p>The evaluation of music recommendations is a rather subjective matter. Many people perceive music similarity differently; therefore, the constructed tests must stay unbiased. In this work, we used two ways to test the recommendation systems: counting the number of same artist, same album and same genre songs in the recommendation lists, and comparing the algorithms' music sorting results with human rankings.</p>
      <sec id="sec-3-1">
        <title>4.1. Number of same artists, albums and genre songs in the recommendations list</title>
        <p>This test assumes that the songs from the same artist,
album, and genre are similar. Thus, we can say that
similarity detection is good if the algorithm detects as
many same artists, albums, and genre songs as
possible. Such a test was performed for each created system.
First, we selected a query to test the systems (the query
is described in Table 2). Then we analyzed the list of
top 50 songs recommended by the algorithms.</p>
        <p>The results of the first test are given in Figure 2. It
can be observed that the best algorithm in all three
categories was the Gaussian Mixture model. No other
algorithm had songs from the same album in their
recommendation lists. However, this was expected since
there are only 11 songs in the selected album, which
are not necessarily similar to the query. The other
remaining algorithms performed almost equally poorly,
although the Autoencoder model found the smallest
number of similar songs.</p>
      </sec>
      <sec id="sec-3-2">
        <title>4.2. Music sorting results in comparison with the human ability to sort</title>
        <p>In the second test, the created algorithms were compared to the human ability to rank songs by their similarity. The experiment involved 23 volunteers who were given ten different query songs. For each query, the volunteers had to sort three given music clips from the most similar (first place) to the least similar (third place). As a result, relative positions of the lined-up songs were provided for each query. Generally, these calculations can be written with formula (3), where r1, r2, and r3 are the percentages of volunteers who voted for the first, second, and third places, respectively.</p>
        <p>rank = 1 ⋅ r1 + 2 ⋅ r2 + 3 ⋅ r3 (3)</p>
        <p>Figure 2: Number of same artist, album and genre songs in the recommendations generated by each algorithm. The X-axis depicts the three performed tests and the Y-axis shows the number of similar songs found. GMM_kl – Gaussian Mixture Model method; DTWi and DTWd – independent and dependent Dynamic Time Warping methods; AE – Autoencoder method.</p>
        <p>The algorithms were given the same task: they had to sort the given music clips by their similarity to the query. To compare the rankings, the Mean Square Error (4) was used, where y is the results of the surveys and ŷ is the ranking of the algorithms.</p>
        <p>MSE = (1/n) ∑ᵢ₌₁ⁿ (yᵢ − ŷᵢ)² (4)</p>
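        <p>Formulas (3) and (4) can be checked with a few lines (a sketch; the function and variable names are illustrative):</p>

```python
import numpy as np

def human_rank(r1, r2, r3):
    # formula (3): weighted average position; r1..r3 are the shares of
    # volunteers who put the clip in first, second and third place.
    return 1 * r1 + 2 * r2 + 3 * r3

def mse(y, y_hat):
    # formula (4): mean squared error between the survey ranks y and
    # the ranks y_hat assigned by an algorithm.
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return float(np.mean((y - y_hat) ** 2))

# e.g. 70% of volunteers ranked a clip first, 20% second, 10% third:
score = human_rank(0.7, 0.2, 0.1)   # 1*0.7 + 2*0.2 + 3*0.1 = 1.4
```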
        <p>Figure 3 shows the accuracy of the methods for each query individually. Table 3 shows the total MSE estimates for all of the methods.</p>
        <p>Similarly to the first test, the accuracy of the Gaussian mixture method was the highest. However, the accuracies of the two Dynamic Time Warping methods behaved slightly differently in this test. While the independent DTW method had a higher score in the first test, the dependent DTW method generated better recommendations in the second test.</p>
        <p>The Pearson Correlation Coefficient (5) was used to compare the rankings of the methods.</p>
        <p>Table 4: Methods correlation matrix values</p>
        <p>Since only three songs were available for sorting, there were few possible lineup options. Therefore, even if the algorithms misidentified the similarity, there was still a high chance that they would make similar recommendations to each other.</p>
        <p>Finally, after the second test, it can be stated that the
most accurate method for similar music
recommendations was the Gaussian Mixture Model. The remaining
algorithms, same as in the first test, gave inferior
recommendations.</p>
        <p>,   are the standard deviations of these
fea(see Table 4 and Figure 4).</p>
        <p>,
=
cov( ,  )
   
(5)</p>
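        <p>Formula (5) is available directly in numpy; with two illustrative ranking series (assumed values, not the paper's data), an anti-correlated pair yields a negative coefficient:</p>

```python
import numpy as np

# hypothetical per-query rank values for two methods (illustrative only)
gmm_ranks = np.array([1.2, 1.5, 2.8, 1.1])
ae_ranks  = np.array([2.6, 2.4, 1.3, 2.9])

# np.corrcoef implements cov(X, Y) / (sigma_X * sigma_Y)
rho = np.corrcoef(gmm_ranks, ae_ranks)[0, 1]
```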
        <p>The methods correlation matrix showed that the DTWi, DTWd, and AE methods had a weak negative correlation with the GMM_KL method. This was due to several query tests (e.g., Q1 or Q3 in Figure 3), where the rankings produced by the algorithms diverged.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusions and recommendations</title>
      <sec id="sec-4-1">
        <title>5.1. Results</title>
        <p>In this work, an analysis of existing content-based music recommendation systems was performed. Different recommendation systems have been developed using the best-found algorithms (Gaussian Mixture Models, Dynamic Time Warping distance, and Autoencoder networks). Two types of accuracy evaluations were performed for the developed systems: finding the number of same artist, same album, and same genre songs; and comparing the algorithms' music sorting with the human-ranked results.</p>
      </sec>
      <sec id="sec-4-2">
        <title>5.2. Conclusions</title>
        <p>The following conclusions were formulated based on
the obtained results:
1. The analysis of the algorithms’
recommendations showed that the best performing algorithm
was the Gaussian mixture model. The GMM was
able to identify similar songs in terms of an
audio signal, and the recommendations generated
by the algorithm were in line with the majority
of survey respondents.
2. Dynamic Time Warping methods performed
slightly worse than the Gaussian Mixture
Method. Both methods (dependent and
independent) were able to find at least several songs
from the same artist or genre. The accuracy
of the independent method in the first test was
higher due to the higher number of songs in the
same genre in the recommendations list. In the
second test, the dependent method was more
accurate. The Dynamic Time Warping method
provided almost as accurate recommendations
as the Gaussian mixture method in this test.
3. The worst-performing method was modelling
the musical features using Autoencoder
networks. This method had the lowest accuracy of
all the tested algorithms in both tests. This
result suggests that the used Autoencoder network is
too simple to model complex musical features.</p>
      </sec>
      <sec id="sec-4-3">
        <title>5.3. Recommendations</title>
        <p>The following are some recommendations for future
plans that will allow us to improve the developed
algorithms in the future:
1. Increase the collected database by the number
of music signals in it and by the number of
variables describing the characteristics of the
collected signals.
2. Improve algorithms’ testing methods. So far,
the survey is relatively small (10 queries, three
songs for each). However, it is planned to
increase the scope of this test and increase the
number of people participating in the study.
3. Since deep neural network methods are often
used in the literature to solve the problem of
music modelling, it is planned to extend the
usage of these methods for music recommendation
in the future. The tested Autoencoder method
could be transformed into a more complex
neural network structure. Since the modelling of
music with Gaussian distributions has proven to
be the most successful in this paper, a simple
Autoencoder could be changed to a variational
Autoencoder. In the variational Autoencoder, each
music feature is described by a Gaussian
distribution instead of encoding the data into simple
vectors. Another potentially suitable variant of
the neural network is the recurrent neural
network. A recurrent neural network would allow
the exploration of excerpts from songs of
different lengths (both longer and shorter), thus
creating a universal model for identifying similar
music.</p>
        <p>[17] Donald J. Berndt and James Clifford. “Using Dynamic Time Warping to Find Patterns in Time Series”. In: Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining (1994), p. 12.</p>
        <p>[18] Mohammad Shokoohi-Yekta et al. “Generalizing DTW to the multi-dimensional case requires an adaptive approach”. In: Data Mining and Knowledge Discovery 31.1 (Jan. 2017), pp. 1–31. issn: 1384-5810. doi: 10.1007/s10618-016-0455-0. url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5668684/ (visited on 08/21/2021).</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] U.S. Sales Database. Available at https://www.riaa.com/u-s-sales-database/ (visited on 12/10/2021).</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] MID-YEAR 2021 RIAA REVENUE STATISTICS. Available at https://www.riaa.com/wp-content/uploads/2021/09/Mid-Year-2021-RIAA-Music-Revenue-Report.pdf (visited on 12/10/2021).</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[3] Spotify Revenue and Usage Statistics (2021). Business of Apps, July 2021. Available at https://www.businessofapps.com/data/spotify-statistics/ (visited on 08/16/2021).</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[4] SoundCloud Revenue and Usage Statistics (2021). Business of Apps, Sept. 2020. Available at https://www.businessofapps.com/data/soundcloud-statistics/ (visited on 12/11/2021).</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[5] In: International Journal of Emerging Technologies in Learning (iJET) 16 (Feb. 12, 2021). doi: 10.3991/ijet.v16i03.18851.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[6] Terence Magno and Carl Sable. “A comparison of signal based music recommendation to genre labels, collaborative filtering, musicological analysis, human recommendation and random baseline”. In: ISMIR 2008 – Session 2a – Music Recommendation and Organization (2008), p. 6.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[7] Beth Logan and Ariel Salomon. “A Content-Based Music Similarity Function”. In: Cambridge Research Laboratory Technical Report Series (Dec. 2002).</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[8] Jean-Julien Aucouturier and Francois Pachet. “Finding songs that sound the same”. In: Proc. of IEEE Benelux Workshop on Model based Processing and Coding of Audio. 2002, pp. 1–8.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[9] Aäron van den Oord, Sander Dieleman, and Benjamin Schrauwen. “Deep content-based music recommendation”. In: Advances in Neural Information Processing Systems 26. Neural Information Processing Systems Foundation (NIPS), 2013. isbn: 978-1-63266-024-4. url: http://hdl.handle.net/1854/LU-4324554 (visited on 07/14/2021).</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[10] Cheng Yang. “MACS: music audio characteristic sequence indexing for similarity retrieval”. In: Proceedings of the 2001 IEEE Workshop on the Applications of Signal Processing to Audio and Acoustics (Cat. No.01TH8575). New Platz, NY, USA: IEEE, 2001, pp. 123–126. isbn: 978-0-7803-7126-2. doi: 10.1109/ASPAA.2001.969558. url: http://ieeexplore.ieee.org/document/969558/ (visited on 07/15/2021).</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[19] “Autoencoders”. In: arXiv:2003.05991 [cs, stat] (Apr. 2021). arXiv: 2003.05991. url: http://arxiv.org/abs/2003.05991 (visited on 08/21/2021).</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[11] Malcolm Slaney, Kilian Weinberger, and William White. “Learning a Metric for Music Similarity”. In: ISMIR 2008 – Session 3a – Content-Based Retrieval, Categorization and Similarity 1 (2008), p. 6.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[12] Kristy Choi et al. “Encoding Musical Style with Transformer Autoencoders”. In: arXiv:1912.05537 [cs, eess, stat] (June 2020). arXiv: 1912.05537. url: http://arxiv.org/abs/1912.05537 (visited on 08/23/2021).</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[13] Cheng-Zhi Anna Huang et al. “Music Transformer: Generating Music with Long-Term Structure”. In: International Conference on Learning Representations (ICLR) 2019 (2019), p. 15.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[14] Eunjeong Koh and Shlomo Dubnov. “Comparison and Analysis of Deep Audio Embeddings for Music Emotion Recognition”. In: arXiv:2104.06517 [cs, eess] (Apr. 2021). arXiv: 2104.06517. url: http://arxiv.org/abs/2104.06517 (visited on 08/23/2021).</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[15] Fanny Roche et al. “Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models”. In: IEEE SMC 2019 (2019), p. 8.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[16] Yufeng Zhang et al. “On the Properties of Kullback-Leibler Divergence Between Gaussians”. In: arXiv:2102.05485 [cs, math] (May 27, 2021). arXiv: 2102.05485. url: http://arxiv.org/abs/2102.05485 (visited on 03/31/2022).</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>