Analyzing the Characteristics of Shared Playlists for Music Recommendation

Dietmar Jannach, Iman Kamehkhosh, Geoffray Bonnin
TU Dortmund, Germany
dietmar.jannach@tu-dortmund.de, iman.kamehkhosh@tu-dortmund.de, geoffray.bonnin@tu-dortmund.de

ABSTRACT
The automated generation of music playlists – as supported by modern music services like last.fm or Spotify – represents a special form of music recommendation. When designing a "playlisting" algorithm, the question arises which kind of quality criteria the generated playlists should fulfill and if there are certain characteristics like homogeneity, diversity or freshness that make the playlists generally more enjoyable for the listeners. In our work, we aim to obtain a better understanding of such desired playlist characteristics in order to be able to design better algorithms in the future. The research approach chosen in this work is to analyze several thousand playlists that were created and shared by users on music platforms based on musical and meta-data features. Our first results, for example, reveal that factors like popularity, freshness and diversity play a certain role for users when they create playlists manually. Comparing such user-generated playlists with automatically created ones moreover shows that today's online playlisting services sometimes generate playlists which are quite different from user-created ones. Finally, we compare the user-created playlists with playlists generated with a nearest-neighbor technique from the research literature and observe even stronger differences. This last observation can be seen as another indication that the accuracy-based quality measures from the literature are probably not sufficient to assess the effectiveness of playlisting algorithms.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval; H.5.5 [Information Interfaces and Presentation]: Sound and Music Computing

General Terms
Playlist generation, Music recommendation

Keywords
Music, playlist, analysis, algorithm, evaluation

Proceedings of the 6th Workshop on Recommender Systems and the Social Web (RSWeb 2014), collocated with ACM RecSys 2014, 10/06/2014, Foster City, CA, USA. Copyright held by the authors.

1. INTRODUCTION
The automated creation of playlists or personalized radio stations is a typical feature of today's online music platforms and music streaming services. In principle, standard recommendation algorithms based on collaborative filtering or content-based techniques can be applied to generate a ranked list of musical tracks given some user preferences or past listening history. For several reasons, the generation of playlists however represents a very specific music recommendation problem. Personal playlists are, for example, often created with a certain goal or usage context (e.g., sports, relaxation, driving) in mind. Furthermore, in contrast to relevance-ranked recommendation lists used in other domains, playlists typically obey some homogeneity and coherence criteria, i.e., there are quality characteristics that are related to the transitions between the tracks or to the playlist as a whole.

In the research literature, a number of approaches for the automation of the playlist generation process have been proposed, see, e.g., [2, 6, 8, 10, 11] or the recent survey in [3]. Some of them, for example, take a seed song or artist as an input and look for similar tracks; others try to find track co-occurrence patterns in existing playlists. In some approaches, playlist generation is considered as an optimization problem. Independent of the chosen technique, a common problem when designing new playlisting algorithms is to assess whether or not the generated playlists will be positively perceived by the listeners. User studies and online experiments are unfortunately particularly costly in the music domain. Researchers therefore often use offline experimental designs and, for example, use existing playlists shared by users on music platforms as a basis for their evaluations. The assumption is that these "hand-crafted" playlists are of good quality; typical measures used in the literature include the Recall [8] or the Average Log-Likelihood (ALL) [11]. Unfortunately, both measures have their limitations, see also [2]. The Recall measure, for example, tells us how good an algorithm is at predicting the tracks selected by the users, but does not explicitly capture specific aspects such as the homogeneity or the smoothness of track transitions.

To design better and more comprehensive quality measures, we however first have to answer the question of what users consider to be desirable characteristics of playlists or what the driving principles are when users create playlists. In the literature, a few works have studied this aspect using different approaches, e.g., user studies [1, 7] or analyzing forum posts [5]. The work presented in this paper continues these lines of research. Our research approach is however different from previous works as we aim to identify patterns in a larger set of manually created playlists that were shared by users of three different online music platforms. To be able to take a variety of potential driving factors into account in our analysis, we have furthermore collected various types of meta-data and musical features of the playlist tracks from public music databases.

Overall, with our analyses we hope to obtain insights on the principles which an automated playlist generation system should observe to end up with better-received or more "natural" playlists. To test if current music services and a nearest-neighbor algorithm from the literature generate playlists that observe the identified patterns and make similar choices as real users, we conducted an experiment in which we analyzed commonalities and differences between automatically generated and user-provided playlists. Before reporting the details of our first analyses, we will first discuss previous works in the next section.

2. PREVIOUS WORKS
In [14], Slaney and White addressed the question if users have a tendency to create very homogeneous or rather diverse playlists. As a basis for determining the diversity, they relied on an objective measure based on genre information about the tracks. Each track was considered as a point in the genre space and the diversity was then determined by calculating the volume of an ellipsoid enclosing the tracks of the playlist. An analysis of 887 user-created playlists indicated that diversity can be considered to be a driving factor as users typically create playlists covering several genres.

Sarroff and Casey more recently [13] focused on track transitions in album playlists and made an analysis to determine if there are certain musical characteristics that are particularly important. One of the results of their investigation was that fade durations and the mean timbre of the beginnings and endings of consecutive tracks seem to have a strong influence on the ordering of the tracks.

Generally, our work is similar to [14] and [13] in that we rely on user-created ("hand-crafted") playlists and look at meta-data and musical features of the tracks to identify potentially important patterns. The aspects we cover in this paper were however not covered in their work, and our analysis is based on larger datasets.

Cunningham et al. [5], in contrast, relied on another form of track-related information and looked at the user posts in the forum of the Art of the Mix web site. According to their analysis, the typical principles for setting up the playlists mentioned by the creators were related to the artist, genre, style, event or activity, but also the intended purpose, context or mood. Some users also talked about the smoothness of track transitions and how many tracks of one single artist should be included in playlists. Placing the most "important" track at the end of a playlist was another strategy mentioned by some of the playlist creators.

A different form of identifying playlist creation principles is to conduct laboratory studies with users. The study reported in [7], for example, involved 52 subjects and indicated that the first and the last tracks can play an important role for the quality of a playlist. In another study, Andric and Haus [1] concluded that the ordering of tracks is not important when the playlist mainly contains tracks which the users like in general. Reynolds et al. [12] conducted an online survey which revealed that the context and environment, such as the location, activity or the weather, can have an influence both on the listeners' mood and on the track selection behavior of playlist creators. Finally, the study presented in [9] again confirmed the importance of artists, genres and mood in the playlist creation process.

In this discussion, we have focused on previous attempts to understand how users create playlists and what their characteristics are. Playlist generation algorithms however do not necessarily have to rely on such knowledge. Instead, one can follow a statistical approach and only look at co-occurrences and transitions of tracks in existing playlists and use these patterns when creating new playlists, see e.g., [2] or [4]. This way, the quality factors respected by human playlist creators are implicitly taken into account. Such approaches, however, cannot be directly applied for many types of playlist generation settings, e.g., for creating "thematic" playlists (e.g., Christmas songs) or for creating playlists that only contain tracks that have certain musical features. Pure statistical methods are not aware of these characteristics and the danger exists that tracks are included that do not match the purpose of the list and thus lead to a limited overall quality.

3. CHARACTERISTICS OF PLAYLISTS
The ultimate goal of our research is to analyze the structure and characteristics of playlists in order to better understand the principles used by the users to create them. This section is a first step toward this goal.

3.1 Data sources
As a basis for the first analyses that we report in this paper, we used two types of playlist data.

3.1.1 Hand-crafted playlists
We used samples of hand-crafted playlists from three different sources. One set of playlists was retrieved via the public API of last.fm^1, one was taken from the Art of the Mix (AotM) website^2, and a third one was provided to us by 8tracks^3. To enhance the data quality, we corrected artist misspellings using the API of last.fm. Overall, we analyzed over 10,000 playlists containing about 108,000 different tracks of about 40,000 different artists. As a first attempt toward our goal, we retrieved the features listed in Table 1 using the public APIs of last.fm and The Echo Nest (tEN), and the MusicBrainz database.

Some dataset characteristics are shown in Table 2. The "usage count" statistics express how often tracks and artists appeared overall in the playlists. When selecting the playlists, we made sure that they do not simply contain album listings. The datasets are partially quite different, e.g., with respect to the average playlist lengths. The 8tracks dataset furthermore has the particularity that users are not allowed to include more than two tracks of one artist in case they want to share their playlist with others.

Figure 1 shows the distributions of playlist lengths. As can be seen, the distributions are quite different across the datasets.
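Statistics of the kind reported in Table 2 boil down to simple counting over the playlist collection. The following sketch shows how the sizes, per-playlist averages and "usage counts" can be derived; the playlist representation (each playlist a list of (artist, track) pairs) is a hypothetical format chosen for illustration, not the one used in our pipeline.

```python
from collections import Counter

def dataset_statistics(playlists):
    """Compute basic statistics of a playlist collection.

    `playlists` is assumed to be a list of playlists, each playlist
    a list of (artist, track) pairs -- an illustrative format.
    """
    track_counts = Counter()   # how often each track appears overall
    artist_counts = Counter()  # how often each artist appears overall
    for playlist in playlists:
        for artist, track in playlist:
            track_counts[(artist, track)] += 1
            artist_counts[artist] += 1
    n_playlists = len(playlists)
    return {
        "playlists": n_playlists,
        "tracks": len(track_counts),
        "artists": len(artist_counts),
        # average playlist length and artist variety per playlist
        "avg_tracks_per_playlist": sum(len(p) for p in playlists) / n_playlists,
        "avg_artists_per_playlist":
            sum(len({a for a, _ in p}) for p in playlists) / n_playlists,
        # "usage counts": average number of appearances per track/artist
        "avg_track_usage": sum(track_counts.values()) / len(track_counts),
        "avg_artist_usage": sum(artist_counts.values()) / len(artist_counts),
    }
```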
^1 http://www.last.fm
^2 http://www.artofthemix.org
^3 http://8tracks.com

Table 1: Additional retrieved information.

Source  | Information  | Description
last.fm | Tags         | Top tags assigned by users to the track.
last.fm | Playcounts   | Total number of times the users played the track.
tEN     | Genres       | Genres of the artist of the track. Multiple genres can be assigned to a single artist.
tEN     | Danceability | Suitability of the track for dancing, based on various information including the beat strength and the stability of the tempo.
tEN     | Energy       | Intensity released throughout the track, based on various information including the loudness and segment durations.
tEN     | Loudness     | Overall loudness of the track in decibels (dB).
tEN     | Tempo        | Speed of the track estimated in beats per minute (BPM).
tEN     | Hotttnesss   | Current reputation of the track based on its activity on some web sites crawled by the developers.
MB      | Release year | Year of release of the corresponding album.

Table 2: Some basic statistics of the datasets.

                        | last.fm | AotM   | 8tracks
Playlists               | 1,172   | 5,043  | 3,130
Tracks                  | 24,754  | 61,935 | 29,732
Artists                 | 9,925   | 23,029 | 13,379
Avg. tracks/playlist    | 26.0    | 19.7   | 12.5
Avg. artists/playlist   | 16.8    | 17.8   | 11.5
Avg. genres/playlist    | 2.7     | 3.5    | 3.4
Avg. tags/playlist      | 473.4   | 418.7  | 297.4
Avg. track usage count  | 1.2     | 1.6    | 1.3
Avg. artist usage count | 3.0     | 4.3    | 2.9

[Figure 1: Distribution of playlist sizes (frequency of each playlist length for the AotM, last.fm and 8tracks datasets).]

On 8tracks, a playlist generally has to comprise at least 8 tracks. The lengths of the last.fm playlists seem to follow a normal distribution with a maximum frequency value at around 20 tracks. Finally, the sizes of the AotM playlists are much more equally distributed.

3.1.2 Generated playlists
To assess if the playlists generated by today's online services are similar to those created by users, we used the public API of The Echo Nest. We chose this service because it uses a very large database and allows the generation of playlists from several seed tracks, as opposed to, for instance, iTunes Genius or last.fm radios. We split the existing hand-crafted playlists in half, provided the first half of the list as seed tracks to the music service, and then analyzed the characteristics of the playlist returned by The Echo Nest and compared them to the patterns that we found in hand-crafted playlists. Instead of observing whether a playlister generates playlists that are generally similar to playlists created by hand, our goal here is to break down their different characteristics and observe on which specific dimensions they differ. Notice that using the second half as seed would not be appropriate as the order of the tracks may be important.

We also draw our attention to the ability of the algorithms of the literature to reproduce the characteristics of hand-crafted playlists. According to some recent research, one of the most competitive approaches in terms of recall is the simple k-nearest-neighbors (kNN) method [2, 8]. More precisely, given some seed tracks, the algorithm extracts the k most similar playlists based on the number of shared items and recommends the tracks of these playlists. This algorithm does not require a training step and scans the entire set of available playlists for each recommendation.

3.2 Detailed observations
In the following sections, we will look at general distributions of different track characteristics.

3.2.1 Popularity of tracks
The goal of the first analysis here is to determine if users tend to position tracks in playlists depending on their popularity. In our analysis, we measure the popularity in terms of play counts. Play counts were taken from last.fm, because this is one of the most popular services and the corresponding values can be considered indicative for a larger user group.

For the measurement, we split the playlists into two parts of equal size and then determined the average play counts on last.fm for the tracks of each half. To measure to which extent the user community favors certain tracks in the playlists, we calculated the Gini index, a standard measure of inequality^4. Table 3 shows the results. In the last column, we report the statistics for the tracks returned by The Echo Nest (tEN) and kNN playlisters^5. We provided the first half of the hand-crafted playlists as seed tracks and the playlisters had to select the same number of tracks as the number of remaining tracks.

^4 We organized the average play counts in 100 bins.
^5 We determined 10 as the best neighborhood size for our datasets based on the recall value, see Section 4.

Table 3: Popularity of tracks in playlists (last.fm play counts) and concentration bias (Gini coefficient).

Play counts | 1st half | 2nd half | tEN
last.fm     | 1,007k   | 893k     | 629k
AotM        | 671k     | 638k     | 606k
8tracks     | 953k     | 897k     | 659k

Gini index  | 1st half | 2nd half | tEN
last.fm     | 0.06     | 0.04     | 0.04
AotM        | 0.20     | 0.18     | 0.22
8tracks     | 0.09     | 0.09     | 0.08

Play counts | 1st half | 2nd half | kNN
last.fm     | 1,110k   | 943k     | 1,499k
AotM        | 645k     | 617k     | 867k
8tracks     | 1,008k   | 984k     | 1,140k

Gini index  | 1st half | 2nd half | kNN
last.fm     | 0.12     | 0.09     | 0.33
AotM        | 0.26     | 0.23     | 0.43
8tracks     | 0.15     | 0.12     | 0.28

The results show that users actually tend to place more popular items in the first part of the list in all datasets when play counts are considered. The Echo Nest playlister does not seem to take that form of popularity into account and recommends on average less popular tracks. These differences are statistically significant according to a Student's t-test (p < 10^-5 for The Echo Nest playlister and p < 10^-7 for the kNN playlister). This behavior also indicates that The Echo Nest is successfully replicating the fact that the second halves of playlists are supposed to be less popular than the first halves.

The Gini index reveals that there is a slightly stronger concentration on some tracks in the first half for two of the three datasets, and the diversity slightly increases in the second part. The absolute numbers cannot be directly compared across datasets, but for the AotM dataset the concentration is generally much higher, which is also indicated by the higher "track reuse" in Table 2. Interestingly, The Echo Nest playlister quite nicely reproduces the behavior of real users with respect to the diversity of popularity.

In the lower part of Table 3, we show the results for the kNN method. Note that these statistics are based on a different sample of the playlists than the previous measurement. The reason is that both The Echo Nest and the kNN playlisters cannot produce playlists for all of the first halves provided as seed tracks. We therefore considered only playlists for which the corresponding algorithm could produce a playlist.

Unlike the playlister of The Echo Nest, the kNN method has a strong trend to recommend mostly very popular items. This can be caused by the fact that the kNN method by design recommends tracks that are often found in similar playlists. Moreover, based on the lower half of Table 3, the popularity correlates strongly with the seed track popularity. As a result, the kNN method shows a potentially undesirable trend to reinforce already popular items for everyone. At the same time, it concentrates the track selection on a comparably small number of tracks, as indicated by the very high values for the Gini coefficient.

3.2.2 The role of freshness
Next, we analyzed if there is a tendency of users to create playlists that mainly contain recently released tracks. As a measure, we compared the creation year of each playlist with the average release year of its tracks. We limit our analysis to the last.fm and 8tracks datasets because we could only acquire creation dates for these two.

[Figure 2: Distribution of average freshness of playlists (comparing playlist creation date and track release date); x-axis: average freshness in years, y-axis: relative frequency, for last.fm and 8tracks.]

Figure 2 shows the statistics for both datasets. We organized the data points in bins (x-axis), where each bin represents an average-freshness level, and then counted how many playlists fall into these levels. The relative frequencies are shown on the y-axis. The results are very similar for both datasets, with a slight tendency to include older tracks for last.fm. On both datasets, more than half of the playlists contain tracks that were released on average in the last 5 years, the most frequent average age being between 4 and 5 years for last.fm and between 3 and 4 years for 8tracks. Similarly, on both datasets, more than 75% of the playlists contain tracks that were released on average in the last 8 years.

We also analyzed the standard deviation of the resulting freshness values and observed that more than half of the playlists have a standard deviation of less than 4 years, while more than 75% have a standard deviation of less than 7 years on both datasets. Overall, this suggests that playlists made by users are often homogeneous with regard to the release date.

Computing the freshness for the generated playlists would require configuring the playlisters in such a way that they select only tracks that were not released after the playlists' creation years. Unfortunately, The Echo Nest does not allow such a configuration. Moreover, for the kNN approach, the playlists that are more recent would have to be ignored, which would lead to a too small sample size and not very reliable results anymore.

3.2.3 Homogeneity and diversity
Homogeneity and diversity can be determined in a variety of ways. In the following, we will use simple measures based on artist and genre counts. The genres correspond to the genres of the artists of the tracks retrieved from The Echo Nest. Basic figures for artist and genre diversity are already given in Table 2. On AotM, for example, having several tracks of an artist in a playlist is not very common^6. On last.fm, we in contrast very often see two or more tracks of one artist in a playlist. A similar, very rough estimate can be made for the genre diversity. If we ordered the tracks of a playlist by genre, we would encounter a different genre on last.fm only after having listened to about 10 tracks. On AotM and 8tracks, in contrast, playlists on average cover more genres.

^6 On 8tracks, artist repetitions are limited due to license constraints.

Table 4 shows the diversities of the first and second halves of the hand-crafted playlists, and for the automatic selections using the first halves as seeds. As a measure of diversity, we simply counted the number of artists and genres and divided by the corresponding number of tracks. The values in Table 4 correspond to the averages of these diversity measures.

[Figure 3: Distribution of The Echo Nest track musical features independently of playlists; relative frequencies on a normalized scale for Energy [0,1], Hotttnesss [0,1], Loudness [-100,100], Danceability [0,1] and Tempo [0,500].]
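This per-playlist diversity measure (distinct artists or genres divided by the number of tracks) can be sketched as follows. The playlist representation, with each track carrying an artist name and a set of genre labels, is illustrative only:

```python
def artist_diversity(playlist):
    """Distinct artists divided by the number of tracks.

    `playlist` is assumed to be a list of (artist, genres) pairs,
    where `genres` is a set of genre labels -- a hypothetical
    format chosen for illustration.
    """
    artists = {artist for artist, _ in playlist}
    return len(artists) / len(playlist)

def genre_diversity(playlist):
    """Distinct genres divided by the number of tracks.

    The value can exceed 1, since a single artist may be
    assigned several genres.
    """
    genres = set()
    for _, track_genres in playlist:
        genres.update(track_genres)
    return len(genres) / len(playlist)
```

Averaging these per-playlist values over all playlists of a dataset (or over their first and second halves) yields figures of the kind reported in Table 4.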
Table 4: Diversity of playlists (number of artists and genres divided by the corresponding number of tracks).

        |         | 1st half | 2nd half | tEN
last.fm | artists | 0.74     | 0.76     | 0.93
        | genres  | 2.26     | 2.30     | 2.12
AotM    | artists | 0.93     | 0.93     | 0.94
        | genres  | 3.26     | 3.22     | 2.41
8tracks | artists | 0.97     | 0.98     | 0.99
        | genres  | 3.74     | 3.85     | 2.89

        |         | 1st half | 2nd half | kNN
last.fm | artists | 0.74     | 0.76     | 0.87
        | genres  | 2.32     | 2.26     | 3.11
AotM    | artists | 0.94     | 0.94     | 0.91
        | genres  | 3.27     | 3.21     | 3.70
8tracks | artists | 0.97     | 0.98     | 0.93
        | genres  | 3.94     | 3.92     | 4.06

Regarding the diversity of the hand-crafted playlists, the tables show that users tend to keep the same level of artist and genre diversity throughout the playlists. We can also notice that the playlists of last.fm are much more homogeneous. The diversity values of the automatic selections reveal several things. First, The Echo Nest playlister tends to always maximize the artist diversity independently of the diversity of the seeds; on the contrary, the kNN playlister lowered the initial artist diversities, except on the last.fm dataset, where it increased them, though less than The Echo Nest playlister. Regarding the genre diversity, we can observe an opposite tendency for the two playlisters: The Echo Nest playlister tends to reduce the genre diversity while the kNN playlister tends to increase it. Again, these differences are statistically significant (p < 0.03 for The Echo Nest playlister and p < 0.006 for the kNN playlister). Overall, the resulting diversities of both approaches tend to be rather dissimilar to those of the hand-crafted playlists.

3.2.4 Musical features (The Echo Nest)
Figure 3 shows the overall relative frequency distribution of the numerical features from The Echo Nest listed in Table 1 for the set of tracks appearing in our playlists on a normalized scale. For the loudness feature, for example, we see that most tracks have values between 40 and 50 on the normalized scale. This would translate into an actual loudness value of -20 to 0 returned by The Echo Nest, given that the range is -100 to 100.

To understand if people tend to place tracks with specific feature values into their playlists, we then computed the distribution of the average feature values of each playlist. Figure 4 shows the results of this measurement for the energy and "hotttnesss" features. For all the other features (danceability, loudness and tempo), the distributions were similar to those of Figure 3, which could mean that they are generally not particularly important for the users.

[Figure 4: Distribution of mean energy and "hotttnesss" levels in playlists; relative frequencies for 8tracks, AotM and last.fm.]

When looking at the energy feature, we see that users tend to include tracks from a comparably narrow energy spectrum with a low average energy level, even though there exist more high-energy tracks in general, as shown in Figure 3. A similar phenomenon of concentration on a certain range of values can be observed for the "hotttnesss" feature. As a side aspect, we can observe that the tracks shared on AotM are on average slightly less "hottt" than those of both other platforms^7.

We finally draw our attention to the feature distributions of the generated playlists. Figure 5 as an example shows the distributions of the energy and "hotttnesss" factors for the first halves and second halves of the playlists of all three datasets, together with the distributions of the tracks selected by The Echo Nest and kNN playlisters.

[Figure 5: Comparison of the distribution of energy and "hotttnesss" levels for hand-crafted and generated playlists (1st half, 2nd half, tEN, kNN10).]

The figure shows that The Echo Nest playlister tends to produce a distribution that is quite similar to the distribution of the seed tracks. The kNN playlister, in contrast, tends to concentrate the distributions toward the maximum values of the distributions of the seeds. We could observe this phenomenon of concentration for all the features on all three datasets, except for the danceability on the AotM dataset.

3.2.5 Transitions and Coherence
We now focus on the importance of transitions between the tracks, and define the coherence of a playlist as the average similarity between its consecutive tracks. Such similarities can be computed according to various criteria. We used the binary cosine similarity of the genres and artists^8, and the Euclidean linear similarity for the numerical track features of The Echo Nest. Table 5 shows the corresponding results for the first and second halves of the hand-crafted playlists, and for the automatic selections using the first halves as seeds.

Table 5: Coherence of first, second and generated halves.

        |            | 1st half | 2nd half | tEN
last.fm | artists    | 0.19     | 0.18     | 0
        | genres     | 0.43     | 0.40     | 0.56
        | energy     | 0.76     | 0.71     | 0.77
        | hotttnesss | 0.81     | 0.76     | 0.83
AotM    | artists    | 0.05     | 0.05     | 0
        | genres     | 0.24     | 0.22     | 0.50
        | energy     | 0.75     | 0.74     | 0.75
        | hotttnesss | 0.83     | 0.82     | 0.85
8tracks | artists    | 0.02     | 0.01     | 0
        | genres     | 0.22     | 0.22     | 0.52
        | energy     | 0.73     | 0.71     | 0.76
        | hotttnesss | 0.81     | 0.79     | 0.85

        |            | 1st half | 2nd half | kNN
last.fm | artists    | 0.22     | 0.21     | 0.02
        | genres     | 0.44     | 0.42     | 0.14
        | energy     | 0.76     | 0.76     | 0.75
        | hotttnesss | 0.83     | 0.82     | 0.83
AotM    | artists    | 0.05     | 0.05     | 0.03
        | genres     | 0.22     | 0.21     | 0.13
        | energy     | 0.75     | 0.74     | 0.73
        | hotttnesss | 0.83     | 0.82     | 0.84
8tracks | artists    | 0.02     | 0.01     | 0.03
        | genres     | 0.22     | 0.22     | 0.17
        | energy     | 0.74     | 0.73     | 0.74
        | hotttnesss | 0.82     | 0.80     | 0.84

We can first see that for all datasets and for all criteria, the second halves of the playlists have a lower coherence than the first halves. If we assume that the coherence is representative of the effort of the users to create good playlists, then the tracks of the second halves seem to be slightly less carefully selected than those of the first halves. Another interesting phenomenon is the high artist coherence values on the last.fm dataset. These values indicate that last.fm users have a surprisingly strong tendency to group tracks from the same artist together, which was not successfully reproduced by the two playlisters. Both playlisters actually seem to have a tendency to always produce the same coherence values, independently of the coherence values of the seed. A last interesting result is the high coherence of artist genres on the AotM and 8tracks datasets; the high genre coherence values on last.fm can be explained by the high artist coherence values.

4. STANDARD ACCURACY METRICS
Our analysis so far has revealed some particular characteristics of user-created playlists. Furthermore, we observed that the nearest-neighbor playlisting scheme can produce playlists that are quite different from those generated by the commercial Echo Nest service, e.g., in terms of average track popularity (Table 3). In the research literature, "hit rates" (recall) and the average log-likelihood (ALL) are often used to compare the quality of playlists generated by different algorithms [2, 8, 11].

^7 The results for the "hotttnesss" we report here correspond to the values at the time when we retrieved the data using the API of The Echo Nest, and not to those at the time when the playlists were created. This is not important as we do not look at the distributions independently, but compare them to the distributions in Figure 3.
^8 In the case of artists, this means that the similarity equals
1 if both tracks have the same artist, and 0 else. The metric thus measures the proportion of cases in which the users consecutively selected tracks from the same artist.

The goal of our next experiment was to find out how The Echo Nest playlister performs on these measures. As it is not possible to acquire probability values for the tracks selected by The Echo Nest playlister, the ALL cannot be used^9. In the following, we thus only focus on precision and recall.

The upper part of Figure 6 shows the recall values at list length 100 for the different datasets^10. Again, we split the playlists and used the first half as seed tracks. Recall was then computed by comparing the computed playlists with the "hidden" tracks of the original playlist. We measured recall for tracks, artists, genres and tags. The results show that the kNN method quite clearly outperforms the playlister of The Echo Nest on the recall measures across all datasets, except for the artist recall on the last.fm dataset. The differences are statistically significant for all the experiments except for the track and artist recall on last.fm (p < 10^-6) according to a Student's t-test. As expected, the kNN method leads to higher absolute values for larger datasets as more neighbors can be found.

With respect to the evaluation protocol, note that we only measured precision and recall when the playlister was able to return a playlist continuation given the seed tracks. This was however not always the case for both techniques. In Table 6, we therefore report the detailed coverage figures, which show that the kNN method was more often able to produce a playlist. If recall is measured for all seed playlists, the differences between the algorithms are even larger. When measuring precision for all playlists, the differences between the playlisters become very small.

Table 6: Coverage of the playlisters.

Dataset | tEN   | kNN
last.fm | 28.33 | 66.89
AotM    | 42.75 | 86.52
8tracks | 35.3  | 43.8
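The split-and-compare protocol described above can be sketched as follows. The generator interface and all names are illustrative; like in our experiments, playlists for which the playlister returns no continuation are skipped:

```python
def recall_and_precision(playlists, generate, list_length=100):
    """Seed a playlister with the first half of each playlist and
    score its continuation against the hidden second half.

    `playlists` is a list of track lists; `generate(seed, n)` is any
    playlister returning up to `n` recommended tracks, or None when
    it cannot produce a continuation for the given seed.
    """
    recalls, precisions = [], []
    for playlist in playlists:
        mid = len(playlist) // 2
        seed, hidden = playlist[:mid], playlist[mid:]
        recommended = generate(seed, list_length)
        if not recommended:
            continue  # only the "covered" cases are measured
        hits = len(set(recommended) & set(hidden))
        recalls.append(hits / len(hidden))
        precisions.append(hits / len(recommended))
    # average over the covered playlists
    return (sum(recalls) / len(recalls),
            sum(precisions) / len(precisions))
```

The same scheme extends to artist, genre or tag recall by mapping both the recommended and the hidden tracks to the corresponding meta-data before intersecting.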
The lower part of Figure 6 presents the precision results. The precision values for tracks are, as expected, very low and close to zero, which is caused by the huge set of possible tracks and the list length of 100. We can however observe a higher precision for the kNN method on the AotM dataset (p < 10⁻¹¹), which is the largest dataset. Regarding artist, genre and tag prediction, The Echo Nest playlister led to a higher precision (p < 10⁻³) than the kNN playlister on all datasets.

[Bar charts omitted: track, artist, genre and tag recall (upper part) and precision (lower part) for The Echo Nest playlister and the kNN method on the last.fm, AotM and 8tracks datasets.]
Figure 6: Recall and Precision for the covered cases.

Overall, measuring precision and recall when comparing generated playlists with those provided by users in our view represents only one particular form of assessing the quality of a playlist generator and should be complemented with additional measures. Precision and recall as measured in our experiments for example do not consider track transitions. There is also no "punishment" if a generated playlist contains individual non-fitting tracks that would hurt the listener's overall enjoyment.

⁹Another possible measure is the Mean Reciprocal Rank (MRR). Applied to playlist generation, one limitation of this metric is that it corresponds to the assumption that the rank of the test track or artist to predict should be as high as possible in the recommendation list, although many other tracks or artists may be more relevant and should be ranked before them.
¹⁰We could not measure longer list lengths, as 100 is the maximum playlist length returned by The Echo Nest.

5. PUBLIC AND PRIVATE PLAYLISTS

Some music platforms, and in particular 8tracks, let their users create "private" playlists, which are not visible to others, and public ones that are, for example, shared and used for social interaction like parties, motivation for team sports, or romantic evenings. The question arises whether public playlists have different characteristics than those that were created for personal use only, e.g., because sharing playlists to some extent can also serve the purpose of creating a public image of oneself.

We made an initial analysis on the 8tracks dataset. Table 7 shows the average popularity of the tracks in the 8tracks playlists depending on whether they were in "public" or "private" playlists (the first category contains 2679 playlists and the second 451). As can be seen, the tracks of the private playlists are much more popular on average than the tracks in the public playlists. Moreover, as indicated by the corresponding Gini coefficients, the popular tracks are almost equally distributed across the playlists. Furthermore, Figure 7 shows the corresponding freshness values. We can see that the private playlists generally contained more recent tracks than the public playlists.

                    Play counts   Gini index
Public playlists    870k          0.20
Private playlists   935k          0.06

Table 7: Popularity of tracks in 8tracks public and private playlists and Gini index.

[Histogram omitted: relative frequency of the average track age (0–30 years) for public and private playlists.]
Figure 7: Distribution of average freshness of 8tracks public and private playlists.

These results can be interpreted in at least two different ways. First, users might create some playlists for their personal use to be able to repeatedly listen to the latest popular tracks. They probably do not share these playlists because sharing a list of current top hits might be of limited value for other platform members, who might generally be more interested in discovering not so popular artists and tracks. Second, users might deliberately share playlists with less popular or less known artists and tracks to create a social image on the platform.

Given these first observations, we believe that our approach has some potential to help us better understand some elements of user behavior on social platforms in general, i.e., that people might not necessarily only share tracks that match their actual taste.

6. SUMMARY AND OUTLOOK

The goal of our work is to gain a better understanding of how users create playlists in order to be able to design future playlisting algorithms that take these "natural" characteristics into account. The first results reported in this paper indicate, for example, that features like track freshness, popularity aspects, or homogeneity of the tracks are relevant for users, but not yet fully taken into account by algorithms that are currently considered to create high-quality playlists in the literature. Overall, the observations also indicate that additional metrics might be required to assess the quality of computer-generated playlists in experimental settings that are based on historical data such as existing playlists or listening logs.

Given the richness of the available data, many more analyses are possible. Currently, we are exploring "semantic" characteristics to automatically identify the underlying theme or topic of the playlists. Another aspect not considered so far in our research is the popularity of the playlists. For some music platforms, listening counts and "like" statements for playlists are available. This additional information can be used to further differentiate between "good" and "bad" playlists and help us obtain more fine-granular differences with respect to the corresponding playlist characteristics.

Last, we plan to extend our experiments and analysis by considering other music services, in particular last.fm radios, and other playlisting algorithms, in particular algorithms that exploit content information.

7. REFERENCES

[1] A. Andric and G. Haus. Estimating Quality of Playlists by Sight. In Proc. AXMEDIS, pages 68–74, 2005.
[2] G. Bonnin and D. Jannach. Evaluating the Quality of Playlists Based on Hand-Crafted Samples. In Proc. ISMIR, pages 263–268, 2013.
[3] G. Bonnin and D. Jannach. Automated Generation of Music Playlists: Survey and Experiments. ACM Computing Surveys, 47(2), 2014.
[4] S. Chen, J. L. Moore, D. Turnbull, and T. Joachims. Playlist Prediction via Metric Embedding. In Proc. KDD, pages 714–722, 2012.
[5] S. Cunningham, D. Bainbridge, and A. Falconer. 'More of an Art than a Science': Supporting the Creation of Playlists and Mixes. In Proc. ISMIR, pages 240–245, 2006.
[6] A. Flexer, D. Schnitzer, M. Gasser, and G. Widmer. Playlist Generation Using Start and End Songs. In Proc. ISMIR, pages 173–178, 2008.
[7] D. L. Hansen and J. Golbeck. Mixing It Up: Recommending Collections of Items. In Proc. CHI, pages 1217–1226, 2009.
[8] N. Hariri, B. Mobasher, and R. Burke. Context-Aware Music Recommendation Based on Latent Topic Sequential Patterns. In Proc. RecSys, pages 131–138, 2012.
[9] M. Kamalzadeh, D. Baur, and T. Möller. A Survey on Music Listening and Management Behaviours. In Proc. ISMIR, pages 373–378, 2012.
[10] A. Lehtiniemi and J. Seppänen. Evaluation of Automatic Mobile Playlist Generator. In Proc. MC, pages 452–459, 2007.
[11] B. McFee and G. R. Lanckriet. The Natural Language of Playlists. In Proc. ISMIR, pages 537–542, 2011.
[12] G. Reynolds, D. Barry, T. Burke, and E. Coyle. Interacting With Large Music Collections: Towards the Use of Environmental Metadata. In Proc. ICME, pages 989–992, 2008.
[13] A. M. Sarroff and M. Casey. Modeling and Predicting Song Adjacencies in Commercial Albums. In Proc. SMC, 2012.
[14] M. Slaney and W. White. Measuring Playlist Diversity for Recommendation Systems. In Proc. AMCMM, pages 77–82, 2006.