=Paper= {{Paper |id=None |storemode=property |title= Analyzing the Characteristics of Shared Playlists for Music Recommendation |pdfUrl=https://ceur-ws.org/Vol-1271/Paper1.pdf |volume=Vol-1271 |dblpUrl=https://dblp.org/rec/conf/recsys/JannachKB14 }} == Analyzing the Characteristics of Shared Playlists for Music Recommendation == https://ceur-ws.org/Vol-1271/Paper1.pdf

Analyzing the Characteristics of Shared Playlists for
Music Recommendation

Dietmar Jannach, TU Dortmund, Germany, dietmar.jannach@tu-dortmund.de
Iman Kamehkhosh, TU Dortmund, Germany, iman.kamehkhosh@tu-dortmund.de
Geoffray Bonnin, TU Dortmund, Germany, geoffray.bonnin@tu-dortmund.de

ABSTRACT
The automated generation of music playlists, as supported by modern music services like last.fm or Spotify, represents a special form of music recommendation. When designing a “playlisting” algorithm, the question arises which kind of quality criteria the generated playlists should fulfill and if there are certain characteristics like homogeneity, diversity or freshness that make the playlists generally more enjoyable for the listeners. In our work, we aim to obtain a better understanding of such desired playlist characteristics in order to be able to design better algorithms in the future. The research approach chosen in this work is to analyze several thousand playlists that were created and shared by users on music platforms based on musical and meta-data features. Our first results for example reveal that factors like popularity, freshness and diversity play a certain role for users when they create playlists manually. Comparing such user-generated playlists with automatically created ones moreover shows that today’s online playlisting services sometimes generate playlists which are quite different from user-created ones. Finally, we compare the user-created playlists with playlists generated with a nearest-neighbor technique from the research literature and observe even stronger differences. This last observation can be seen as another indication that the accuracy-based quality measures from the literature are probably not sufficient to assess the effectiveness of playlisting algorithms.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval; H.5.5 [Information Interfaces and Presentation]: Sound and Music Computing

General Terms
Playlist generation, Music recommendation

Keywords
Music, playlist, analysis, algorithm, evaluation

Proceedings of the 6th Workshop on Recommender Systems and the Social Web (RSWeb 2014), collocated with ACM RecSys 2014, 10/06/2014, Foster City, CA, USA. Copyright held by the authors.

1. INTRODUCTION
The automated creation of playlists or personalized radio stations is a typical feature of today’s online music platforms and music streaming services. In principle, standard recommendation algorithms based on collaborative filtering or content-based techniques can be applied to generate a ranked list of musical tracks given some user preferences or past listening history. For several reasons, the generation of playlists however represents a very specific music recommendation problem. Personal playlists are, for example, often created with a certain goal or usage context (e.g., sports, relaxation, driving) in mind. Furthermore, in contrast to relevance-ranked recommendation lists used in other domains, playlists typically obey some homogeneity and coherence criteria, i.e., there are quality characteristics that are related to the transitions between the tracks or to the playlist as a whole.

In the research literature, a number of approaches for the automation of the playlist generation process have been proposed, see, e.g., [2, 6, 8, 10, 11] or the recent survey in [3]. Some of them for example take a seed song or artist as an input and look for similar tracks; others try to find track co-occurrence patterns in existing playlists. In some approaches, playlist generation is considered as an optimization problem. Independent of the chosen technique, a common problem when designing new playlisting algorithms is to assess whether or not the generated playlists will be positively perceived by the listeners. User studies and online experiments are unfortunately particularly costly in the music domain. Researchers therefore often use offline experimental designs and for example use existing playlists shared by users on music platforms as a basis for their evaluations. The assumption is that these “hand-crafted” playlists are of good quality; typical measures used in the literature include the Recall [8] or the Average Log-Likelihood (ALL) [11]. Unfortunately, both measures have their limitations, see also [2]. The Recall measure for example tells us how good an algorithm is at predicting the tracks selected by the users, but does not explicitly capture specific aspects such as the homogeneity or the smoothness of track transitions.

To design better and more comprehensive quality measures, we however first have to answer the question of what users consider to be desirable characteristics of playlists or what the driving principles are when users create playlists. In the literature, a few works have studied this aspect using different approaches, e.g., user studies [1, 7] or analyzing forum posts [5]. The work presented in this paper continues these lines of research. Our research approach is however
different from previous works as we aim to identify patterns in a larger set of manually created playlists that were shared by users of three different online music platforms. To be able to take a variety of potential driving factors into account in our analysis, we have furthermore collected various types of meta-data and musical features of the playlist tracks from public music databases.

Overall, with our analyses we hope to obtain insights on the principles which an automated playlist generation system should observe to end up with better-received or more “natural” playlists. To test if current music services and a nearest-neighbor algorithm from the literature generate playlists that observe the identified patterns and make similar choices as real users, we conducted an experiment in which we analyzed commonalities and differences between automatically generated and user-provided playlists.

Before reporting the details of our first analyses, we will first discuss previous works in the next section.

2. PREVIOUS WORKS
In [14], Slaney and White addressed the question if users have a tendency to create very homogeneous or rather diverse playlists. As a basis for determining the diversity they relied on an objective measure based on genre information about the tracks. Each track was considered as a point in the genre space and the diversity was then determined by calculating the volume of an ellipsoid enclosing the tracks of the playlist. An analysis of 887 user-created playlists indicated that diversity can be considered to be a driving factor as users typically create playlists covering several genres.

Sarroff and Casey more recently [13] focused on track transitions in album playlists and made an analysis to determine if there are certain musical characteristics that are particularly important. One of the results of their investigation was that fade durations and the mean timbre of the beginnings and endings of consecutive tracks seem to have a strong influence on the ordering of the tracks.

Generally, our work is similar to [14] and [13] in that we rely on user-created (“hand-crafted”) playlists and look at meta-data and musical features of the tracks to identify potentially important patterns. The aspects we cover in this paper were however not covered in their work and our analysis is based on larger datasets.

Cunningham et al. [5], in contrast, relied on another form of track-related information and looked at the user posts in the forum of the Art of the Mix web site. According to their analysis, the typical principles for setting up the playlists mentioned by the creators were related to the artist, genre, style, event or activity but also the intended purpose, context or mood. Some users also talked about the smoothness of track transitions and how many tracks of one single artist should be included in playlists. Placing the most “important” track at the end of a playlist was another strategy mentioned by some of the playlist creators.

A different form of identifying playlist creation principles is to conduct laboratory studies with users. The study reported in [7] for example involved 52 subjects and indicated that the first and the last tracks can play an important role for the quality of a playlist. In another study, Andric and Haus [1] concluded that the ordering of tracks is not important when the playlist mainly contains tracks which the users like in general.

Reynolds et al. [12] conducted an online survey that revealed that the context and environment, like the location, the activity or the weather, can have an influence both on the listeners’ mood and on the track selection behavior of playlist creators. Finally, the study presented in [9] again confirmed the importance of artists, genres and mood in the playlist creation process.

In this discussion, we have focused on previous attempts to understand how users create playlists and what their characteristics are. Playlist generation algorithms however do not necessarily have to rely on such knowledge. Instead, one can follow a statistical approach and only look at co-occurrences and transitions of tracks in existing playlists and use these patterns when creating new playlists, see e.g., [2] or [4]. This way, the quality factors respected by human playlist creators are implicitly taken into account. Such approaches, however, cannot be directly applied for many types of playlist generation settings, e.g., for creating “thematic” playlists (e.g., Christmas Songs) or for creating playlists that only contain tracks that have certain musical features. Pure statistical methods are not aware of these characteristics and the danger exists that tracks are included that do not match the purpose of the list and thus lead to a limited overall quality.

3. CHARACTERISTICS OF PLAYLISTS
The ultimate goal of our research is to analyze the structure and characteristics of playlists in order to better understand the principles used by the users to create them. This section is a first step toward this goal.

3.1 Data sources
As a basis for the first analyses that we report in this paper, we used two types of playlist data.

3.1.1 Hand-crafted playlists
We used samples of hand-crafted playlists from three different sources. One set of playlists was retrieved via the public API of last.fm¹, one was taken from the Art of the Mix (AotM) website², and a third one was provided to us by 8tracks³. To enhance the data quality, we corrected artist misspellings using the API of last.fm.

¹ http://www.last.fm
² http://www.artofthemix.org
³ http://8tracks.com

Overall, we analyzed over 10,000 playlists containing about 108,000 different tracks of about 40,000 different artists. As a first attempt toward our goal, we retrieved the features listed in Table 1 using the public APIs of last.fm and The Echo Nest (tEN), and the MusicBrainz (MB) database.

Some dataset characteristics are shown in Table 2. The “usage count” statistics express how often tracks and artists appeared overall in the playlists. When selecting the playlists, we made sure that they do not simply contain album listings. The datasets are partially quite different, e.g., with respect to the average playlist lengths. The 8tracks dataset furthermore has the particularity that users are not allowed to include more than two tracks of one artist, in case they want to share their playlist with others.

Figure 1 shows the distributions of playlist lengths. As can be seen, the distributions are quite different across the datasets. On 8tracks, a playlist generally has to comprise
at least 8 tracks. The lengths of the last.fm playlists seem to follow a normal distribution with a maximum frequency value at around 20 tracks. Finally, the sizes of the AotM playlists are much more equally distributed.

Source    Information    Description
last.fm   Tags           Top tags assigned by users to the track.
last.fm   Playcounts     Total number of times the users played the track.
tEN       Genres         Genres of the artist of the track. Multiple genres can be assigned to a single artist.
tEN       Danceability   Suitability of the track for dancing, based on various information including the beat strength and the stability of the tempo.
tEN       Energy         Intensity released throughout the track, based on various information including the loudness and segment durations.
tEN       Loudness       Overall loudness of the track in decibels (dB).
tEN       Tempo          Speed of the track estimated in beats per minute (BPM).
tEN       Hotttnesss     Current reputation of the track based on its activity on some web sites crawled by the developers.
MB        Release year   Year of release of the corresponding album.

Table 1: Additional retrieved information.

3.1.2 Generated playlists
To assess if the playlists generated by today’s online services are similar to those created by users, we used the public API of The Echo Nest. We chose this service because it uses a very large database and allows the generation of playlists from several seed tracks, as opposed to, for instance, iTunes Genius or last.fm radios. We split the existing hand-crafted playlists in half, provided the first half of the list as seed tracks to the music service and then analyzed the characteristics of the playlist returned by The Echo Nest and compared them to the patterns that we found in hand-crafted playlists. Instead of observing whether a playlister generates playlists that are generally similar to playlists created by hand, our goal here is to break down their different characteristics and observe on what specific dimensions they differ. Notice that using the second half as seed would not be appropriate as the order of the tracks may be important.

We also turn our attention to the ability of algorithms from the literature to reproduce the characteristics of hand-crafted playlists. According to some recent research, one of the most competitive approaches in terms of recall is the simple k-nearest-neighbors (kNN) method [2, 8]. More precisely, given some seed tracks, the algorithm extracts the k most similar playlists based on the number of shared items and recommends the tracks of these playlists. This algorithm does not require a training step and scans the entire set of available playlists for each recommendation.
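The nearest-neighbor scheme described above can be illustrated with a minimal Python sketch. It is not the implementation used in the paper: the function name `knn_continuation`, the toy data, and the way candidate tracks are scored (summed overlap of the neighbors containing them) are assumptions made only for this example. The text only states that the k playlists sharing the most items with the seeds are retrieved and that their tracks are recommended.

```python
from collections import Counter


def knn_continuation(seed_tracks, playlists, k=10, n=10):
    """Return up to n tracks taken from the k playlists most similar to the seeds."""
    seeds = set(seed_tracks)
    # Similarity of a stored playlist to the seeds = number of shared tracks.
    overlaps = [(len(seeds & set(p)), idx) for idx, p in enumerate(playlists)]
    # Keep the k playlists with the largest overlap; playlists sharing nothing are ignored.
    neighbors = sorted((o for o in overlaps if o[0] > 0),
                       key=lambda o: (-o[0], o[1]))[:k]
    # Assumption of this sketch: a candidate's score is the summed overlap of the
    # neighbors that contain it; ties are broken alphabetically.
    scores = Counter()
    for overlap, idx in neighbors:
        for track in set(playlists[idx]) - seeds:
            scores[track] += overlap
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [track for track, _ in ranked[:n]]


# Hypothetical toy data: every full scan of `playlists` mirrors the "no training step" property.
toy_playlists = [
    ["a", "b", "c", "d"],
    ["a", "b", "e"],
    ["x", "y", "z"],
    ["b", "c", "e", "f"],
]
print(knn_continuation(["a", "b"], toy_playlists, k=2, n=3))  # ['c', 'd', 'e']
print(knn_continuation(["q"], toy_playlists))                  # []
```

The second call returns an empty list because no stored playlist shares a track with the seed, which is one way a neighborhood method can fail to produce a continuation for some seeds.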

                          last.fm   AotM     8tracks
Playlists                 1,172     5,043    3,130
Tracks                    24,754    61,935   29,732
Artists                   9,925     23,029   13,379
Avg. tracks/playlist      26.0      19.7     12.5
Avg. artists/playlist     16.8      17.8     11.5
Avg. genres/playlist      2.7       3.5      3.4
Avg. tags/playlist        473.4     418.7    297.4
Avg. track usage count    1.2       1.6      1.3
Avg. artist usage count   3.0       4.3      2.9

Table 2: Some basic statistics of the datasets.

3.2 Detailed observations
In the following sections, we will look at general distributions of different track characteristics.

3.2.1 Popularity of tracks
The goal of the first analysis here is to determine if users tend to position tracks in playlists depending on their popularity. In our analysis, we measure the popularity in terms of play counts. Play counts were taken from last.fm, because this is one of the most popular services and the corresponding values can be considered indicative for a larger user group.

For the measurement, we split the playlists into two parts of equal size and then determined the average play counts on last.fm for the tracks for each half. To measure to which extent the user community favors certain tracks in the playlists, we calculated the Gini index, a standard measure of inequality⁴. Table 3 shows the results. In the last column, we report the statistics for the tracks returned by The Echo Nest (tEN) and kNN playlisters⁵. We provided the first half of the hand-crafted playlists as seed tracks and the playlisters had to select the same number of tracks as the number of remaining tracks.
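The measurement described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, the handling of odd-length playlists is a choice made for the example, and the paper does not spell out the exact quantity the Gini index is computed over, so `gini` simply takes any list of non-negative values (for example, binned play counts).

```python
def split_halves(playlist):
    """Split a playlist into first and second half (the extra track of an
    odd-length list goes to the second half; this choice is an assumption)."""
    mid = len(playlist) // 2
    return playlist[:mid], playlist[mid:]


def mean_play_count(tracks, play_counts):
    """Average play count over the tracks for which a count is known."""
    known = [play_counts[t] for t in tracks if t in play_counts]
    return sum(known) / len(known) if known else 0.0


def gini(values):
    """Gini coefficient of non-negative values: 0 = evenly spread, higher = more concentrated."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Closed form for ascending x_1..x_n: G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


play_counts = {"a": 100, "b": 300, "c": 50, "d": 150}  # hypothetical values
first, second = split_halves(["a", "b", "c", "d"])
print(mean_play_count(first, play_counts))   # 200.0
print(mean_play_count(second, play_counts))  # 100.0
print(gini([1, 1, 1, 1]))                    # 0.0
print(gini([0, 0, 0, 4]))                    # 0.75
```

With four values, 0.75 is the largest coefficient this formula can return, which is why the index is best compared between halves of the same dataset rather than across datasets of different sizes.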
[Figure 1: Distribution of playlist sizes. Frequencies (y-axis, 0 to 1200) over playlist sizes (x-axis, 2 to 50) for the AotM, last.fm and 8tracks datasets.]

⁴ We organized the average play counts in 100 bins.
⁵ We determined 10 as the best neighborhood size for our data sets based on the recall value, see Section 4.

The results show that users actually tend to place more popular items in the first part of the list in all datasets, when play counts are considered. The Echo Nest playlister does not seem to take that form of popularity into account
and recommends on average less popular tracks. These differences are statistically significant according to a Student’s t-test (p < 10⁻⁵ for The Echo Nest playlister and p < 10⁻⁷ for the kNN playlister). This behavior also indicates that The Echo Nest is successfully replicating the fact that the second halves of playlists are supposed to be less popular than the first half.

Play counts   1st half   2nd half   tEN
last.fm       1,007k     893k       629k
AotM          671k       638k       606k
8tracks       953k       897k       659k

Gini index    1st half   2nd half   tEN
last.fm       0.06       0.04       0.04
AotM          0.20       0.18       0.22
8tracks       0.09       0.09       0.08

Play counts   1st half   2nd half   kNN
last.fm       1,110k     943k       1,499k
AotM          645k       617k       867k
8tracks       1,008k     984k       1,140k

Gini index    1st half   2nd half   kNN
last.fm       0.12       0.09       0.33
AotM          0.26       0.23       0.43
8tracks       0.15       0.12       0.28

Table 3: Popularity of tracks in playlists (last.fm play counts) and concentration bias (Gini coefficient).

The Gini index reveals that there is a slightly stronger concentration on some tracks in the first half for two of three datasets and the diversity slightly increases in the second part. The absolute numbers cannot be directly compared across datasets, but for the AotM dataset the concentration is generally much higher, which is also indicated by the higher “track reuse” in Table 2. Interestingly, The Echo Nest playlister quite nicely reproduces the behavior of real users with respect to the diversity of popularity.

In the lower part of Table 3, we show the results for the kNN method. Note that these statistics are based on a different sample of the playlists than the previous measurement. The reason is that both The Echo Nest and the kNN playlisters cannot produce playlists for all of the first halves provided as seed tracks. We therefore considered only playlists for which the corresponding algorithm could produce a playlist.

Unlike the playlister of The Echo Nest, the kNN method has a strong trend to recommend mostly very popular items. This can be caused by the fact that the kNN method by design recommends tracks that are often found in similar playlists. Moreover, based on the lower half of Table 3, the popularity correlates strongly with the seed track popularity. As a result, the kNN shows a potentially undesirable trend to reinforce already popular items to everyone. At the same time, it concentrates the track selection on a comparably small number of tracks as indicated by the very high value for the Gini coefficient.

3.2.2 The role of freshness
Next, we analyzed if there is a tendency of users to create playlists that mainly contain recently released tracks. As a measure, we compared the creation year of each playlist with the average release year of its tracks. We limit our analysis to the last.fm and 8tracks datasets because we could only acquire creation dates for these two.

[Figure 2: Distribution of average freshness of playlists (comparing playlist creation date and track release date). Relative frequency (y-axis, 0 to 0.18) over the average freshness of playlists in years (x-axis, 0 to 30) for the 8tracks and last.fm datasets.]

Figure 2 shows the statistics for both datasets. We organized the data points in bins (x-axis), where each bin represents an average-freshness level, and then counted how many playlists fall into these levels. The relative frequencies are shown on the y-axis. The results are very similar for both datasets, with a slight tendency to include older tracks for last.fm. On both datasets, more than half of the playlists contain tracks that were released on average in the last 5 years, the most frequent average age being between 4 and 5 years for last.fm and between 3 and 4 years for 8tracks. Similarly, on both datasets, more than 75% of the playlists contain tracks that were released on average in the last 8 years.

We also analyzed the standard deviation of the resulting freshness values and observed that more than half of the playlists have a standard deviation of less than 4 (years), while more than 75% have a standard deviation of less than 7 (years) on both datasets. Overall, this suggests that playlists made by users are often homogeneous with regard to the release date.

Computing the freshness for the generated playlists would require configuring the playlisters in such a way that they select only tracks that were not released after the playlists’ creation years. Unfortunately, The Echo Nest does not allow such a configuration. Moreover, for the kNN approach, the playlists that are more recent would have to be ignored, which would lead to too small a sample size and to results that are no longer reliable.

3.2.3 Homogeneity and diversity
Homogeneity and diversity can be determined in a variety of ways. In the following, we will use simple measures based on artist and genre counts. The genres correspond to the genres of the artists of the tracks retrieved from The Echo Nest. Basic figures for artist and genre diversity are already given in Table 2. On AotM, for example, having several tracks of an artist in a playlist is not very common⁶.

⁶ On 8tracks, artist repetitions are limited due to license constraints.

On last.fm, we in contrast very often see two or more tracks of
one artist in a playlist. A similar, very rough estimate can be made for the genre diversity. If we ordered the tracks of a playlist by genre, we would encounter a different genre on last.fm only after having listened to about 10 tracks. On AotM and 8tracks, in contrast, playlists on average cover more genres.

[Figure 3: Distribution of The Echo Nest track musical features independently of playlists. Relative frequency (y-axis, 0 to 0.25) over a normalized scale (x-axis, 0 to 100) for Energy [0,1], Hotttnesss [0,1], Loudness [-100,100], Danceability [0,1] and Tempo [0,500].]

[Figure 4: Distribution of mean energy and “hotttnesss” levels in playlists. Relative frequency (y-axis, 0 to 0.16) over energy and hotttnesss values (x-axis, 0 to 1), with one energy series and one hotttnesss series each for 8tracks, AotM and last.fm.]

Table 4 shows the diversities of the first and second halves of the hand-crafted playlists, and for the automatic selections using the first halves as seeds. As a measure of diversity, we simply counted the number of artists and genres and divided by the corresponding number of tracks. The values in Table 4 correspond to the averages of these diversity measures.

                    1st half   2nd half   tEN
last.fm   artists   0.74       0.76       0.93
          genres    2.26       2.30       2.12
AotM      artists   0.93       0.93       0.94
          genres    3.26       3.22       2.41
8tracks   artists   0.97       0.98       0.99
          genres    3.74       3.85       2.89

                    1st half   2nd half   kNN
last.fm   artists   0.74       0.76       0.87
          genres    2.32       2.26       3.11
AotM      artists   0.94       0.94       0.91
          genres    3.27       3.21       3.70
8tracks   artists   0.97       0.98       0.93
          genres    3.94       3.92       4.06

Table 4: Diversity of playlists (Number of artists and genres divided by the corresponding number of tracks).
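The diversity measure behind Table 4 (number of artists or genres divided by the number of tracks, averaged over playlists) can be sketched in Python as below. The track representation and the function names are hypothetical, and counting distinct genres across the genre lists of all tracks is an assumption of this sketch; it is at least consistent with genre values above 1 in Table 4, since an artist can carry several genres.

```python
def diversity(tracks, values_of):
    """Distinct values (artists or genres) in a track list divided by its length."""
    if not tracks:
        return 0.0
    distinct = set()
    for track in tracks:
        distinct.update(values_of(track))
    return len(distinct) / len(tracks)


def mean_diversity(playlists, values_of):
    """Average of the per-playlist diversity values, skipping empty playlists."""
    scores = [diversity(p, values_of) for p in playlists if p]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical playlist: artist A appears twice, and artists can carry several genres.
playlist = [
    {"artist": "A", "genres": ["rock", "indie"]},
    {"artist": "A", "genres": ["rock", "indie"]},
    {"artist": "B", "genres": ["pop"]},
    {"artist": "C", "genres": ["jazz", "soul"]},
]
artist_diversity = diversity(playlist, lambda t: [t["artist"]])  # 3 artists / 4 tracks
genre_diversity = diversity(playlist, lambda t: t["genres"])     # 5 genres / 4 tracks
print(artist_diversity, genre_diversity)  # 0.75 1.25
```

An artist diversity of 1.0 means that no artist is repeated, which is the situation the 8tracks sharing rules push playlists toward.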
Regarding the diversity of the hand-crafted playlists, the tables show that users tend to keep the same level of artist and genre diversity throughout the playlists. We can also notice that the playlists of last.fm are much more homogeneous.

The diversity values of the automatic selections reveal several things. First, The Echo Nest playlister tends to always maximize the artist diversity independently of the diversity of the seeds; on the contrary, the kNN playlister lowered the initial artist diversities, except on the last.fm dataset, where it increased them, though less than The Echo Nest playlister. Regarding the genre diversity, we can observe opposite tendencies for the two playlisters: The Echo Nest playlister tends to reduce the genre diversity while the kNN playlister tends to increase it. Again, these differences are statistically significant (p < 0.03 for The Echo Nest playlister and p < 0.006 for the kNN playlister). Overall, the resulting diversities of both approaches tend to be rather dissimilar to those of the hand-crafted playlists.

3.2.4 Musical features (The Echo Nest)
Figure 3 shows the overall relative frequency distribution of the numerical features from The Echo Nest listed in Table 1 for the set of tracks appearing in our playlists on a normalized scale. For the loudness feature, for example, we see that most tracks have values between 40 and 50 on the normalized scale. This would translate into an actual loudness value of -20 to 0 returned by The Echo Nest, given that the range is -100 to 100.

To understand if people tend to place tracks with specific feature values into their playlists, we then computed the distribution of the average feature values of each playlist. Figure 4 shows the results of this measurement for the energy and “hotttnesss” features. For all the other features (danceability, loudness and tempo), the distributions were similar to those of Figure 3, which could mean that they are generally not particularly important for the users.

When looking at the energy feature, we see that users tend to include tracks from a comparably narrow energy spectrum with a low average energy level, even though there exist more high-energy tracks in general as shown in Figure 3. A similar phenomenon of concentration on a certain range of values can be observed for the “hotttnesss” feature. As a side aspect, we can observe that the tracks shared on AotM are on average slightly less “hottt” than those of both other platforms⁷.

⁷ The results for the “hotttnesss” we report here correspond to the values at the time when we retrieved the data using the API of The Echo Nest, and not to those at the time when the playlists were created. This is not important as we do not look at the distributions independently, but compare them to the distributions in Figure 3.

We finally turn our attention to the feature distributions of the generated playlists. Figure 5 as an example shows the distributions of the energy and “hotttnesss” factors for
the first halves and second halves of the playlists of all three datasets, together with the distributions of the tracks selected by The Echo Nest and kNN playlisters.

[Figure 5: Comparison of the distribution of energy and “hotttnesss” levels for hand-crafted and generated playlists. Two panels, Energy (upper) and Hotttnesss (lower), each showing relative frequency over feature values from 0 to 1 for the series 1st half, 2nd half, tEN and kNN10.]

                       1st half   2nd half   tEN
last.fm   artists      0.19       0.18       0
          genres       0.43       0.40       0.56
          energy       0.76       0.71       0.77
          hotttnesss   0.81       0.76       0.83
AotM      artists      0.05       0.05       0
          genres       0.24       0.22       0.50
          energy       0.75       0.74       0.75
          hotttnesss   0.83       0.82       0.85
8tracks   artists      0.02       0.01       0
          genres       0.22       0.22       0.52
          energy       0.73       0.71       0.76
          hotttnesss   0.81       0.79       0.85

                       1st half   2nd half   kNN
last.fm   artists      0.22       0.21       0.02
          genres       0.44       0.42       0.14
          energy       0.76       0.76       0.75
          hotttnesss   0.83       0.82       0.83
AotM      artists      0.05       0.05       0.03
          genres       0.22       0.21       0.13
          energy       0.75       0.74       0.73
          hotttnesss   0.83       0.82       0.84
8tracks   artists      0.02       0.01       0.03
          genres       0.22       0.22       0.17
          energy       0.74       0.73       0.74
          hotttnesss   0.82       0.80       0.84

Table 5: Coherence of first, second and generated halves.
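The coherence values reported in Table 5 are defined in Section 3.2.5 as the average similarity between consecutive tracks. The following Python sketch shows one way such values could be computed. It rests on assumptions that the text does not settle: the function names and track fields are hypothetical, and the "Euclidean linear similarity" is interpreted here as one minus the absolute difference of a feature, scaled by the feature's value range.

```python
import math


def binary_cosine(a, b):
    """Cosine similarity of two binary vectors given as sets of active items."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def linear_similarity(x, y, value_range=1.0):
    """Assumed reading of the numerical similarity: 1 - |x - y| / range."""
    return 1.0 - abs(x - y) / value_range


def coherence(tracks, similarity):
    """Average similarity between each track and its successor."""
    pairs = list(zip(tracks, tracks[1:]))
    if not pairs:
        return 0.0
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


# Hypothetical playlist with one artist repetition.
playlist = [
    {"artist": "A", "genres": ["rock", "indie"], "energy": 0.8},
    {"artist": "A", "genres": ["rock"], "energy": 0.6},
    {"artist": "B", "genres": ["pop"], "energy": 0.5},
]
artist_coh = coherence(playlist, lambda s, t: binary_cosine([s["artist"]], [t["artist"]]))
genre_coh = coherence(playlist, lambda s, t: binary_cosine(s["genres"], t["genres"]))
energy_coh = coherence(playlist, lambda s, t: linear_similarity(s["energy"], t["energy"]))
print(artist_coh, round(genre_coh, 3), round(energy_coh, 3))
```

For single artists the binary cosine reduces to 1 for a repeated artist and 0 otherwise, so the artist coherence of this toy playlist is the share of transitions that stay with the same artist.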
The figure shows that The Echo Nest playlister tends to produce a distribution that is quite similar to the distribution of the seed tracks. The kNN playlister, in contrast, tends to concentrate the distributions toward the maximum values of the distributions of the seeds. We could observe this phenomenon of concentration for all the features on all three datasets, except for the danceability on the AotM dataset.

3.2.5 Transitions and Coherence
We now focus on the importance of transitions between the tracks, and define the coherence of a playlist as the average similarity between its consecutive tracks. Such similarities can be computed according to various criteria. We used the binary cosine similarity of the genres and artists⁸, and the Euclidean linear similarity for the numerical track features of The Echo Nest. Table 5 shows the corresponding results for the first and second halves of the hand-crafted playlists, and for the automatic selections using the first halves as seeds.

⁸ In the case of artists, this means that the similarity equals 1 if both tracks have the same artist, and 0 else. The metric thus measures the proportion of cases when the users consecutively selected tracks from the same artist.

We can first see that for all datasets and for all criteria, the second halves of the playlists have a coherence no higher than that of the first halves, and in most cases a lower one. If we assume that the coherence is representative of the effort of the users to create good playlists, then the tracks of the second halves seem to be slightly less carefully selected than those of the first halves.

Another interesting phenomenon are the high artist coherence values on the last.fm dataset. These values indicate that last.fm users have a surprisingly strong tendency to group tracks from the same artist together, which was not successfully reproduced by the two playlisters. Both playlisters actually seem to have a tendency to always produce the same coherence values, independently of the coherence values of the seed. A last interesting result is the high coherence of artist genres on the AotM and 8tracks datasets; the high genre coherence values on last.fm can be explained by the high artist coherence values.

4. STANDARD ACCURACY METRICS
Our analysis so far has revealed some particular characteristics of user-created playlists. Furthermore, we observed that the nearest-neighbor playlisting scheme can produce playlists that are quite different from those generated by the commercial Echo Nest service, e.g., in terms of average track popularity (Table 3).

In the research literature, “hit rates” (recall) and the average log-likelihood (ALL) are often used to compare the quality of playlists generated by different algorithms [2, 8, 11]. The goal of our next experiment was to find out how The Echo Nest playlister performs on these measures. As it is not possible to acquire probability values for the tracks selected by The Echo Nest playlister, the ALL cannot be
used.⁹ In the following, we thus focus only on precision and recall.
The upper part of Figure 6 shows the recall values at list length 100
for the different datasets.¹⁰ Again, we split the playlists and used the
first half as seed tracks. Recall was then computed by comparing the
generated playlists with the “hidden” tracks of the original playlist.
We measured recall for tracks, artists, genres and tags. The results
show that the kNN method quite clearly outperforms the playlister of
The Echo Nest on the recall measures across all datasets except for the
artist recall for the last.fm dataset. According to a Student’s t-test,
the differences are statistically significant (p < 10⁻⁶) for all the
experiments except for the track and artist recall on last.fm. As
expected, the kNN method leads to higher absolute values for larger
datasets as more neighbors can be found.
With respect to the evaluation protocol, note that we only measured
precision and recall when the playlister was able to return a playlist
continuation given the seed tracks. This was however not always the
case for either technique. In Table 6, we therefore report the detailed
coverage figures, which show that the kNN method was more often able to
produce a playlist. If recall is measured for all seed playlists, the
differences between the algorithms are even larger. When measuring
precision for all playlists, the differences between the playlisters
become very small.

Dataset    tEN      kNN
last.fm    28.33    66.89
AotM       42.75    86.52
8tracks    35.3     43.8

Table 6: Coverage of the playlisters (tEN: The Echo Nest).
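The evaluation protocol described in this section (first half of each
playlist as seed, second half hidden, recall and precision averaged only
over the covered cases, coverage reported separately) can be sketched as
follows. This is a minimal illustration under stated assumptions and not
the implementation used in our experiments: all function names are
hypothetical, `playlister` is a stand-in callable for any playlist
generator (it is not an actual API of The Echo Nest), and precision is
assumed to be normalized by the number of distinct items actually
returned. The same functions apply to artists, genres or tags if the
track lists are first mapped to the corresponding values.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple


def split_playlist(playlist: Sequence[str]) -> Tuple[List[str], List[str]]:
    """Split a playlist into seed items (first half) and hidden items (second half)."""
    middle = len(playlist) // 2
    return list(playlist[:middle]), list(playlist[middle:])


def recall_and_precision(
    generated: Sequence[str], hidden: Sequence[str], list_length: int = 100
) -> Tuple[float, float]:
    """Compare the top `list_length` generated items with the hidden items."""
    top = set(generated[:list_length])
    hidden_set = set(hidden)
    hits = len(top & hidden_set)
    recall = hits / len(hidden_set) if hidden_set else 0.0
    # Assumption: precision is normalized by the number of returned items.
    precision = hits / len(top) if top else 0.0
    return recall, precision


def evaluate(
    playlists: Sequence[Sequence[str]],
    playlister: Callable[[List[str]], Optional[Sequence[str]]],
    list_length: int = 100,
) -> Dict[str, float]:
    """Average recall and precision over covered cases and report coverage."""
    recalls: List[float] = []
    precisions: List[float] = []
    for playlist in playlists:
        seed, hidden = split_playlist(playlist)
        generated = playlister(seed)
        if not generated:
            # Uncovered case: no continuation could be produced for this seed.
            continue
        recall, precision = recall_and_precision(generated, hidden, list_length)
        recalls.append(recall)
        precisions.append(precision)
    covered = len(recalls)
    return {
        "coverage": covered / len(playlists) if playlists else 0.0,
        "recall": sum(recalls) / covered if covered else 0.0,
        "precision": sum(precisions) / covered if covered else 0.0,
    }
```

Measuring recall and precision over all seed playlists instead of only
the covered ones would correspond to counting each uncovered case as
zero in this sketch, which is why the two ways of averaging can lead to
different comparisons between playlisters.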
[Figure 6, two bar charts for the Echo Nest and kNN10 playlisters on the
last.fm, AotM and 8tracks datasets. Upper panel: track recall, artist
recall, genre recall and tag recall (axis from 0 to 0.8). Lower panel:
track precision, artist precision, genre precision and tag precision
(axis from 0 to 0.5).]

Figure 6: Recall and Precision for the covered cases.

The lower part of Figure 6 presents the precision results. The
precision values for tracks are, as expected, very low and close to
zero, which is caused by the huge set of possible tracks and the list
length of 100. We can however observe a higher precision for the kNN
method on the AotM dataset (p < 10⁻¹¹), which is the largest dataset.
Regarding artist, genre and tag prediction, The Echo Nest playlister
led to a higher precision (p < 10⁻³) than the kNN playlister on all
datasets.

⁹ Another possible measure is the Mean Reciprocal Rank (MRR). Applied to
playlist generation, one limitation of this metric is that it
corresponds to the assumption that the rank of the test track or artist
to predict should be as high as possible in the recommendation list,
although many other tracks or artists may be more relevant and should be
ranked before it.
¹⁰ We could not measure longer list lengths as 100 is the maximum
playlist length returned by The Echo Nest.

Overall, measuring precision and recall when comparing generated
playlists with those provided by users in our view represents only one
particular form of assessing the quality of a playlist generator and
should be complemented with additional measures. Precision and recall
as measured in our experiments, for example, do not consider track
transitions. There is also no “punishment” if a generated playlist
contains individual non-fitting tracks that would hurt the listener’s
overall enjoyment.

5. PUBLIC AND PRIVATE PLAYLISTS
Some music platforms, and in particular 8tracks, let their users create
“private” playlists, which are not visible to others, and public ones,
which for example are shared and used for social interaction such as
parties, motivation for team sports or romantic evenings. The question
arises whether public playlists have different characteristics than
those that were created for personal use only, e.g., because sharing
playlists to some extent can also serve the purpose of creating a
public image of oneself.
We made an initial analysis on the 8tracks dataset. Table 7 shows the
average popularity of the tracks in the 8tracks playlists depending on
whether they were in “public” or “private” playlists (the first
category contains 2679 playlists and the second 451). As can be seen,
the tracks of the private playlists are much more popular on average
than the tracks in the public playlists. Moreover, as indicated by the
corresponding Gini coefficients, the popular tracks are almost equally
distributed across the playlists. Furthermore, Figure 7 shows the
corresponding freshness values. We can see that the private playlists
generally contained more recent tracks than public playlists.

                     Play counts    Gini index
Public playlists     870k           0.20
Private playlists    935k           0.06

Table 7: Popularity of tracks in 8tracks public and private playlists
and Gini index.

These results can be interpreted in at least two different ways. First,
users might create some playlists for their personal use to be able to
repeatedly listen to the latest popular tracks. They probably do not
share these playlists because
sharing a list of current top hits might be of limited value for other
platform members who might be generally more interested in discovering
not so popular artists and tracks. Second, users might deliberately
share playlists with less popular or less known artists and tracks to
create a social image on the platform.

[Figure 7, histogram. x-axis: average freshness of 8tracks playlists
(years), from 0 to 30; y-axis: relative frequency, from 0 to 0.16;
series: public playlists and private playlists.]

Figure 7: Distribution of average freshness of 8tracks public and
private playlists.

Given these first observations, we believe that our approach has some
potential to help us better understand some elements of user behavior
on social platforms in general, i.e., that people might not necessarily
only share tracks that match their actual taste.

6. SUMMARY AND OUTLOOK
The goal of our work is to gain a better understanding of how users
create playlists in order to be able to design future playlisting
algorithms that take these “natural” characteristics into account. The
first results reported in this paper indicate, for example, that
features like track freshness, popularity aspects, or homogeneity of
the tracks are relevant for users, but not yet fully taken into account
by current algorithms that are considered to create high-quality
playlists in the literature. Overall, the observations also indicate
that additional metrics might be required to assess the quality of
computer-generated playlists in experimental settings that are based on
historical data such as existing playlists or listening logs.
Given the richness of the available data, many more analyses are
possible. Currently, we are exploring “semantic” characteristics to
automatically identify the underlying theme or topic of the playlists.
Another aspect not considered so far in our research is the popularity
of the playlists. For some music platforms, listening counts and “like”
statements for playlists are available. This additional information can
be used to further differentiate between “good” and “bad” playlists and
help us obtain more fine-grained differences with respect to the
corresponding playlist characteristics.
Last, we plan to extend our experiments and analysis by considering
other music services, in particular last.fm radios, and other
playlisting algorithms, in particular algorithms that exploit content
information.

7. REFERENCES
[1] A. Andric and G. Haus. Estimating Quality of Playlists by Sight. In
Proc. AXMEDIS, pages 68–74, 2005.
[2] G. Bonnin and D. Jannach. Evaluating the Quality of Playlists Based
on Hand-Crafted Samples. In Proc. ISMIR, pages 263–268, 2013.
[3] G. Bonnin and D. Jannach. Automated generation of music playlists:
Survey and experiments. ACM Computing Surveys, 47(2), 2014.
[4] S. Chen, J. L. Moore, D. Turnbull, and T. Joachims. Playlist
Prediction via Metric Embedding. In Proc. KDD, pages 714–722, 2012.
[5] S. Cunningham, D. Bainbridge, and A. Falconer. ‘More of an Art than
a Science’: Supporting the Creation of Playlists and Mixes. In Proc.
ISMIR, pages 240–245, 2006.
[6] A. Flexer, D. Schnitzer, M. Gasser, and G. Widmer. Playlist
Generation Using Start and End Songs. In Proc. ISMIR, pages 173–178,
2008.
[7] D. L. Hansen and J. Golbeck. Mixing It Up: Recommending Collections
of Items. In Proc. CHI, pages 1217–1226, 2009.
[8] N. Hariri, B. Mobasher, and R. Burke. Context-Aware Music
Recommendation Based on Latent Topic Sequential Patterns. In Proc.
RecSys, pages 131–138, 2012.
[9] M. Kamalzadeh, D. Baur, and T. Möller. A Survey on Music Listening
and Management Behaviours. In Proc. ISMIR, pages 373–378, 2012.
[10] A. Lehtiniemi and J. Seppänen. Evaluation of Automatic Mobile
Playlist Generator. In Proc. MC, pages 452–459, 2007.
[11] B. McFee and G. R. Lanckriet. The Natural Language of Playlists. In
Proc. ISMIR, pages 537–542, 2011.
[12] G. Reynolds, D. Barry, T. Burke, and E. Coyle. Interacting With
Large Music Collections: Towards the Use of Environmental Metadata. In
Proc. ICME, pages 989–992, 2008.
[13] A. M. Sarroff and M. Casey. Modeling and Predicting Song
Adjacencies In Commercial Albums. In Proc. SMC, 2012.
[14] M. Slaney and W. White. Measuring Playlist Diversity for
Recommendation Systems. In Proc. AMCMM, pages 77–82, 2006.