Analyzing the Characteristics of Shared Playlists for Music Recommendation

Dietmar Jannach, Iman Kamehkhosh, Geoffray Bonnin
TU Dortmund, Germany
dietmar.jannach@tu-dortmund.de, iman.kamehkhosh@tu-dortmund.de, geoffray.bonnin@tu-dortmund.de

ABSTRACT
The automated generation of music playlists – as supported by modern music services like last.fm or Spotify – represents a special form of music recommendation. When designing a "playlisting" algorithm, the question arises which kind of quality criteria the generated playlists should fulfill and if there are certain characteristics like homogeneity, diversity or freshness that make the playlists generally more enjoyable for the listeners. In our work, we aim to obtain a better understanding of such desired playlist characteristics in order to be able to design better algorithms in the future. The research approach chosen in this work is to analyze several thousand playlists that were created and shared by users on music platforms based on musical and meta-data features. Our first results, for example, reveal that factors like popularity, freshness and diversity play a certain role for users when they create playlists manually. Comparing such user-generated playlists with automatically created ones moreover shows that today's online playlisting services sometimes generate playlists which are quite different from user-created ones. Finally, we compare the user-created playlists with playlists generated with a nearest-neighbor technique from the research literature and observe even stronger differences. This last observation can be seen as another indication that the accuracy-based quality measures from the literature are probably not sufficient to assess the effectiveness of playlisting algorithms.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval; H.5.5 [Information Interfaces and Presentation]: Sound and Music Computing

General Terms
Playlist generation, Music recommendation

Keywords
Music, playlist, analysis, algorithm, evaluation

Proceedings of the 6th Workshop on Recommender Systems and the Social Web (RSWeb 2014), collocated with ACM RecSys 2014, 10/06/2014, Foster City, CA, USA. Copyright held by the authors.

1. INTRODUCTION
The automated creation of playlists or personalized radio stations is a typical feature of today's online music platforms and music streaming services. In principle, standard recommendation algorithms based on collaborative filtering or content-based techniques can be applied to generate a ranked list of musical tracks given some user preferences or past listening history. For several reasons, the generation of playlists however represents a very specific music recommendation problem. Personal playlists are, for example, often created with a certain goal or usage context (e.g., sports, relaxation, driving) in mind. Furthermore, in contrast to relevance-ranked recommendation lists used in other domains, playlists typically obey some homogeneity and coherence criteria, i.e., there are quality characteristics that are related to the transitions between the tracks or to the playlist as a whole.

In the research literature, a number of approaches for the automation of the playlist generation process have been proposed, see, e.g., [2, 6, 8, 10, 11] or the recent survey in [3]. Some of them, for example, take a seed song or artist as an input and look for similar tracks; others try to find track co-occurrence patterns in existing playlists. In some approaches, playlist generation is considered as an optimization problem. Independent of the chosen technique, a common problem when designing new playlisting algorithms is to assess whether or not the generated playlists will be positively perceived by the listeners. User studies and online experiments are unfortunately particularly costly in the music domain. Researchers therefore often use offline experimental designs and, for example, use existing playlists shared by users on music platforms as a basis for their evaluations. The assumption is that these "hand-crafted" playlists are of good quality; typical measures used in the literature include the Recall [8] or the Average Log-Likelihood (ALL) [11]. Unfortunately, both measures have their limitations, see also [2]. The Recall measure, for example, tells us how good an algorithm is at predicting the tracks selected by the users, but does not explicitly capture specific aspects such as the homogeneity or the smoothness of track transitions.

To design better and more comprehensive quality measures, we however first have to answer the question of what users consider to be desirable characteristics of playlists or what the driving principles are when users create playlists. In the literature, a few works have studied this aspect using different approaches, e.g., user studies [1, 7] or analyzing forum posts [5]. The work presented in this paper continues these lines of research. Our research approach is however different from previous works as we aim to identify patterns in a larger set of manually created playlists that were shared by users of three different online music platforms. To be able to take a variety of potential driving factors into account in our analysis, we have furthermore collected various types of meta-data and musical features of the playlist tracks from public music databases.

Overall, with our analyses we hope to obtain insights on the principles which an automated playlist generation system should observe to end up with better-received or more "natural" playlists. To test if current music services and a nearest-neighbor algorithm from the literature generate playlists that observe the identified patterns and make similar choices as real users, we conducted an experiment in which we analyzed commonalities and differences between automatically generated and user-provided playlists. Before reporting the details of our first analyses, we will first discuss previous works in the next section.

2. PREVIOUS WORKS
In [14], Slaney and White addressed the question if users have a tendency to create very homogeneous or rather diverse playlists. As a basis for determining the diversity, they relied on an objective measure based on genre information about the tracks. Each track was considered as a point in the genre space and the diversity was then determined by calculating the volume of an ellipsoid enclosing the tracks of the playlist. An analysis of 887 user-created playlists indicated that diversity can be considered to be a driving factor as users typically create playlists covering several genres.

Sarroff and Casey more recently [13] focused on track transitions in album playlists and made an analysis to determine if there are certain musical characteristics that are particularly important. One of the results of their investigation was that fade durations and the mean timbre of the beginnings and endings of consecutive tracks seem to have a strong influence on the ordering of the tracks.

Generally, our work is similar to [14] and [13] in that we rely on user-created ("hand-crafted") playlists and look at meta-data and musical features of the tracks to identify potentially important patterns. The aspects we cover in this paper were however not covered in their work, and our analysis is based on larger datasets.

Cunningham et al. [5], in contrast, relied on another form of track-related information and looked at the user posts in the forum of the Art of the Mix web site. According to their analysis, the typical principles for setting up the playlists mentioned by the creators were related to the artist, genre, style, event or activity, but also the intended purpose, context or mood. Some users also talked about the smoothness of track transitions and how many tracks of one single artist should be included in playlists. Placing the most "important" track at the end of a playlist was another strategy mentioned by some of the playlist creators.

A different form of identifying playlist creation principles is to conduct laboratory studies with users. The study reported in [7], for example, involved 52 subjects and indicated that the first and the last tracks can play an important role for the quality of a playlist. In another study, Andric and Haus [1] concluded that the ordering of tracks is not important when the playlist mainly contains tracks which the users like in general. Reynolds et al. [12] conducted an online survey which revealed that the context and environment, such as the location, activity or the weather, can have an influence both on the listeners' mood and on the track selection behavior of playlist creators. Finally, the study presented in [9] again confirmed the importance of artists, genres and mood in the playlist creation process.

In this discussion, we have focused on previous attempts to understand how users create playlists and what their characteristics are. Playlist generation algorithms however do not necessarily have to rely on such knowledge. Instead, one can follow a statistical approach and only look at co-occurrences and transitions of tracks in existing playlists and use these patterns when creating new playlists, see e.g., [2] or [4]. This way, the quality factors respected by human playlist creators are implicitly taken into account. Such approaches, however, cannot be directly applied for many types of playlist generation settings, e.g., for creating "thematic" playlists (e.g., Christmas songs) or for creating playlists that only contain tracks that have certain musical features. Pure statistical methods are not aware of these characteristics and the danger exists that tracks are included that do not match the purpose of the list and thus lead to a limited overall quality.

3. CHARACTERISTICS OF PLAYLISTS
The ultimate goal of our research is to analyze the structure and characteristics of playlists in order to better understand the principles used by the users to create them. This section is a first step toward this goal.

3.1 Data sources
As a basis for the first analyses that we report in this paper, we used two types of playlist data.

3.1.1 Hand-crafted playlists
We used samples of hand-crafted playlists from three different sources. One set of playlists was retrieved via the public API of last.fm^1, one was taken from the Art of the Mix (AotM) website^2, and a third one was provided to us by 8tracks^3. To enhance the data quality, we corrected artist misspellings using the API of last.fm. Overall, we analyzed over 10,000 playlists containing about 108,000 different tracks of about 40,000 different artists. As a first attempt toward our goal, we retrieved the features listed in Table 1 using the public APIs of last.fm and The Echo Nest (tEN), and the MusicBrainz database.

Some dataset characteristics are shown in Table 2. The "usage count" statistics express how often tracks and artists appeared overall in the playlists. When selecting the playlists, we made sure that they do not simply contain album listings. The datasets are partially quite different, e.g., with respect to the average playlist lengths. The 8tracks dataset furthermore has the particularity that users are not allowed to include more than two tracks of one artist in case they want to share their playlist with others.

Figure 1 shows the distributions of playlist lengths. As can be seen, the distributions are quite different across the datasets.
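Statistics of the kind reported in Table 2 boil down to simple counting over the playlist collection. The following sketch shows how the sizes, per-playlist averages and "usage counts" can be derived; the playlist representation (each playlist a list of (artist, track) pairs) is a hypothetical format chosen for illustration, not the one used in our pipeline.

```python
from collections import Counter

def dataset_statistics(playlists):
    """Compute basic statistics of a playlist collection.

    `playlists` is assumed to be a list of playlists, each playlist
    a list of (artist, track) pairs -- an illustrative format.
    """
    track_counts = Counter()   # how often each track appears overall
    artist_counts = Counter()  # how often each artist appears overall
    for playlist in playlists:
        for artist, track in playlist:
            track_counts[(artist, track)] += 1
            artist_counts[artist] += 1
    n_playlists = len(playlists)
    return {
        "playlists": n_playlists,
        "tracks": len(track_counts),
        "artists": len(artist_counts),
        # average playlist length and artist variety per playlist
        "avg_tracks_per_playlist": sum(len(p) for p in playlists) / n_playlists,
        "avg_artists_per_playlist":
            sum(len({a for a, _ in p}) for p in playlists) / n_playlists,
        # "usage counts": average number of appearances per track/artist
        "avg_track_usage": sum(track_counts.values()) / len(track_counts),
        "avg_artist_usage": sum(artist_counts.values()) / len(artist_counts),
    }
```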
^1 http://www.last.fm
^2 http://www.artofthemix.org
^3 http://8tracks.com

Table 1: Additional retrieved information.

Source  | Information  | Description
last.fm | Tags         | Top tags assigned by users to the track.
last.fm | Playcounts   | Total number of times the users played the track.
tEN     | Genres       | Genres of the artist of the track. Multiple genres can be assigned to a single artist.
tEN     | Danceability | Suitability of the track for dancing, based on various information including the beat strength and the stability of the tempo.
tEN     | Energy       | Intensity released throughout the track, based on various information including the loudness and segment durations.
tEN     | Loudness     | Overall loudness of the track in decibels (dB).
tEN     | Tempo        | Speed of the track estimated in beats per minute (BPM).
tEN     | Hotttnesss   | Current reputation of the track based on its activity on some web sites crawled by the developers.
MB      | Release year | Year of release of the corresponding album.

Table 2: Some basic statistics of the datasets.

                        | last.fm | AotM   | 8tracks
Playlists               | 1,172   | 5,043  | 3,130
Tracks                  | 24,754  | 61,935 | 29,732
Artists                 | 9,925   | 23,029 | 13,379
Avg. tracks/playlist    | 26.0    | 19.7   | 12.5
Avg. artists/playlist   | 16.8    | 17.8   | 11.5
Avg. genres/playlist    | 2.7     | 3.5    | 3.4
Avg. tags/playlist      | 473.4   | 418.7  | 297.4
Avg. track usage count  | 1.2     | 1.6    | 1.3
Avg. artist usage count | 3.0     | 4.3    | 2.9

[Figure 1: Distribution of playlist sizes (frequency of each playlist length for the AotM, last.fm and 8tracks datasets).]

On 8tracks, a playlist generally has to comprise at least 8 tracks. The lengths of the last.fm playlists seem to follow a normal distribution with a maximum frequency value at around 20 tracks. Finally, the sizes of the AotM playlists are much more equally distributed.

3.1.2 Generated playlists
To assess if the playlists generated by today's online services are similar to those created by users, we used the public API of The Echo Nest. We chose this service because it uses a very large database and allows the generation of playlists from several seed tracks, as opposed to, for instance, iTunes Genius or last.fm radios. We split the existing hand-crafted playlists in half, provided the first half of the list as seed tracks to the music service, and then analyzed the characteristics of the playlist returned by The Echo Nest and compared them to the patterns that we found in hand-crafted playlists. Instead of observing whether a playlister generates playlists that are generally similar to playlists created by hand, our goal here is to break down their different characteristics and observe on which specific dimensions they differ. Notice that using the second half as seed would not be appropriate as the order of the tracks may be important.

We also draw our attention to the ability of the algorithms of the literature to reproduce the characteristics of hand-crafted playlists. According to some recent research, one of the most competitive approaches in terms of recall is the simple k-nearest-neighbors (kNN) method [2, 8]. More precisely, given some seed tracks, the algorithm extracts the k most similar playlists based on the number of shared items and recommends the tracks of these playlists. This algorithm does not require a training step and scans the entire set of available playlists for each recommendation.

3.2 Detailed observations
In the following sections, we will look at general distributions of different track characteristics.

3.2.1 Popularity of tracks
The goal of the first analysis here is to determine if users tend to position tracks in playlists depending on their popularity. In our analysis, we measure the popularity in terms of play counts. Play counts were taken from last.fm, because this is one of the most popular services and the corresponding values can be considered indicative for a larger user group.

For the measurement, we split the playlists into two parts of equal size and then determined the average play counts on last.fm for the tracks of each half. To measure to which extent the user community favors certain tracks in the playlists, we calculated the Gini index, a standard measure of inequality^4. Table 3 shows the results. In the last column, we report the statistics for the tracks returned by The Echo Nest (tEN) and kNN playlisters^5. We provided the first half of the hand-crafted playlists as seed tracks and the playlisters had to select the same number of tracks as the number of remaining tracks.

^4 We organized the average play counts in 100 bins.
^5 We determined 10 as the best neighborhood size for our datasets based on the recall value, see Section 4.

Table 3: Popularity of tracks in playlists (last.fm play counts) and concentration bias (Gini coefficient).

Play counts | 1st half | 2nd half | tEN
last.fm     | 1,007k   | 893k     | 629k
AotM        | 671k     | 638k     | 606k
8tracks     | 953k     | 897k     | 659k

Gini index  | 1st half | 2nd half | tEN
last.fm     | 0.06     | 0.04     | 0.04
AotM        | 0.20     | 0.18     | 0.22
8tracks     | 0.09     | 0.09     | 0.08

Play counts | 1st half | 2nd half | kNN
last.fm     | 1,110k   | 943k     | 1,499k
AotM        | 645k     | 617k     | 867k
8tracks     | 1,008k   | 984k     | 1,140k

Gini index  | 1st half | 2nd half | kNN
last.fm     | 0.12     | 0.09     | 0.33
AotM        | 0.26     | 0.23     | 0.43
8tracks     | 0.15     | 0.12     | 0.28

The results show that users actually tend to place more popular items in the first part of the list in all datasets when play counts are considered. The Echo Nest playlister does not seem to take that form of popularity into account and recommends on average less popular tracks. These differences are statistically significant according to a Student's t-test (p < 10^-5 for The Echo Nest playlister and p < 10^-7 for the kNN playlister). This behavior also indicates that The Echo Nest is successfully replicating the fact that the second halves of playlists are supposed to be less popular than the first halves.

The Gini index reveals that there is a slightly stronger concentration on some tracks in the first half for two of the three datasets, and the diversity slightly increases in the second part. The absolute numbers cannot be directly compared across datasets, but for the AotM dataset the concentration is generally much higher, which is also indicated by the higher "track reuse" in Table 2. Interestingly, The Echo Nest playlister quite nicely reproduces the behavior of real users with respect to the diversity of popularity.

In the lower part of Table 3, we show the results for the kNN method. Note that these statistics are based on a different sample of the playlists than the previous measurement. The reason is that both The Echo Nest and the kNN playlisters cannot produce playlists for all of the first halves provided as seed tracks. We therefore considered only playlists for which the corresponding algorithm could produce a playlist.

Unlike the playlister of The Echo Nest, the kNN method has a strong trend to recommend mostly very popular items. This can be caused by the fact that the kNN method by design recommends tracks that are often found in similar playlists. Moreover, based on the lower half of Table 3, the popularity correlates strongly with the seed track popularity. As a result, the kNN method shows a potentially undesirable trend to reinforce already popular items for everyone. At the same time, it concentrates the track selection on a comparably small number of tracks, as indicated by the very high values for the Gini coefficient.

3.2.2 The role of freshness
Next, we analyzed if there is a tendency of users to create playlists that mainly contain recently released tracks. As a measure, we compared the creation year of each playlist with the average release year of its tracks. We limit our analysis to the last.fm and 8tracks datasets because we could only acquire creation dates for these two.

[Figure 2: Distribution of average freshness of playlists (comparing playlist creation date and track release date); x-axis: average freshness in years, y-axis: relative frequency, for last.fm and 8tracks.]

Figure 2 shows the statistics for both datasets. We organized the data points in bins (x-axis), where each bin represents an average-freshness level, and then counted how many playlists fall into these levels. The relative frequencies are shown on the y-axis. The results are very similar for both datasets, with a slight tendency to include older tracks for last.fm. On both datasets, more than half of the playlists contain tracks that were released on average in the last 5 years, the most frequent average age being between 4 and 5 years for last.fm and between 3 and 4 years for 8tracks. Similarly, on both datasets, more than 75% of the playlists contain tracks that were released on average in the last 8 years.

We also analyzed the standard deviation of the resulting freshness values and observed that more than half of the playlists have a standard deviation of less than 4 years, while more than 75% have a standard deviation of less than 7 years on both datasets. Overall, this suggests that playlists made by users are often homogeneous with regard to the release date.

Computing the freshness for the generated playlists would require configuring the playlisters in such a way that they select only tracks that were not released after the playlists' creation years. Unfortunately, The Echo Nest does not allow such a configuration. Moreover, for the kNN approach, the playlists that are more recent would have to be ignored, which would lead to a too small sample size and not very reliable results anymore.

3.2.3 Homogeneity and diversity
Homogeneity and diversity can be determined in a variety of ways. In the following, we will use simple measures based on artist and genre counts. The genres correspond to the genres of the artists of the tracks retrieved from The Echo Nest. Basic figures for artist and genre diversity are already given in Table 2. On AotM, for example, having several tracks of an artist in a playlist is not very common^6. On last.fm, we in contrast very often see two or more tracks of one artist in a playlist. A similar, very rough estimate can be made for the genre diversity. If we ordered the tracks of a playlist by genre, we would encounter a different genre on last.fm only after having listened to about 10 tracks. On AotM and 8tracks, in contrast, playlists on average cover more genres.

^6 On 8tracks, artist repetitions are limited due to license constraints.

Table 4 shows the diversities of the first and second halves of the hand-crafted playlists, and for the automatic selections using the first halves as seeds. As a measure of diversity, we simply counted the number of artists and genres and divided by the corresponding number of tracks. The values in Table 4 correspond to the averages of these diversity measures.

[Figure 3: Distribution of The Echo Nest track musical features independently of playlists; relative frequencies on a normalized scale for Energy [0,1], Hotttnesss [0,1], Loudness [-100,100], Danceability [0,1] and Tempo [0,500].]
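This per-playlist diversity measure (distinct artists or genres divided by the number of tracks) can be sketched as follows. The playlist representation, with each track carrying an artist name and a set of genre labels, is illustrative only:

```python
def artist_diversity(playlist):
    """Distinct artists divided by the number of tracks.

    `playlist` is assumed to be a list of (artist, genres) pairs,
    where `genres` is a set of genre labels -- a hypothetical
    format chosen for illustration.
    """
    artists = {artist for artist, _ in playlist}
    return len(artists) / len(playlist)

def genre_diversity(playlist):
    """Distinct genres divided by the number of tracks.

    The value can exceed 1, since a single artist may be
    assigned several genres.
    """
    genres = set()
    for _, track_genres in playlist:
        genres.update(track_genres)
    return len(genres) / len(playlist)
```

Averaging these per-playlist values over all playlists of a dataset (or over their first and second halves) yields figures of the kind reported in Table 4.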
Table 4: Diversity of playlists (number of artists and genres divided by the corresponding number of tracks).

        |         | 1st half | 2nd half | tEN
last.fm | artists | 0.74     | 0.76     | 0.93
        | genres  | 2.26     | 2.30     | 2.12
AotM    | artists | 0.93     | 0.93     | 0.94
        | genres  | 3.26     | 3.22     | 2.41
8tracks | artists | 0.97     | 0.98     | 0.99
        | genres  | 3.74     | 3.85     | 2.89

        |         | 1st half | 2nd half | kNN
last.fm | artists | 0.74     | 0.76     | 0.87
        | genres  | 2.32     | 2.26     | 3.11
AotM    | artists | 0.94     | 0.94     | 0.91
        | genres  | 3.27     | 3.21     | 3.70
8tracks | artists | 0.97     | 0.98     | 0.93
        | genres  | 3.94     | 3.92     | 4.06

Regarding the diversity of the hand-crafted playlists, the tables show that users tend to keep the same level of artist and genre diversity throughout the playlists. We can also notice that the playlists of last.fm are much more homogeneous. The diversity values of the automatic selections reveal several things. First, The Echo Nest playlister tends to always maximize the artist diversity independently of the diversity of the seeds; on the contrary, the kNN playlister lowered the initial artist diversities, except on the last.fm dataset, where it increased them, though less than The Echo Nest playlister. Regarding the genre diversity, we can observe an opposite tendency for the two playlisters: The Echo Nest playlister tends to reduce the genre diversity while the kNN playlister tends to increase it. Again, these differences are statistically significant (p < 0.03 for The Echo Nest playlister and p < 0.006 for the kNN playlister). Overall, the resulting diversities of both approaches tend to be rather dissimilar to those of the hand-crafted playlists.

3.2.4 Musical features (The Echo Nest)
Figure 3 shows the overall relative frequency distribution of the numerical features from The Echo Nest listed in Table 1 for the set of tracks appearing in our playlists on a normalized scale. For the loudness feature, for example, we see that most tracks have values between 40 and 50 on the normalized scale. This would translate into an actual loudness value of -20 to 0 returned by The Echo Nest, given that the range is -100 to 100.

To understand if people tend to place tracks with specific feature values into their playlists, we then computed the distribution of the average feature values of each playlist. Figure 4 shows the results of this measurement for the energy and "hotttnesss" features. For all the other features (danceability, loudness and tempo), the distributions were similar to those of Figure 3, which could mean that they are generally not particularly important for the users.

[Figure 4: Distribution of mean energy and "hotttnesss" levels in playlists; relative frequencies for 8tracks, AotM and last.fm.]

When looking at the energy feature, we see that users tend to include tracks from a comparably narrow energy spectrum with a low average energy level, even though there exist more high-energy tracks in general, as shown in Figure 3. A similar phenomenon of concentration on a certain range of values can be observed for the "hotttnesss" feature. As a side aspect, we can observe that the tracks shared on AotM are on average slightly less "hottt" than those of both other platforms^7.

We finally draw our attention to the feature distributions of the generated playlists. Figure 5 as an example shows the distributions of the energy and "hotttnesss" factors for the first halves and second halves of the playlists of all three datasets, together with the distributions of the tracks selected by The Echo Nest and kNN playlisters.

[Figure 5: Comparison of the distribution of energy and "hotttnesss" levels for hand-crafted and generated playlists (1st half, 2nd half, tEN, kNN10).]

The figure shows that The Echo Nest playlister tends to produce a distribution that is quite similar to the distribution of the seed tracks. The kNN playlister, in contrast, tends to concentrate the distributions toward the maximum values of the distributions of the seeds. We could observe this phenomenon of concentration for all the features on all three datasets, except for the danceability on the AotM dataset.

3.2.5 Transitions and Coherence
We now focus on the importance of transitions between the tracks, and define the coherence of a playlist as the average similarity between its consecutive tracks. Such similarities can be computed according to various criteria. We used the binary cosine similarity of the genres and artists^8, and the Euclidean linear similarity for the numerical track features of The Echo Nest. Table 5 shows the corresponding results for the first and second halves of the hand-crafted playlists, and for the automatic selections using the first halves as seeds.

Table 5: Coherence of first, second and generated halves.

        |            | 1st half | 2nd half | tEN
last.fm | artists    | 0.19     | 0.18     | 0
        | genres     | 0.43     | 0.40     | 0.56
        | energy     | 0.76     | 0.71     | 0.77
        | hotttnesss | 0.81     | 0.76     | 0.83
AotM    | artists    | 0.05     | 0.05     | 0
        | genres     | 0.24     | 0.22     | 0.50
        | energy     | 0.75     | 0.74     | 0.75
        | hotttnesss | 0.83     | 0.82     | 0.85
8tracks | artists    | 0.02     | 0.01     | 0
        | genres     | 0.22     | 0.22     | 0.52
        | energy     | 0.73     | 0.71     | 0.76
        | hotttnesss | 0.81     | 0.79     | 0.85

        |            | 1st half | 2nd half | kNN
last.fm | artists    | 0.22     | 0.21     | 0.02
        | genres     | 0.44     | 0.42     | 0.14
        | energy     | 0.76     | 0.76     | 0.75
        | hotttnesss | 0.83     | 0.82     | 0.83
AotM    | artists    | 0.05     | 0.05     | 0.03
        | genres     | 0.22     | 0.21     | 0.13
        | energy     | 0.75     | 0.74     | 0.73
        | hotttnesss | 0.83     | 0.82     | 0.84
8tracks | artists    | 0.02     | 0.01     | 0.03
        | genres     | 0.22     | 0.22     | 0.17
        | energy     | 0.74     | 0.73     | 0.74
        | hotttnesss | 0.82     | 0.80     | 0.84

We can first see that for all datasets and for all criteria, the second halves of the playlists have a lower coherence than the first halves. If we assume that the coherence is representative of the effort of the users to create good playlists, then the tracks of the second halves seem to be slightly less carefully selected than those of the first halves. Another interesting phenomenon is the high artist coherence values on the last.fm dataset. These values indicate that last.fm users have a surprisingly strong tendency to group tracks from the same artist together, which was not successfully reproduced by the two playlisters. Both playlisters actually seem to have a tendency to always produce the same coherence values, independently of the coherence values of the seed. A last interesting result is the high coherence of artist genres on the AotM and 8tracks datasets; the high genre coherence values on last.fm can be explained by the high artist coherence values.

4. STANDARD ACCURACY METRICS
Our analysis so far has revealed some particular characteristics of user-created playlists. Furthermore, we observed that the nearest-neighbor playlisting scheme can produce playlists that are quite different from those generated by the commercial Echo Nest service, e.g., in terms of average track popularity (Table 3). In the research literature, "hit rates" (recall) and the average log-likelihood (ALL) are often used to compare the quality of playlists generated by different algorithms [2, 8, 11].

^7 The results for the "hotttnesss" we report here correspond to the values at the time when we retrieved the data using the API of The Echo Nest, and not to those at the time when the playlists were created. This is not important as we do not look at the distributions independently, but compare them to the distributions in Figure 3.
^8 In the case of artists, this means that the similarity equals
1 if both tracks have the same artist, and 0 else. The metric thus measures the proportion of cases in which the users consecutively selected tracks from the same artist.

The goal of our next experiment was to find out how The Echo Nest playlister performs on these measures. As it is not possible to acquire probability values for the tracks selected by The Echo Nest playlister, the ALL cannot be used^9. In the following, we thus only focus on precision and recall.

The upper part of Figure 6 shows the recall values at list length 100 for the different datasets^10. Again, we split the playlists and used the first half as seed tracks. Recall was then computed by comparing the computed playlists with the "hidden" tracks of the original playlist. We measured recall for tracks, artists, genres and tags. The results show that the kNN method quite clearly outperforms the playlister of The Echo Nest on the recall measures across all datasets, except for the artist recall on the last.fm dataset. The differences are statistically significant for all the experiments except for the track and artist recall on last.fm (p < 10^-6) according to a Student's t-test. As expected, the kNN method leads to higher absolute values for larger datasets as more neighbors can be found.

With respect to the evaluation protocol, note that we only measured precision and recall when the playlister was able to return a playlist continuation given the seed tracks. This was however not always the case for both techniques. In Table 6, we therefore report the detailed coverage figures, which show that the kNN method was more often able to produce a playlist. If recall is measured for all seed playlists, the differences between the algorithms are even larger. When measuring precision for all playlists, the differences between the playlisters become very small.

Table 6: Coverage of the playlisters.

Dataset | tEN   | kNN
last.fm | 28.33 | 66.89
AotM    | 42.75 | 86.52
8tracks | 35.3  | 43.8
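The split-and-compare protocol described above can be sketched as follows. The generator interface and all names are illustrative; like in our experiments, playlists for which the playlister returns no continuation are skipped:

```python
def recall_and_precision(playlists, generate, list_length=100):
    """Seed a playlister with the first half of each playlist and
    score its continuation against the hidden second half.

    `playlists` is a list of track lists; `generate(seed, n)` is any
    playlister returning up to `n` recommended tracks, or None when
    it cannot produce a continuation for the given seed.
    """
    recalls, precisions = [], []
    for playlist in playlists:
        mid = len(playlist) // 2
        seed, hidden = playlist[:mid], playlist[mid:]
        recommended = generate(seed, list_length)
        if not recommended:
            continue  # only the "covered" cases are measured
        hits = len(set(recommended) & set(hidden))
        recalls.append(hits / len(hidden))
        precisions.append(hits / len(recommended))
    # average over the covered playlists
    return (sum(recalls) / len(recalls),
            sum(precisions) / len(precisions))
```

The same scheme extends to artist, genre or tag recall by mapping both the recommended and the hidden tracks to the corresponding meta-data before intersecting.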
The lower part of Figure 6 presents the precision results. The precision values for tracks are, as expected, very low and close to zero, which is caused by the huge set of possible tracks and the list length of 100. We can however observe a higher precision for the kNN method on the AotM dataset (p < 10⁻¹¹), which is the largest dataset. Regarding artist, genre and tag prediction, The Echo Nest playlister led to a higher precision (p < 10⁻³) than the kNN playlister on all datasets.

[Bar charts omitted: track, artist, genre and tag recall (upper part) and precision (lower part) for The Echo Nest playlister and the kNN method on the last.fm, AotM and 8tracks datasets.]
Figure 6: Recall and Precision for the covered cases.

Overall, measuring precision and recall when comparing generated playlists with those provided by users in our view represents only one particular form of assessing the quality of a playlist generator and should be complemented with additional measures. Precision and recall as measured in our experiments for example do not consider track transitions. There is also no "punishment" if a generated playlist contains individual non-fitting tracks that would hurt the listener's overall enjoyment.

⁹Another possible measure is the Mean Reciprocal Rank (MRR). Applied to playlist generation, one limitation of this metric is that it corresponds to the assumption that the rank of the test track or artist to predict should be as high as possible in the recommendation list, although many other tracks or artists may be more relevant and should be ranked before them.
¹⁰We could not measure longer list lengths, as 100 is the maximum playlist length returned by The Echo Nest.

5. PUBLIC AND PRIVATE PLAYLISTS

Some music platforms, and in particular 8tracks, let their users create "private" playlists, which are not visible to others, and public ones that are, for example, shared and used for social interaction like parties, motivation for team sports, or romantic evenings. The question arises whether public playlists have different characteristics than those that were created for personal use only, e.g., because sharing playlists to some extent can also serve the purpose of creating a public image of oneself.

We made an initial analysis on the 8tracks dataset. Table 7 shows the average popularity of the tracks in the 8tracks playlists depending on whether they were in "public" or "private" playlists (the first category contains 2679 playlists and the second 451). As can be seen, the tracks of the private playlists are much more popular on average than the tracks in the public playlists. Moreover, as indicated by the corresponding Gini coefficients, the popular tracks are almost equally distributed across the playlists. Furthermore, Figure 7 shows the corresponding freshness values. We can see that the private playlists generally contained more recent tracks than the public playlists.

                    Play counts   Gini index
Public playlists    870k          0.20
Private playlists   935k          0.06

Table 7: Popularity of tracks in 8tracks public and private playlists and Gini index.

[Histogram omitted: relative frequency of the average track age (0–30 years) for public and private playlists.]
Figure 7: Distribution of average freshness of 8tracks public and private playlists.

These results can be interpreted in at least two different ways. First, users might create some playlists for their personal use to be able to repeatedly listen to the latest popular tracks. They probably do not share these playlists because sharing a list of current top hits might be of limited value for other platform members, who might generally be more interested in discovering not so popular artists and tracks. Second, users might deliberately share playlists with less popular or less known artists and tracks to create a social image on the platform.

Given these first observations, we believe that our approach has some potential to help us better understand some elements of user behavior on social platforms in general, i.e., that people might not necessarily only share tracks that match their actual taste.

6. SUMMARY AND OUTLOOK

The goal of our work is to gain a better understanding of how users create playlists in order to be able to design future playlisting algorithms that take these "natural" characteristics into account. The first results reported in this paper indicate, for example, that features like track freshness, popularity aspects, or homogeneity of the tracks are relevant for users, but not yet fully taken into account by algorithms that are currently considered to create high-quality playlists in the literature. Overall, the observations also indicate that additional metrics might be required to assess the quality of computer-generated playlists in experimental settings that are based on historical data such as existing playlists or listening logs.

Given the richness of the available data, many more analyses are possible. Currently, we are exploring "semantic" characteristics to automatically identify the underlying theme or topic of the playlists. Another aspect not considered so far in our research is the popularity of the playlists. For some music platforms, listening counts and "like" statements for playlists are available. This additional information can be used to further differentiate between "good" and "bad" playlists and help us obtain more fine-granular differences with respect to the corresponding playlist characteristics.

Last, we plan to extend our experiments and analysis by considering other music services, in particular last.fm radios, and other playlisting algorithms, in particular algorithms that exploit content information.

7. REFERENCES

[1] A. Andric and G. Haus. Estimating Quality of Playlists by Sight. In Proc. AXMEDIS, pages 68–74, 2005.
[2] G. Bonnin and D. Jannach. Evaluating the Quality of Playlists Based on Hand-Crafted Samples. In Proc. ISMIR, pages 263–268, 2013.
[3] G. Bonnin and D. Jannach. Automated Generation of Music Playlists: Survey and Experiments. ACM Computing Surveys, 47(2), 2014.
[4] S. Chen, J. L. Moore, D. Turnbull, and T. Joachims. Playlist Prediction via Metric Embedding. In Proc. KDD, pages 714–722, 2012.
[5] S. Cunningham, D. Bainbridge, and A. Falconer. 'More of an Art than a Science': Supporting the Creation of Playlists and Mixes. In Proc. ISMIR, pages 240–245, 2006.
[6] A. Flexer, D. Schnitzer, M. Gasser, and G. Widmer. Playlist Generation Using Start and End Songs. In Proc. ISMIR, pages 173–178, 2008.
[7] D. L. Hansen and J. Golbeck. Mixing It Up: Recommending Collections of Items. In Proc. CHI, pages 1217–1226, 2009.
[8] N. Hariri, B. Mobasher, and R. Burke. Context-Aware Music Recommendation Based on Latent Topic Sequential Patterns. In Proc. RecSys, pages 131–138, 2012.
[9] M. Kamalzadeh, D. Baur, and T. Möller. A Survey on Music Listening and Management Behaviours. In Proc. ISMIR, pages 373–378, 2012.
[10] A. Lehtiniemi and J. Seppänen. Evaluation of Automatic Mobile Playlist Generator. In Proc. MC, pages 452–459, 2007.
[11] B. McFee and G. R. Lanckriet. The Natural Language of Playlists. In Proc. ISMIR, pages 537–542, 2011.
[12] G. Reynolds, D. Barry, T. Burke, and E. Coyle. Interacting With Large Music Collections: Towards the Use of Environmental Metadata. In Proc. ICME, pages 989–992, 2008.
[13] A. M. Sarroff and M. Casey. Modeling and Predicting Song Adjacencies in Commercial Albums. In Proc. SMC, 2012.
[14] M. Slaney and W. White. Measuring Playlist Diversity for Recommendation Systems. In Proc. AMCMM, pages 77–82, 2006.