=Paper=
{{Paper
|id=None
|storemode=property
|title=Using Song Social Tags and Topic Models to Describe and Compare Playlists
|pdfUrl=https://ceur-ws.org/Vol-633/wom2010_paper2.pdf
|volume=Vol-633
}}
==Using Song Social Tags and Topic Models to Describe and Compare Playlists==
<pdf width="1500px">https://ceur-ws.org/Vol-633/wom2010_paper2.pdf</pdf>
<pre>
Using Song Social Tags and Topic Models to Describe and
                   Compare Playlists

                                  Ben Fields, Christophe Rhodes and Mark d’Inverno
                                                        Department of Computing
                                                      Goldsmiths University of London
                                                               New Cross
                                                           London, SE14 6NW
                                                             United Kingdom
                                          [b.fields | c.rhodes | dinverno]@gold.ac.uk

ABSTRACT                                                                   or will be played after the current recommended song. Yet
Playlists are a natural delivery method for music recom-                   little is understood about how playback order affects the
mendation and discovery systems. Recommender systems                       success or failure of a recommendation of a piece of music.
offering playlists must strive to make them relevant and en-               Whether a system makes user-based, object-based or hybrid
joyable. In this paper we survey many current means of gen-                recommendations, a better awareness and use of playback
erating and evaluating playlists. We present a means of com-               order will yield an improved music recommender system.
paring playlists in a reduced dimensional space through the                   In order to take advantage of the effect of playback order,
use of aggregated tag clouds and topic models. To evaluate                 it is necessary to have some means of comparing playlists
the fitness of this measure, we perform prototypical retrieval             with one another. While ratings-based generic recommender
tasks on playlists taken from radio station logs gathered from             strategies could be employed, such techniques could only
Radio Paradise and Yes.com, using tags from Last.fm with                   be used in systems which allow for the rating of playlists
the result showing better than random performance when                     directly (as opposed to the much more common rating of
using the query playlist’s station as ground truth, while fail-            member songs).Alternatively, n distance measure between
ing to do so when using time of day as ground truth. We then               playlists can be used to facilitate the prediction and gen-
discuss possible applications for this measurement technique               eration of well-ordered lists of song sequences for recom-
as well as ways it might be improved.                                      mendation. This has the advantage being applicable to the
                                                                           vast majority of existing playlist generation systems, many
                                                                           of which do not to collect playlist level ratings from their
Categories and Subject Descriptors                                         users. Further, a measure of playlist distance has a number
H.5.5 [Sound and Music Computing]: Signal analysis,                        of other applications in music recommender and discovery
synthesis, and processing; H.5.1 [Multimedia Informa-                      systems including label propagation, predictive personaliza-
tion Systems]: Evaluation/methodology                                      tion and context tuning to name a few.
                                                                              In this paper we propose an objective distance measure be-
Keywords                                                                   tween playlists. To better understand why such a measure
                                                                           is needed, Section 2 provides background information in ex-
LDA, Topic Models, playlists, music, similarity, information               isting playlist generation and evaluation techniques. While
retrieval, metric space, social tags                                       any sufficiently expressive and low-dimensional feature is
                                                                           compatible with our playlist measure, we use a novel so-
1.    INTRODUCTION                                                         cial tag-based feature in this paper. This song-level feature
   Inherent to the design of any recommender or retrieval                  is detailed in Section 3. This is followed by an explanation
system is a means of display or delivery of selected content.              of our distance measurement itself in Section 4. Putting
For a system that recommends music this means playback                     this into practice, we detail some proof of concept evalua-
of an audio file. Listening to or playing a piece of music                 tion in Section 5. We discuss the results of this evaluation
take the length time of that piece of music. Given this link               and possible extensions in Section 6.
between music and time, when considering what information
is relevant for a recommendation it is vital to consider the               2.   PLAYLIST AS DELIVERY MECHANISM
context of time; that is, what music has been played before
                                                                              In this section we survey the use of playlists in the de-
                                                                           livery of content in existing recommendation and retrieval
                                                                           systems. This is followed by a review of current evaluation
                                                                           methods for generated playlists. These two survey points
                                                                           will show both the widespread use of playlist generation in
WOMRAD 2010 Workshop on Music Recommendation and Discovery,                music recommendation and discovery systems and the need
colocated with ACM RecSys 2010 (Barcelona, SPAIN)                          for more quality evaluation of these systems.
Copyright c . This is an open-access article distributed under the terms      While this brief survey is focused on automatic playlist
of the Creative Commons Attribution License 3.0 Unported, which permits
unrestricted use, distribution, and reproduction in any medium, provided   generation, there is a wealth of both academic and lay work
the original author and source are credited.                               discussing various aspects manual human-driven playlist con-
struction that may be of interest to the reader. Work in this      A recent approach uses co-occurrence in n-grams extracted
area tends to deal with radio (e.g. [1]) or club and dance disc    from the internet radio station Radio Paradise1 to deform
jockeys (e.g. [13]), being the two principal areas where the       a content-based similarity space [26]. This deformed space
explicit construction of ordered lists of songs are tied to the    is then used in a manner that is similar to [18] to generate
field. It is with these areas of manual playlist construction      paths from one song to another, minimizing step distance
in mind that we will examine past efforts in both automatic        throughout the path.
playlist construction and evaluation techniques.                      Also of note is [31], which in contrast to most of the pre-
                                                                   vious systems, uses nearest neighbor co-occurrence in radio
2.1    Usage in the Wild                                           playlist logs to determine song similarity. While the evalua-
                                                                   tion was preliminary this method shows promise.
   There have been many music recommendation and re-
trieval systems that employ some kind of automatic playlist        2.2     Evaluation Methods
construction within their system. Frequently this is done as
a means of content delivery or, less often, as a way of facil-        The most prevalent method of evaluation used in playlist
itating human evaluation of an underlying process such as          generation systems is direct human evaluation by listening.
content-based music similarity or recommendation. What             The system detailed in [29], a rule-based automatic playlist
follows is a brief survey of existing methods of playlist gen-     generator that uses features derived from metadata, is simi-
eration both with and without human intervention.                  lar to [2, 30]. Of note in [29] is the thorough human listener
   A web based system for personalized radio is detailed           testing which shows the automatic playlist generator per-
in [20]. In this early system users create and publish playlists   forming considerably better than songs ordered randomly.
facilitated through a process analogous to collaborative fil-      This evaluation, though better than most, still fails to com-
tering. This results in quasi-automatic playlist creation,         pare the automatic playlists against human expert playlists.
with any sequence ordering depending entirely on the user.         Additionally, to reduce test time, the evaluation uses arbi-
Another variation of the social interaction intermediary is        trary one minute clips from the songs rather than the en-
shown in [27], which presents the Jukola system. This sys-         tirety of the song or an intentionally chosen segment. A
tem creates playlists via democratic vote on every song us-        content-based similarity playlist generator with a novel eval-
ing mobile devices of listeners in the same physical space.        uation is seen in [28]. Here the authors track the number
Furthering the ideas of collaborative human generation, [25]       times the user presses the skip button to move on from the
shows a system called Social Playlist. This system is based        currently playing song. All songs that are skipped are con-
on the idea of social interaction through playlist sharing,        sidered false positives and those that are completely played
integrating mobile devices and communal playback.                  are treated as true positives. From this many standard in-
   A fully automatic rule-based system is described in [2].        formation retrieval techniques can be used in the evaluation,
This system uses existing metadata such as artist name,            resulting in a rich understanding of the results. Ultimately,
song title, duration and beats per minute. The system is           it is still human user listening evaluation though and its
designed from the ground up to be scalable and is shown to         biggest drawback is playback time. Assuming an average
work given a database of 200000 tracks. An approach that           song length of five minutes it would take an an hour and 40
is derived from recommender systems is seen in [4]. Here the       minutes (per listener) to listen to 20 songs with no time for
authors use the ratings and personalization information to         the skipped songs. This skip-based evaluation framework is
derive radio for a group. An attempt to optimize a playlist        further used in [12] where existing last.fm user logs (which
based on known user preference as encoded in song selection        include skip behavior) are analyzed using fuzzy set theory to
patterns is shown in [30]. This effort uses Gaussian process       determine playlist generation heuristics in the system. Ad-
regression on user preference to infer playlists. The system       ditionally, many systems of playlist generation lack formal
uses existing a priori metadata as the features for selection.     evaluation all together.
A means of using webmining derived artist similarity with
content-based song similarity is used to automatically gener-
                                                                   2.3     Summary
ate playlists in [22]. This system combined these two spaces         While a number of techniques have been employed to cre-
in such a way as to minimize the use of signal analysis. A         ate playlists for a variety of functions, there exist limited
byproduct of this optimization is improved playlist genera-        techniques in the evaluation of generated playlists. These
tion as is shown in a small evaluation with human listeners.       evaluation techniques rely heavily on time consuming hu-
   The Poolcasting system is detailed in [5, 6]. Poolcasting       man evaluation. Beyond that, there is no studied means to
uses dynamic weighting of user preferences within a group of       objectively compare one playlist with another. In Section 4
users who are all listening to a common stream with the goal       we will propose just such a means. First we will describe a
of minimizing displeasure across the entire group. This re-        novel song level feature based on tags. A tag-based feature
sults in a system that is very similar to popular commercial       will encode socio-cultural data that is missing from analo-
radio in terms of its output. A method for created playlists       gous content-based features, though social tags bring about
using an artist social graph, weighted with acoustic similar-      some other problems.
ity is shown in [17]. This method takes a start and end song
and constructs a playlist using maximum flow analysis on the       3.     TOPIC-MODELED TAG-CLOUDS
weighted graph. Another technique for playlist construction          In order to encode playlists in a low dimensional repre-
based on the selection of paths between the start and end          sentation we must first represent their member songs in as a
songs is shown in [18]. In this system content-based similar-      low dimensional vector. Here we use a Topic-Modeled Tag
ity is used to project a set of songs onto a 2-D map, then a       Cloud (TMTC) as a pseudo-content-based feature, in a way
path is found from the start song to the end song with the
                                                                   1
goal of minimizing the step size between each member song.             http://radioparadise.com
                    Figure 1: The tag cloud for Bohemian Crapsody by Sickboy, from Last.fm.


                                                                                                                β
that is functionally analogous to various pure content-based
methods. Using tags and topic models in this way is novel
and what follows is an explanation of the process of building
this feature.

3.1    Tags as Representation
   A tag is a word or phrase used to describe a document
of some kind, typically on the Web. Various kinds of doc-
                                                                    α               θ                   z                 w
uments are described using tags on the Web including pho-                                                                     N
tos2 , videos3 and music4 . An aggregated collection of tags,                                                                     M
weighted by the number of users who ascribe it to a given
object, is commonly referred to as a tag cloud.                   Figure 2: The graphic model of LDA [11]. The repli-
   Tag clouds get their name from the most common visual-         cates are represented as the two boxes. The outer
ization method used with them, where each tag is displayed        box M represents the corpus of documents, while
with the font size in proportion to the weight, arranged in       the inner box N represents the repeating choice of
a way that resembles a cloud. An example of a tag cloud5          topics and words which make up each document.
can be seen in Figure 1 As can be seen in this example, tag
clouds provide a rich description of the music it describes.
Tags and collections of tags in various forms provide the ba-     we require. Topic models are described in [10] as “probabilis-
sis for many techniques within music informatics including        tic models for uncovering the underlying semantic structure
recommendation, retrieval and discovery applications [3,23].      of [a] document collection based on a hierarchical Bayesian
   In addition to human generated tags being used, there is       analysis of the original text.” In topic modeling, a document
some research directed toward the automatic application of        is transformed into a bag of words, in which all of the words
tags and inference of associated weights on unlabeled pieces      of a document are collected and the frequency of the occur-
of music [7, 9, 16, 21].                                          rence in recorded. We can use the weighted collection of
                                                                  tags in a tag cloud as this bag of words, with tags serving
3.2    Reducing the Dimensionality                                as tokenized words.
   There exist some techniques (such as [8]) to determine            There are a few different ways of generating topic models;
semantic clustering within a tag cloud; however, these sys-       for our feature generation we will be using latent Dirichlet
tems are built to facilitate browsing and do not create a         allocation [11], treating each tag cloud as a bag-of-words.
sufficiently reduced dimensional representation. The pre-         In LDA, documents (in our case tags clouds of songs) are
vious work of [24] comes the closest to the needed dimen-         represented as a mixture of implied (or latent) topics, where
sional reduction, also dealing with social tags for music. This   each topic can be described as a distribution of words (or
work, through the use of aspect models and latent seman-          here, tags).More formally give the hyper-parameter α, and
tic analysis, brings the dimensionality down into the hun-        the conditional multinomial parameter β, Equation 3.2 gives
dreds, while preserving meaning. This order of dimensions         the joint topic distribution θ, a set of N topics z and a set
is still too high to compute meaningful distance across multi-    of N tags w.
song playlists. A feature with dimensionally of the order 102
                                                                                                    N
would suffer from the curse of dimensionality [33]: because                                         Y
of its high dimensionality, any attempt to measure distance              p(θ, z, w|α, β) = p(θ|α)         p(zn |θ)p(wn |zn , β)   (1)
                                                                                                    n=1
becomes dominated by noise. However, a technique devel-
oped for improved modelling in text information retrieval,         In Figure 2 LDA is shown as a probabilistic graphical model.
topic models provide the reduced dimensional representation       In order to create topic models using LDA, we need to spec-
2                                                                 ify p(θ|α) and p(zn |θ). We estimate our parameters empir-
  e.g. http://flickr.com
3                                                                 ically from a given corpus of tag clouds. This estimation
  e.g. http://youtube.com
4                                                                 is done using variational EM as described in [11].This al-
  e.g. http://last.fm or http://musicbrainz.org
5
  This tag cloud is for the track Bohemian Crapsody by the        lows topic distributions to be generated in an unsupervised
artist Sickboy. The tags and the rendering both come from         fashion, though the number of topics in a corpus must be
last.fm, available at http://www.last.fm/music/Sickboy/           specified a priori.
_/Bohemian+Crapsody/+tags                                            Once the LDA model is generated, it is used to infer the
                                                                  complete playlist in a dataset. The distance between two
                                                                  playlists is then the minimum distance between any two
                  gather tags for all songs                       length i sub-vectors drawn from each playlist. One effect of
                                                                  this technique is easy handling of playlists of unequal length.
                                                                     This type of distance measurement has been used with
                                                                  success on sequences of audio frames [14, 15]. The distance
                                                                  measure in use between vectors can also be changed. In par-
                                                                  ticular there has been work showing that statistical features
                                                                  (such as topic models) may benefit from the use of Manhat-
               create LDA model describing                        tan distance [19], however for our prototypical evaluation we
                     topic distributions                          have used simple Euclidean distance as seen in equation ??
                                                                  above.

                                                                  5.    EVALUATION
                                                                    The goal of our evaluation is to show the fitness of our
                                                                  distance measurement through preliminary retrieval tests:
                 infer topic mixtures for all                     searching for playlists that start at the same time of day as
                            songs                                 our query playlist and searching for the playlists from the
                                                                  same station from a database of stations of the same genre.
                                                                  We examine the logs of a large collection of radio stations,
                                                                  exhaustively searching example sets. Through precision and
                                                                  recall we see that our measure organizes playlists in a pre-
                                                                  dictable and expected way.
                  create vector database                          5.1    Dataset
                        of playlists                                 In order to test these proposed techniques a collection
                                                                  of radio station logs were gathered. These logs come from
                                                                  a collection of broadcast and online stations gathered via
Figure 3: The complete process for construction of                Yes.com7 . The logs cover the songs played by all indexed
a TCTM feature set.                                               stations between 19-26 March 2010. For our evaluation task
                                                                  using this data source we looked at subsets of this com-
                                                                  plete capture, based on genre labels applied to these sta-
mixture of topics present in the tag cloud for a given song.      tions. Specifically we examine stations of the genres rock
This is done via variational inference which is shown in [11]     and jazz. The complete Yes.com dataset also includes sta-
to estimate the topic mixture of a document by iteratively        tions in the following genre categories: Christian, Country,
minimizing the KL divergence from variational distribution        Electronica, Hip-Hop, Latin, Metal, Pop, Punk, R&B/Soul,
of the latent variables and the true posterior p(θ, z|w, α, β).   Smooth Jazz and World. These labels are applied by the sta-
   This process in it’s entirety is shown as a block diagram      tions themselves and the categories are curated by Yes.com.
in Figure 3. Once this process is completed for every song        Additionally, the play logs from Radio Paradise8 from 1 Jan-
in our dataset, we will have a single vector with a dimen-        uary 2007 to 28 August 2008 form a second set. We then
sionality equal to the number of topics in our LDA whose          attempted to retrieve tag clouds from Last.fm9 for all songs
entries indicate topic occupancy for that song.                   in these logs. When tags were not found the song and its
                                                                  associated playlist were removed from our dataset
                                                                     These logs are then parsed into playlists. For the radio
4.    PLAYLISTS AS A SEQUENCE OF TOPIC                            logs retrieved via the Yes api, the top of every hour was used
      WEIGHTS                                                     as a segmentation point as a facsimile for the boundary be-
  Given the single vector per song reduction, we represent        tween distinct programs. This is done under the assumption
the playlists these song are in as ordered sequences of these     that program are more likely than not to start and finish
vectors. Thus each playlist is represented as a l×d-dimensional   on the hour in US commercial broadcast. Note that this
vector, where l is the number of songs in a given playlist and    method of boundary placement will almost certainly over-
d is the number of topics in our LDA model.                       segment radio programs as many radio programs are longer
                                                                  than one hour. However, given that our distance measure
4.1    Measuring Distance                                         compares fixed length song sequences across playlists, this
   To both manage and measure the distance between these          over-segmentation should produce only minimal distortion
li × d dimensional vectors we use audioDB6 . The use of           in our results. The Radio Paradise logs include all the links
audioDB to match vectors of this type is detailed in [32].        or breaks between songs where the presenter speaks briefly.
Briefly, distance is calculated by means of a multidimen-         For experiments using the Radio Paradise logs these links are
sional Euclidian measure. Here li is an arbitrary length sub-     used as playlist boundaries. This leads to a slight difference
sequence of i vectors. In practice, i is Casey:2008selected to    in the type of playlist used from Radio Paradise versus Yes.
be less than or equal to the smallest sequence length for a       7
                                                                    http://api.yes.com
6                                                                 8
 source and binary available at http://omras2.doc.gold.             http://www.radioparadise.com/
                                                                  9
ac.uk/software/audiodb/                                             http://last.fm
                              source                 St      Smt    Pt                                             Pavg(time)        Pavg(songs)
                             whole set            885810    2543   70190                                                 55min           12.62
                             “Rock” stations      105952     865   9414                                                  53min           11.25
                             “Jazz” stations      36593     1092   3787                                                  55min            9.66
                             “Radio Paradise”     195691    2246   45284                                                 16min            4.32


Table 1: Basic statistics for both the radio log datasets. Symbols are as follows: St is the total number of
song entries found in the dataset; Smt is the total number of songs in St where tags could not be found; Pt is
total number of playlists; Pavg(time) is the average runtime of these playlists and Pavg(songs) is the mean number
of songs per playlist.


The playlists coming from Radio Paradise represent strings         hypothesis, perhaps due to progaming with no reliance on
of continuously played songs, with no breaks between the           time of day, at least in the case of Radio Paradise.
songs in the playlists. The playlists from Yes are approxima-
tions of a complete radio program and can therefore contain
                                                                                                                            Playlist beginning at midnight on 1 January 2007
some material inserted between songs (e.g. presenter link,                                                       25000
commercials).
   Statistics for our dataset can be see in Table 1 we then                                                      20000


                                                                         delta between start times, in seconds
use the tags clouds for these songs to estimate LDA topic
models as described in Section 310 . For all our experiments
we specify 10 topic models a priori. The five most relevant                                                      15000
tags in each of the topics in models trained on both the rock
and jazz stations can be seen Table 2.                                                                           10000

5.2    Daily Patterns
   Our first evaluation looks at the difference between the                                                       5000
time of day a given query playlist starts and the start time
for the closest n playlists by our measure. For this evaluation                                                     00             50              100           150           200
we looked at the 18 month log from Radio Paradise as well as                                                                                  result position
the “Rock” and “jazz” labelled stations from Yes.com, each
in turn. Further we used a twelve hour clock to account for
                                                                   Figure 5: The time of day difference from the query
The basis for this test relies on the hypothesis that for much
                                                                   playlist for 200 returned results, showing even time
commercial radio content in the United States, branding of
                                                                   of day spread. Note that all the results show here
programs is based on daily repeatable of tone and content
                                                                   have a distance of 0 from the query.
for a given time of day. It should therefore be expected
that playlists with similar contours would occur at similar
times of day across stations competing for similar markets
of listeners.
                                                                   5.3                                            Inter-station vs. Intra-station
   Figure 4 shows the mean across all query playlists of the          In this evaluation we examined the precision and recall
time difference for each result position for the closest n re-     of retrieving playlists from the same station as the query
sults, where n is 200 for the Radio Paradise set and 100           playlist. Here we looked at the “Rock” and “Jazz” labelled
for the Yes.com set. The mean time difference across all           stations retrieved via the Yes API, each in turn. Similar to
three sets is basically flat, with an average time difference      the first task, it is expected that a given station will have
of just below 11000 or about three hours. Given the max-           its own tone or particular feel that should lead to playlists
imum difference of 12 hours, this result is entirely the op-       from that station being more apt to match playlist from
posite of compelling, with the retrieved results showing no        their generating station then with other stations from the
corespondance to time of day. Further investigation is re-         same genre. More formally, for each query we treat returned
quired to determine whether this is a failure of the distance      playlists as relevant, true positives when they come from
metric or simply an accurate portrail of the radio stations        the same station as the query playlist and false positives
logs. A deeper examination of some of the Yes.com data             otherwise. Based on this relevance assumption, precision
shows some evidence of the latter case. Many of the playlist       and recall can be calculated using the following standard
queries exactly match (distance of 0) with the entirity of the     equations.
200 returned results. Further these exact match playlists are
                                                                                                   T
                                                                             |{relevantplaylists} {retrievedplaylists}|
repeated evenly throughout the day. One of these queries is             P =                                                  (2)
                                                                                         |{retrievedplaylists}|
shown in Figure 5. The existance of these repeating playlists
throughout the day, ensures this task will not confirm our                                                                               T
                                                                                                                     |{relevantplaylists} {retrievedplaylists}|
10                                                                         R=                                                                                                        (3)
 Our topic models are created using the open source imple-                                                                      |{relevantplaylists}|
mentation of LDA found in the gensim python package avail-
able at http://nlp.fi.muni.cz/projekty/gensim/ which               The precision versus recall for a selection of stations’ playlists
in turn is based on Blei’s C implementation available at           from both the “Rock” and “Jazz” stations are show in Figure
http://www.cs.princeton.edu/~blei/lda-c/                           6. When considering the precision and recall performance it
 station label   t1                    t2                     t3                        t4                        t5
                 Snow Patrol           Bob Marley             female vocalists          aupa Pete                 80s
                 rumba                 Feist                  Anna Nalick               whistling                 new wave
 “Rock”          90s                   john mayer             Chicas                    Triple J Hottest 100      david bowie
                 green day             drunk love             playlist 2009             review                    neuentd
                 Dynamit               feist backing vocals   Sarah McLachlan           fun as fuck               synth pop
                 motown                john mayer             60s                       Sade                      Flamenco
                 soul                  acoustic               jazz - sax                deserves another listen   tactile smooth jazz
 “Jazz”          70s                   corinne bailey rae     acid jazz                 till you come to me       guitar ponder
                 funk                  bonnie raitt           reggae                    piano                     cafe mocha
                 Disco                 David Pack 2           cool jazz                 2010                      wine
 station label   t6                    t7                     t8                        t9                        t10
                 classic rock          TRB                    reminds me of winter      Needtobreathe             Krista Brickbauer
                 60s                   ElectronicaDance       kings of leon             plvaronaswow2009          day end
 “Rock”          70s                   mysterious             songs that save my life   The Script                i bought a toothbrush
                 The Beatles           best songs of 2009     songs to travel           brilliant music           bluegrass
                 the rolling stones    tribute to george      Muse                      van morrison              omg
                 follow-up             rnb                    female vocalists          classic rock              Smooth Jazz
                 jazz                  soul                   norah jones               80s                       saxophone
 “Jazz”          instrumental          female vocalists       dido                      rock                      smooth jazz sax
                 guitar                Neo-Soul               jazz                      70s                       contemporary jazz
                 latin jazz            Robin Thicke           vocal jazz                yacht rock                instrumental


Table 2: The five most relevant tags in each topic. Upper model is all the Yes.com Rock stations, lower
model is all Yes.com Jazz stations.


is useful to compare against random chance retrieval. There          cipal among these is the exploration of non-Euclidean dis-
are 100 stations labeled “Rock” and 48 labeled “Jazz”. Under         tance measures. Manhattan distance (or L1 ) seems to have
chance retrieval a precision of 0.01 would be seen for “Rock”        the most direct applicability and its use could prove to be
and 0.0208 for “Jazz”.                                               quite beneficial. Another area for future work is in the use
                                                                     of the measure on further data and datasets. One of the
5.4    Summary                                                       best ways to improve here would be in the use of datasets
  Two different evaluation tasks have been run using real            with a more exact known ground truth, in order to best ap-
world radio log data to examine the usefulness of our playlist       ply known recommender and retrieval evaluation methods
match technique. The first of these, an examination the              to them.
time difference was flat across result length variance. While          This leads to a further avenue of future work, testing the
this implies lack of discrimination into daily patterns, it is       measure against direct human evaluation. While our match-
not possible to determine from the available data whether            ing technique has many uses with recommendation and dis-
this is an accurate reflection of the progamming within the          covery, if it proved to align with human evaluation it would
dataset or distance measure not being sufficient for the task.       be considerably more useful.
The second task shows the performance of retrieving hourly
playlists from a selection of stations using playlists from that     7.    ACKNOWLEDGMENTS
station as a query. Here we see a great deal of promise,
especially when comparing the query results against random              This work is supported in part by the Engineering and
chance, which it outperforms considerably.                           Physical Sciences Research Council via the Online Music
                                                                     Recognition And Searching II (OMRAS2) project, refer-
                                                                     ence number EP/E02274X/1. Additional support provided
6.    CONCLUSIONS                                                    as part of the Networked Environments for Music Analysis
   Having reviewed recent work in various methods of playlist        (NEMA) project, funded by The Andrew W. Mellon Foun-
generation and evaluation in Section 2, it is apparent that          dation. Thanks also to Paul Lamere for some dataset acqui-
there is a need for better ways to objectively compare playlists     sition assistance.
to one another. We detailed a method of doing so in Section
4, though first, to better filter content-based data through
listeners’ experience we presented a novel tag-based feature,        8.    REFERENCES
TMTC, using tags summarized using LDA topic models in                 [1] J. A. Ahlkvist and R. Faulkner. ‘Will This Record
Section 3. This was follow by two task evaluations to exam-               Work for Us?’: Managing Music Formats in
ine out playlist matching technique and song feature on real              Commercial Radio. Qualitative Sociology,
world playlist data from radio logs in Section 5.                         25(2):189–215, June 2002.
   While our evaluation shows the promise of this technique           [2] J.-J. Aucouturier and F. Pachet. Scaling up playlist
on sampled data, there is much room for improvement. Prin-                generation. In Proc. IEEE International Conference
                                                              Radio Paradise, Start time delta v. result order                                                         YES.com Rock stations, Start time delta v. result order                                                     YES.com Jazz stations, Start time delta v. result order

                                                   14000                                                                                                       14000                                                                                                      14000


           delta between start times, in seconds


                                                                                                                       delta between start times, in seconds


                                                                                                                                                                                                                                  delta between start times, in seconds
                                                   12000                                                                                                       12000                                                                                                      12000


                                                   10000                                                                                                       10000                                                                                                      10000


                                                   8000                                                                                                        8000                                                                                                       8000


                                                   6000                                                                                                        6000                                                                                                       6000

                                                        0                50          100            150          200                                                0           20         40             60     80         100                                                0            20         40             60      80         100
                                                                                result position                                                                                             result position                                                                                             result position


                                                               Figure 4: The mean start time difference, with squared error of the mean.


                                                                         WAPS - Akron, OH - Rock                                                                             WNDT - Gainesville - Ocala, FL - Rock                                                                 WHRL - Albany - Schenectady - Troy, NY - Rock
                                                     1.0                                                                                                         1.0                                                                                                        1.0

                                                     0.8                                                                                                         0.8                                                                                                        0.8

                                                     0.6                                                                                                         0.6                                                                                                        0.6
                             precision


                                                                                                                                         precision


                                                                                                                                                                                                                                                    precision
                                                     0.4                                                                                                         0.4                                                                                                        0.4

                                                     0.2                                                                                                         0.2                                                                                                        0.2

                                                     0.00.0        0.2         0.4            0.6     0.8        1.0                                             0.00.0         0.2        0.4            0.6    0.8        1.0                                             0.00.0          0.2        0.4            0.6    0.8         1.0
                                                                                   recall                                                                                                    recall                                                                                                    recall
                                                                         WCLK - Atlanta,  GA - Jazz                                                                               WEAA - Baltimore, MD - Jazz                                                                            WJZX - Milwaukee - Racine, WI - Jazz
                                                     1.0                                                                                                         1.0                                                                                                        1.0

                                                     0.8                                                                                                         0.8                                                                                                        0.8

                                                     0.6                                                                                                         0.6                                                                                                        0.6
                             precision


                                                                                                                                         precision


                                                                                                                                                                                                                                                    precision
                                                     0.4                                                                                                         0.4                                                                                                        0.4

                                                     0.2                                                                                                         0.2                                                                                                        0.2

                                                     0.00.0        0.2         0.4            0.6     0.8        1.0                                             0.00.0         0.2        0.4            0.6    0.8        1.0                                             0.00.0          0.2        0.4            0.6    0.8         1.0
                                                                                     recall                                                                                                      recall                                                                                                      recall


Figure 6: Precision versus Recall for six stations when using their hourly playlists to query for other playlists
from the same station. In each query the number of results retrieved is selected to maximize the F1 score.


     on Multimedia and Expo, 2002.                                                                                                                                                                [7] L. Barrington, D. Turnbull, and G. Lanckriet.
 [3] J.-J. Aucouturier and E. Pampalk. Introduction-from                                                                                                                                              Auto-tagging music content with semantic
     genres to tags: A little epistemology of music                                                                                                                                                   multinomials. In Proc. of Int. Conference on Music
     information retrieval research. Journal of New Music                                                                                                                                             Information Retrieval, 2008.
     Research, 37(2):87–92, 2008.                                                                                                                                                                 [8] G. Begelman, P. Keller, and F. Smadja. Automated
 [4] P. Avesani, P. Massa, M. Nori, and A. Susi.                                                                                                                                                      Tag Clustering: Improving search and exploration in
     Collaborative radio community. In AH ’02:                                                                                                                                                        the tag space. In In Proc. of the Collaborative Web
     Proceedings of the Second International Conference on                                                                                                                                            Tagging Workshop at WWW’06, 2006.
     Adaptive Hypermedia and Adaptive Web-Based                                                                                                                                                   [9] T. Bertin-Mahieux, D. Eck, F. Maillet, and P. Lamere.
     Systems, pages 462–465. Springer Berlin / Heideberg,                                                                                                                                             Autotagger: a model for predicting social tags from
     January 2002.                                                                                                                                                                                    acoustic features on large music databases. Journal of
 [5] C. Baccigalupo. Poolcasting: an intelligent technique                                                                                                                                            New Music Research, 37(2):101–121, 2008.
     to customise music programmes for their audience.                                                                                                                                           [10] D. Blei and J. Lafferty. Topic Models. Text Mining:
     PhD thesis, Institut d’Investigació en Intelligència                                                                                                                                           Theory and Applications. Taylor and Francis, 2009.
     Artificial, 2009.                                                                                                                                                                           [11] D. M. Blei, A. Ng, and M. Jordan. Latent dirichlet
 [6] C. Baccigalupo and E. Plaza. Sharing and combining                                                                                                                                               allocation. Journal of Machine Learning Research,
     listening experience: A social approach to web radio.                                                                                                                                            3:993–1022, 2003.
     In Proc. of the International Computer Music                                                                                                                                                [12] K. Bosteels, E. Pampalk, and E. E. Kerre. Evaluating
     Confernce, 2007.                                                                                                                                                                                 and analysing dynamic playlist generation heuristics
     using radio logs and fuzzy set theory. In Proc. of Int.        2008.
     Conference on Music Information Retrieval, October        [24] M. Levy and M. Sandler. Learning latent semantic
     2009.                                                          models for music from social tags. Journal of New
[13] B. Brewster and F. Broughton. Last Night A DJ Saved            Music Research, 37(2):137 – 150, 2008.
     My Life; The history of the disc jockey. Headline Book    [25] K. Liu and R. A. Reimer. Social playlist: enabling
     Publishing, London, United Kingdom, 2nd edition,               touch points and enriching ongoing relationships
     2006.                                                          through collaborative mobile music listening. In
[14] M. Casey, C. Rhodes, and M. Slaney. Analysis of                MobileHCI ’08: Proceedings of the 10th international
     minimum distances in high-dimensional musical                  conference on Human computer interaction with
     spaces. Audio, Speech, and Language Processing,                mobile devices and services, pages 403–406, New York,
     IEEE Transactions on, 16(5):1015 –1028, jul. 2008.             NY, USA, 2008. ACM.
[15] M. Casey and M. Slaney. The importance of sequences       [26] F. Maillet, D. Eck, G. Desjardins, and P. Lamere.
     in music similarity. In Proc. IEEE Int. Conf. on               Steerable playlist generation by learning song
     Acoustics, Speech, Signal Processing, Toulouse, France,        similarity from radio station playlists. In Proc. of Int.
     2006.                                                          Conference on Music Information Retrieval, October
[16] D. Eck, P. Lamere, T. Bertin-Mahieux, and S. Green.            2009.
     Automatic generation of social tags for music             [27] K. O’Hara, M. Lipson, M. Jansen, A. Unger,
     recommendation. In Neural Information Processing               H. Jeffries, and P. Macer. Jukola: democratic music
     Systems Conference (NIPS) 20, 2007.                            choice in a public space. In DIS ’04: Proceedings of
[17] B. Fields, K. Jacobson, C. Rhodes, and M. Casey.               the 5th conference on Designing interactive systems,
     Social playlists and bottleneck measurements :                 pages 145–154, New York, NY, USA, 2004. ACM.
     Exploiting musician social graphs using content-based     [28] E. Pampalk, T. Pohle, and G. Widmer. Dynamic
     dissimilarity and pairwise maximum flow values. In             playlist generation based on skipping behavior. In
     Proc. of Int. Symposium on Music Information                   Proc. of Int. Symposium on Music Information
     Retrieval, September 2008.                                     Retrieval, 2005.
[18] A. Flexer, D. Schnitzer, M. Gasser, and G. Widmer.        [29] S. Pauws and B. Eggen. Pats: Realization and user
     Playlist generation using start and end songs. In Proc.        evaluation of an automatic playlist generator. In Proc.
     of Int. Symposium on Music Information Retrieval,              of Int. Conference on Music Information Retrieval,
     October 2008.                                                  2002.
[19] K. Grauman and T. Darrell. Fast contour matching          [30] J. C. Platt, C. J. Burges, S. Swenson, C. Weare, and
     using approximate earth mover’s distance. In Proc. of          A. Zheng. Learning a gaussian process prior for
     the IEEE Computer Society Conference on Computer               automatically generating music playlists. In Proc.
     Vision and Pattern Recognition, 2004.                          Advances in Neural Information Processing Systems,
[20] C. Hayes and P. Cunningham. Smart radio: Building              volume 14, pages 1425–1432, 2002.
     music radio on the fly. In Expert Systems 2000, pages     [31] R. Ragno, C. Burges, and C. Herley. Inferring
     2–6. ACM Press, 2000.                                          similarity between music objects with application to
[21] M. D. Hoffman, D. M. Blei, and P. R. Cook. Easy as             playlist generation. In Proc. 7th ACM SIGMM
     cba: a simple probabilistic model for tagging music. In        international workshop on Multimedia information
     Proc. of Int. Conference on Music Information                  retrieval, 2005.
     Retrieval, 2009.                                          [32] C. Rhodes, T. Crawford, M. Casey, and M. d’Inverno.
[22] P. Knees, T. Pohle, M. Schedl, and G. Widmer.                  Investigating music collections at different scales with
     Combining audio-based similarity with web-based data           audiodb. Journal of New Music Research, to appear,
     to accelerate automatic music playlist generation. In          2010.
     Proc. 8th ACM international workshop on Multimedia        [33] R. Weber, H. J. Schek, and S. Blott. A quantitative
     information retrieval, pages 147 – 154, 2006.                  analysis and performance study for similarity-search
[23] P. Lamere. Social tagging and music information                methods in high-dimensional spaces. In Proc. of the
     retrieval. Journal of New Music Research, 37(2), June          Intl. Conf. on Very Large Databases, 1998.

</pre>