=Paper= {{Paper |id=None |storemode=property |title=Music Recommendation in the Personal Long Tail: Using a Social-based Analysis of a User’s Long-Tailed Listening Behavior |pdfUrl=https://ceur-ws.org/Vol-633/wom2010_paper9.pdf |volume=Vol-633 }} ==Music Recommendation in the Personal Long Tail: Using a Social-based Analysis of a User’s Long-Tailed Listening Behavior== https://ceur-ws.org/Vol-633/wom2010_paper9.pdf
Music Recommendation in the Personal Long Tail:
Using a Social-based Analysis of a User’s Long-Tailed Listening Behavior

Kibeom Lee
Graduate School of Culture Technology, KAIST
Daejeon, Korea
kiblee@kaist.ac.kr

Woon Seung Yeo
Graduate School of Culture Technology, KAIST
Daejeon, Korea
woon@kaist.ac.kr

Kyogu Lee
Department of Digital Contents Convergence, Seoul National University
Seoul, Korea
kglee@snu.ac.kr



ABSTRACT
The online music industry has been growing at a fast pace, especially in recent years. Even music sales have moved from physical sales to digital sales, paving the way for millions of digital tracks to become available to all users. However, this produces information overload: because there are virtually no storage limitations, so many items are available that it becomes difficult for users to find what they are looking for. There have been many approaches to recommending music to users to tackle information overload. One successful approach is collaborative filtering, which is currently widely used in commercial services. Although collaborative filtering produces very satisfying results, it is prone to popularity bias, recommending items that are correct recommendations but quite "obvious". In this paper, a new recommendation algorithm is proposed that is based on collaborative filtering and focuses on producing novel recommendations. The algorithm produces novel, yet relevant, recommendations to users based on analyzing the users' and the entire population's listening behaviors. An online user test shows that the system is able to produce relevant and novel recommendations and has greater potential with some minor adjustments in parameters.

Categories and Subject Descriptors
I.1.2 [Computing Methodologies]: Algorithms – Nonalgebraic algorithms, analysis of algorithms

General Terms
Algorithms

Keywords
Recommender systems, collaborative filtering, music recommendation

WOMRAD 2010 Workshop on Music Recommendation and Discovery, colocated with ACM RecSys 2010 (Barcelona, SPAIN)
Copyright (c). This is an open-access article distributed under the terms of the Creative Commons Attribution License 3.0 Unported, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

1. INTRODUCTION
With advances in the Internet, lower hardware costs, increasing peer-to-peer networks, and the popularity of high-storage portable media players, the online music industry has been growing rapidly, especially during the past few years. Gradually, music sales have moved from physical album sales to digital sales from online stores. Currently, these services offer millions of tracks to users, and their catalogs have grown rapidly compared to their size when the services were first announced. For instance, Amazon offered over 2 million songs to users when the music service launched, but offers over 11.8 million songs as of 2010. Some notable online music stores, including Amazon, are Amazon MP3 (11,000,000+ songs), iTunes Store (12,000,000+ songs) and Rhapsody (9,000,000+ songs). Apart from music stores, there are also music streaming services that offer millions of songs, such as Lala (8,000,000 songs), Spotify (8,000,000 songs), and Last.fm (7,000,000 songs).

These large numbers of songs available to users are a result of the Long Tail business model [1], in contrast to the earlier model in which only products that were in demand were sold in stores. Paradoxically, however, users have ended up listening to less music now that so much is available, simply because it is hard to find new and relevant music. For instance, digital track sales surpassed the 1 billion sales mark in 2008. However, the Top 200 digital tracks alone accounted for 17% of the entire track sales (184 million sales) [2].

2. RELATED WORK
2.1 Collaborative Filtering-based Recommender Systems
One of the earliest recommender systems based on collaborative filtering is Tapestry [3]. Stemming from the need to handle increasing numbers of emails, Tapestry used explicit opinions of people in a relatively small group, such as an office workgroup, to filter out incoming email for a given user. However, a drawback of this system was that users had to be familiar with the preferences and opinions of other people in their network, which is why Tapestry worked on small networks like the office.

A more general collaborative filtering approach, called GroupLens, was developed by Resnick et al. [4]. The basic idea behind GroupLens, which aimed to help users find news articles amongst the vast available numbers, was that "people who agreed in the past will probably agree again". Using this heuristic, the GroupLens system was able to predict the ratings of certain news articles by a given user. An advantage that this provided was that the collaborative filtering could be scaled, unlike Tapestry, because a user was not required to actually know other users that had similar preferences to him. This was done by the system, which gathered information on the ratings of users, naturally
creating another advantage of users being anonymous inside the whole system.

Research related to, and including, the above studies focused on filtering a vast amount of text, in the form of emails, news, and messages, down to the items that were worth reading. Items would be given to the user with their prediction scores, aiding the user in deciding which item to read next. The next wave of studies focused on a more direct approach to recommending items.

Ringo was a system developed to provide personalized music recommendations [5]. It maintained a user's profile, a history of ratings on various artists that were essentially explicit labelings of which artists the user does or does not enjoy listening to. These profiles were matched by the system to calculate recommendations on which artists had the highest probabilities of being liked by the user.

While Ringo was focused on music items, Bellcore's recommender system focused on movies [6]. Like Ringo, it used a database of movie ratings by users and matched rating profiles to provide recommendations by finding similar users and the movies that they had watched and rated positively. Tests on the reliability of the recommender system showed that three out of every four recommendations would be rated highly by the user, and also showed that the system produced far more accurate recommendations than nationally-known movie critics.

While there have been numerous advances and algorithms related to collaborative filtering since then, the most well-known collaborative filtering system today is probably the system used by Amazon.com, an electronic commerce company that sells books, movies, music, etc. Amazon.com offers recommendations on items that are similar to the item being purchased, rather than finding similar users and then recommending the items those users have purchased. This method, which is called item-to-item collaborative filtering, scales to extremely large datasets and generates satisfactory results.

2.2 Collaborative Filtering-based Recommender Systems for Music
Although the collaborative filtering-based approaches above were designed for specific items, the algorithms can be generalized and applied to music recommendation. Hence, the results of such algorithms applied to music are not much different from those obtained on the original items.

Apart from recommender systems that use data on the ratings and/or purchases of items, there are other collaborative filtering-based recommender systems that take advantage of metadata about music that is produced by users.

[7] presents some examples of metadata used in such algorithms, which include reviews, lyrics, blogs, social tags, bios, and playlists. Examples of commercial services that use such approaches are Rate Your Music (reviews), The Hype Machine (blogs), last.fm (social tags), and playlist.com (playlists).

Social tags, a representative product of online collaboration, have been used heavily in music recommendation systems. Hu and Downie explored the mood metadata associated with songs and their relationships with music genre, artist, and usage metadata [8]. They found that the genre-mood relationships and artist-mood relationships showed consistencies, showing the potential of being utilized in automated mood classification tasks. Eck et al. proposed a method for generating social tags for music that lacks such tags [9]. Audio features of songs were analyzed and mapped to tags, using a set of boosted classifiers. These were then utilized on untagged songs, populating them with the associated social tags depending on the musical content. This enables unpopular songs and/or new songs that have no social tags to be used in music recommenders that use a social algorithm. It also tackles the cold start problem, a problem found in collaborative filtering-based recommender systems. Symeonidis et al. analyzed social tags in order to tackle the problem of the multimodal use of music [10]. They developed a framework that modeled users, tags, and items, altogether. This was then used in recommending musical items (artists, songs, and albums) to users by performing latent semantic analysis and dimensionality reduction according to each user's multimodal perception of music. Levy and Sandler inspect the seemingly ad hoc and informal language of tagging as a high-volume source of semantic metadata for music. Results show that tags establish a low-dimensional semantic space, being extremely polished at the track level, especially by artist and genre. Using these results, the authors also introduce an interface for users to browse by mood, through a two-dimensional subspace that represents musical emotion.

Celma introduces a system that recommends music and the relevant information associated with the recommended music [11]. The proposed system uses the Friend of a Friend (FOAF) and RSS vocabularies for creating recommendations, taking into consideration the user's musical tastes and listening habits. The FOAF project provides protocols and a language to describe homepage-like content and social networks, ultimately providing the proposed system with the user's profile. The RSS vocabulary provides the system with syndicated content, which includes data such as new album releases, album reviews, podcast sessions, upcoming gigs, etc. Thus, the proposed system improves the existing recommendation systems by understanding the users through psychological factors (personality, demographic preferences, socioeconomics, situation, social relationships) and explicit music preferences.

3. LIMITATIONS OF COLLABORATIVE FILTERING
3.1 Popularity Bias
Collaborative filtering-based recommender systems produce good results and are used widely in commercial services such as Amazon.com and Last.fm. However, collaborative filtering has some common limitations that occur naturally due to its roots lying in the wisdom of crowds. One of the largest problems of collaborative filtering is popularity bias [12, 13].

This happens when a popular item is associated with many other related items. Users that interact with these items are then recommended the popular item. The system recommends the popular item often, leading to item purchases (or any other form of positive input from the user), and as this item is purchased more, it is also recommended more. This loop, in which the rich become richer, creates popularity bias.

Naturally, as a result of the above feedback loop, the recommender system tends to bias its recommendations towards popular items. Thus, the recommendations lose their novelty [12, 13], and it becomes extremely difficult to recommend lesser-known artists.
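
The rich-get-richer loop described in Section 3.1 can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions that are not part of the paper: every item starts with one purchase, each step recommends exactly one item with probability proportional to its purchase count so far, and every recommendation leads to a purchase. The function name `simulate_feedback_loop` and its parameters are hypothetical.

```python
import random


def simulate_feedback_loop(n_items=50, steps=5000, seed=0):
    """Toy popularity feedback loop (assumed model, not the paper's system).

    Each step recommends one item with probability proportional to its
    purchase count so far; the recommended item is then purchased.
    """
    rng = random.Random(seed)
    purchases = [1] * n_items  # assumption: all items start equally popular
    for _ in range(steps):
        item = rng.choices(range(n_items), weights=purchases, k=1)[0]
        purchases[item] += 1   # a purchase makes the item more likely next time
    return purchases


purchases = simulate_feedback_loop()
top_share = max(purchases) / sum(purchases)
uniform_share = 1 / len(purchases)
print(f"top item share: {top_share:.1%} (uniform would be {uniform_share:.1%})")
```

Even though all items start out equal in this sketch, early random advantages are reinforced, so a few items tend to end up with a disproportionate share of purchases.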
In Amazon.com, in which collaborative filtering is heavily used, the popularity bias can be seen when viewing the recommendations that are offered when searching for popular items. For instance, the 98 recommendations that appear when searching for Harry Potter include The Da Vinci Code, To Kill a Mockingbird and 28 other Harry Potter books and DVDs. In the case of music, searching for The Beatles' Revolver album results in 33 albums from The Beatles, out of a total of 97 recommendations, as shown in Figure 1. The other recommended items show well-known artists that users who are interested in The Beatles will most likely have heard of already, such as The Rolling Stones, Led Zeppelin, and Neil Young. These recommended artists are correct recommendations but fail to be novel recommendations.

Figure 1. Recommendations from Amazon.com, which are all quite "obvious" recommendations, although they are correct recommendations.

Due to this popularity bias, a large portion of the recommended items result in obvious recommendations that may be relevant to easy-going, casual listeners, but not so helpful for enthusiastic music listeners, who have a high probability of already being knowledgeable on the artists being recommended.

The number of high quality, or "correct", recommended items that are produced with collaborative filtering is verified by [14]. However, the problem of popularity bias was also verified, as the amount of novel recommendations given to a user was the lowest for collaborative filtering in an experiment comparing collaborative filtering, content-based, and hybrid methods [14]. Thus, it was confirmed that collaborative filtering results in a lower percentage of novel songs, but of higher quality.

4. ALGORITHM
In this section, we provide an algorithm that is based on collaborative filtering (CF), yet overcomes popularity bias, a natural problem that arises from CF. Also, the algorithm focuses on providing recommendations that are novel to the user, while also remaining relevant.

To implement this algorithm, user data from Last.fm, an Internet service that provides users with streaming music via radio stations, was used. Reasons for selecting Last.fm were the readily available developer API, the large and varied amount of available data (such as user playlists, playcounts for artists and individual songs, artist information, and song information), and, most importantly, the worldwide popularity of the site.

4.1 Concept of Recommendation Algorithm
4.1.1 Changing Perspectives on Novel Recommendations
While the goal of recommenders in general is to provide recommendations that are novel and relevant to the user, as stated earlier, social-based recommendations, although relevant, fail to provide novel recommendations to users. In contrast, content-based recommender systems work better in providing novel recommendations because they are not affected by popularity or any other social influence [15].

Another method to provide novel recommendations to users is to use the long tail popularity distribution of the artists [7]. This idea can be applied to both content-based and social-based algorithms. Content-based algorithms can use the long tail distribution to recommend items that are similar according to content analysis and that are also found in the tail portion of the distribution. For social-based algorithms, or collaborative filtering, the idea can be applied by first obtaining the full list of recommendations and then removing the recommendations that lie in the head portion of the distribution. This would result in recommendations being novel to the user, since it is unlikely that artists residing in the tail portion of the distribution are known to the user.

However, although strictly recommending artists from the long tail and avoiding those that are obvious (those that are located in the head portion of the distribution) has a high probability of producing novel recommendations, we need to take into consideration that novelty is relative to the user. In other words, it is naive to assume that the user will be aware of certain artists just because they are in the head portion of the long tail distribution. Thus, the fact that even popular artists have a possibility of being novel recommendations to certain users must not be overlooked.

4.1.2 User Listening Behavior
As shown in Figure 2, which shows a random Last.fm user's playlist in descending order of playcount, the listening behavior follows a distribution that is similar to a long-tail distribution. Users tend to listen to an extremely small portion of their playlists often while the remaining songs seldom get played. Because the available data is limited to the top 500 played songs in the user's playlist, all of the songs in the graph have been played at least once.

Figure 2. The listening behavior of a user and his/her entire playlist. Although not exact, the graph shows a long-tailed distribution where the majority of tracks are seldom played.

4.1.3 Defining Experts and Novices
Using this long-tailed distribution of users' listening behaviors, the users can be divided into two groups: experts and novices. Here, users are considered "experts" regarding the songs/artists that they listen to often, i.e. songs/artists that lie in the head portion of the long-tailed listening behavior. On the other hand, users are considered "novices" regarding the songs that reside in the tail portion.
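
The expert/novice split of Section 4.1.3 can be sketched as follows. This is a minimal illustration, assuming that the "head" is simply the `head_size` most-played songs of a user and that the "tail" is everything else; the function name `split_head_tail`, the `head_size` parameter, and the sample playcounts are hypothetical and not taken from the paper.

```python
from typing import Dict, List, Tuple


def split_head_tail(playcounts: Dict[str, int],
                    head_size: int = 10) -> Tuple[List[str], List[str]]:
    """Split one user's songs into a head (expert) and tail (novice) part.

    Songs are ranked by playcount in descending order; the first
    `head_size` songs form the head and the remaining songs form the tail.
    """
    ranked = sorted(playcounts, key=playcounts.get, reverse=True)
    return ranked[:head_size], ranked[head_size:]


# Made-up playcounts for a single user.
playcounts = {"song A": 120, "song B": 95, "song C": 3, "song D": 1}
head, tail = split_head_tail(playcounts, head_size=2)
print("expert on:", head)
print("novice on:", tail)
```

Under this assumption a user is an expert on `head` and a novice on `tail`; where exactly to place the cut is a tunable choice.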
4.1.4 The Mystery of Unpopular “Loved” Songs
Last.fm provides users with an option to mark songs "loved"
(Figure 3). This kind of feedback from users explicitly shows that
a user enjoys a particular song. One would expect that these
"loved" songs would all lie in the head portion of the listening
behavior distribution. However, these songs that are marked
"loved" can be found scattered throughout the entire distribution.
Here, a paradox can be found: Why are some songs marked
"loved" lying at the tail end of the playcount distribution? One
would assume that a "loved" song would have a high playcount,
but a quick inspection shows that this is not the case. Thus, an
assumption that is made here, a key assumption in this algorithm,
is that songs are marked "loved", yet remain in the tail, because
the user is unfamiliar with that song/artist/genre, i.e. is a novice,
but happened to stumble upon that particular song and liked it.
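
The observation above, "loved" songs sitting in the tail of a user's playcount distribution, can be checked with a few lines of code. The sketch below is an assumption-laden illustration: the `Track` record, the `loved_in_tail` function, the fixed `head_size` cut, and the sample playlist are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Track:
    artist: str
    title: str
    playcount: int
    loved: bool = False


def loved_in_tail(playlist: List[Track], head_size: int = 10) -> List[Track]:
    """Return the tracks marked "loved" that fall outside the top `head_size`
    tracks when the playlist is ranked by playcount (descending)."""
    ranked = sorted(playlist, key=lambda t: t.playcount, reverse=True)
    return [t for t in ranked[head_size:] if t.loved]


# Made-up playlist: one loved track in the head, one in the tail.
playlist = [
    Track("Artist 1", "Often played", 200, loved=True),
    Track("Artist 2", "Also often played", 150),
    Track("Artist 3", "Rarely played", 3, loved=True),
    Track("Artist 4", "Played once", 1),
]
for track in loved_in_tail(playlist, head_size=2):
    print(track.artist, "-", track.title, f"({track.playcount} plays)")
```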


Figure 3. The "tail" portion of a random user’s playlist. There are two songs marked "loved" by the user, but have only been played three times.

Among the 21,688 users whose data was used for the algorithm, 78.3%, or 16,973 users, used the "love" function provided in Last.fm. Among the 16,973 users who utilized the "love" function, 77.8% of the users had "loved" songs in the tail portion of their playlist's song distribution sorted by playcount.

Upon closer inspection of the random user in Figure 3, the songs/artists in the "head" portion came from various genres such as electronic, hip-hop, and reggae. What they did have in common, however, was that they were all German artists; the user herself is also German. Looking at the songs that were marked "loved" but were not played often, we can see that they too come from different genres, but are both by artists from the U.S.

The previously mentioned assumption that fuels this algorithm was made after observing such occurrences in users' playlists. Under our assumption, the user, who is German, is a novice in artists from the U.S. and stumbled across several songs that she liked. However, she did not get to venture into similar songs and/or artists because she was unaware of which artists/songs were similar.

4.1.5 The Big Picture
Once the basic assumptions are made and the new definitions of novices and experts are established, the concept of the recommendation algorithm can be explained. As shown in Figure 4, recommendations can be made to novices of certain song sets using the information that can be obtained from a group of experts that have those song sets in the head portion of their listening behavior distribution.

Figure 4. The overview of the algorithm showing the concept of novices and experts.

By using the listening behavior of experts to provide recommendations to novices, the recommended items will be novel to the user, in contrast to other recommendation systems that simply recommend artists/songs from the tail of the popularity distribution of items. In other words, while remaining novel to the specific user, the recommended items may or may not be in the far, unpopular end of the popularity distribution. In fact, even popular items that reside in the head of the popularity distribution may be recommended, but the user may not be aware of the recommended item since the recommendations were based on the tail portion of the user's listening behavior distribution, in which the user was considered a novice.

In addition to being novel recommendations, the recommended items will also be relevant to the user since the recommendations were found using songs that the user had marked "loved", explicitly stating the user's view on that particular item, and then using collaborative filtering to find experts on those "loved" songs to find relevant recommendations.

4.2 Data
User data was collected from early March to late April 2010 in order to test the algorithm and evaluate the results of the recommendations. Data was collected from the Last.fm website using a custom web crawler and the Last.fm API. The user data that was collected included the songs that the user had listened to overall, meaning the songs that the user listened to from the day he/she registered at Last.fm up until the day the data was collected. It also included the playcount for each song, song title, artist name, user ID, rank, and whether it was marked "loved" or not. The data that was collected is summarized below in Table 1.

Table 1. Summary of amount of data collected
Data                        Count
Users                       21,681
Unique Songs                2,001,324
Songs from All Playlists    9,073,681
4.2.1 Last.fm API                                                       strength of the match between the songs in the expert's "head" and
All the collected information, except the playlist history, was         song set S.
gathered via the Last.fm API. Although the algorithm could have
queried the information in real-time, it was decided that having          begin Recommendations REC (aGivenUser U1);
local data would facilitate quicker results. After fetching the            do
data, we had song titles and corresponding artist names of
approximately 2 million songs.                                               Result R1 := retrieveListeningBehaviorDistribution(U1);

In addition to the user and song data collected with the Last.fm             SongSet S1 := getSongsInLongTail(R1);
API, artist popularity was also measured indirectly via the API.             S1_loved := filterLovedSongs(S1);
Because the Last.fm API did not provide the artist ranking                   for i := 2 to n (n: number of users) step 1 do
directly through the API, we had to collect the number of
Listeners and Plays, which were offered through the API. By                     Result Ri := retrieveListeningBehaviorDistribution(Ui)
having the Listeners and Plays of a given artist, we would be able              SongSet Si := getSongsInHead(Ri);
to determine the overall ranking of popularity of the artists. This
                                                                                if (Si ∩ S1 ≠ ∅) do
will be further explained in the next section.
                                                                                  CandidateSongSet CSi := (Si ∪ S1) – (Si ∩ S1);
4.2.2 User Data Crawler
Unfortunately, the Last.fm API query for a given user's listening                 incrementWeight(CSi);
history returns the top 50 songs ordered by playcount. This was                   REC += CSi; od
not adequate since the algorithm needed the entire playlist
                                                                                od
in order to utilize the long tail of the playcount distribution.
                                                                             od
In order to solve this problem, a custom crawler was implemented
to collect the users' listening history (referred to as ‘playlist’ in        printRecommendations();
this paper) and playcount information. Although this returned a            end;
                                                                           Listing 1. Pseudoalgorithm for proposed recommender system.
maximum of 500 results (Last.fm displays only top 500 songs in
the playlist), the data was adequate to be divided into the short
head and long tail and used in the algorithm.                           These recommendation candidates are accumulated in the global
                                                                        song set REC, and the weights of the candidates are incremented as
Data on a total of 21,681 random users were crawled. The                they are recommended to REC. Finally, the recommendations are
playlists and the according information were also stored for each       given to the user in the order of their weights.
user, resulting in 21,681 playlists with a total of 9,073,681 songs.
Because playlists from different users contain lots of duplicate        4.4 Parameters
entries, the number of unique songs that were crawled, as stated        The algorithm is quite flexible as it has many parameters that can
above, was 2,001,324 unique songs.                                      be changed, which greatly influences the recommended items to
                                                                        the user. Parameters that play a crucial role in the overall quality
4.3 Algorithm                                                           of the recommendations include:
As shown in Listing 1, the user that will receive the                        •    The size of the “head” of experts
recommendations, whom we will call "novice" according to the                 •    The size of the “tail” of novices
algorithm's concept, is given as input to the algorithm. Then, the           •    Weights of recommended items
listening behavior for the novice is retrieved using data available
at Last.fm. As long as the user is not a new user and has been
listening to his/her playlist, the playcount distribution of his/her
                                                                        4.4.1 Expert Parameter
playlist is more than likely to show a long-tailed distribution, in     The parameter that influences the outcome most is the size of the
which a small set of songs have been listened with a heavily            "head" portion of the expert's listening behavior distribution. For
                                                                        example, if the value for this parameter is set to "10", a user is
biased frequency while the remaining songs listened only
occasionally. Since we are interested in the songs/artists that the     considered an expert only if the top ten songs that s/he listened to
given user is a novice on (i.e. songs marked “loved” in the long        contains any number of songs from the set of songs that are
tail), we discard the head portion of the distribution and from the     marked "loved" in the novice's "tail" portion of his/her listening
remaining songs, which are songs in the tail portion, we discard        distribution. In other words, this parameter determines the
                                                                        qualification strictness on which users are considered experts.
all songs except those that are explicitly labeled "loved" by the
novice. These remaining songs, denoted by ‘S_1’, will be the song       The lower the value, the harder it is for a given user to be
set that will be used to create recommendations.                        considered an expert. Also, as the value is lower, the resulting
Next, using the listening behavior of the other users from our          recommendations are more novel, in contrast to when the values
database, we find those that listen to the songs in song set S. In      are higher, in which the resulting recommendations become those
other words, we find the "experts" on song set S by finding users       that are well-known. As the value is set higher, the
                                                                        recommendations represent those that are from the existing music
that have a subset of song set S in the head portion of their
listening behavior distribution. If such users exist, we compare the    recommendations that are offered using traditional collaborative-
                                                                        filtering methods.
songs in the “head” of their playcount distribution with song set S
and use the remaining, non-overlapping songs as recommendation          4.4.2 Novice Parameter
candidates and assign the weight for those items according to the       The parameter that can be varied for the novice users is the size of
                                                                        the "tail" portion of the novice's listening behavior distribution.
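The procedure of Listing 1, with the expert head size and the novice tail
size exposed as parameters, can be sketched in Python as follows. This is a
minimal illustration and not the authors' implementation: the function names,
the dictionary-based playlist representation, and the toy data are all
hypothetical. The sketch assumes that the candidates are the songs in an
expert's head that are not in S_1, following the description in Section 4.3
(Listing 1 itself writes a symmetric difference), and that a candidate's
weight grows by one for each qualifying expert.

```python
from collections import Counter


def ranked_songs(playcounts):
    """Songs ordered by descending playcount (ties broken by name)."""
    return sorted(playcounts, key=lambda song: (-playcounts[song], song))


def head_songs(playcounts, head_size):
    """The `head_size` most played songs of a playlist."""
    return ranked_songs(playcounts)[:head_size]


def tail_songs(playcounts, tail_size):
    """The `tail_size` least played songs of a playlist."""
    ranked = ranked_songs(playcounts)
    return ranked[max(0, len(ranked) - tail_size):]


def recommend(novice, loved, others, expert_head_size, novice_tail_size, top_n):
    """Return up to `top_n` candidate songs, highest weight first."""
    # S_1: songs in the novice's tail that the novice marked as "loved".
    s1 = {s for s in tail_songs(novice, novice_tail_size) if s in loved}

    weights = Counter()
    for playcounts in others:
        s_i = set(head_songs(playcounts, expert_head_size))
        if s_i & s1:  # the user is an expert on part of S_1
            # Assumption: candidates are the expert's head songs outside S_1.
            for song in s_i - s1:
                weights[song] += 1  # one increment per qualifying expert

    ranked = sorted(weights, key=lambda song: (-weights[song], song))
    return ranked[:top_n]


# Hypothetical toy data: {song: playcount} per user.
novice = {"a": 50, "b": 40, "c": 3, "d": 2, "e": 1}
loved = {"c", "e"}
others = [
    {"c": 30, "x": 20, "y": 10, "z": 1},
    {"e": 9, "x": 8, "q": 7},
    {"y": 5, "z": 4, "c": 1},
    {"q": 12, "c": 11, "a": 2},
]

result = recommend(novice, loved, others,
                   expert_head_size=2, novice_tail_size=3, top_n=2)
print(result)
```

Raising `expert_head_size` in this sketch admits more users as experts and
more of their songs as candidates, which mirrors the trade-off between strict
and lenient expert qualification described in Section 4.4.1.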
In contrast to the expert parameter, the novice parameter sets the range of
songs in the user's playlist that the user is a novice on. Using loved songs
that lie near the "head" portion may result in songs that the user is aware
of, leading to the recommendations being less novel to the novice. However,
this parameter does not have as much influence as the expert parameter
because, once the novice parameter is set, the entire range of songs is not
used, but only those that are explicitly marked "loved" by the user.

4.4.3 Weights of Recommended Items
A formal set of rules and equations to assign weights to the recommended
items can greatly change the songs that will be presented to the user as
recommendations. This is important because it is inappropriate to present
the entire collection of songs that result from the algorithm, as the number
may vary depending on the two parameters above. Among the final song set
that contains hundreds of candidate songs for recommendations, only a
subset, namely the top N songs, is presented to the user. Thus, assigning
the appropriate weights for these candidates can ultimately influence the
outcome of the recommended items. Currently, the algorithm uses a simple
approach in which the weight is equal to the number of times a song is a
member of both the head of an expert and tail of the novice.

5. USER TEST & EVALUATION
There are many ways to evaluate a recommender system, both offline and
online. A common offline method to evaluate a recommender system is to
generate test sets to be evaluated later [16]. Another popular method is to
use cross-validation, in which the data is partitioned and used as test sets
[17].

5.1 Difficulties in Evaluating Novel Recommendations
However, offline evaluations are not appropriate for recommender systems
where the recommendations of novel items are important. This is because when
a truly novel item is actually recommended to a user, meaning that the user
does not already know about this item, it is extremely difficult for the
user to evaluate the unknown item without being provided any additional
information [18]. Because of this, measuring novelty in the recommended
items is a rather challenging task, leaving no option but to carry out live
user studies where the users explicitly indicate whether the provided
recommendations were novel or not [19].

Thus, in order to measure the novelty and relevance of the recommended
items, an online user test was carried out using a fully functional website,
including a section for explicit user feedback regarding the recommendations
given to the users.

5.2 Design
A fully functional website was created in order to perform an online
evaluation of the recommendations for random users. On the website, a user
had to sign up and enter his/her Last.fm ID. After receiving a new ID, the
server ran the recommendation algorithm on that particular Last.fm ID.
Meanwhile, the user was asked to come back shortly afterwards, while the
recommendations were being processed. The algorithm had to be run online in
real time because it depends heavily on the user information. Also,
pre-calculating the recommendations for users in the local database offline
and then providing them online was unrealistic, as the probability that a
new user would also be one that was pre-calculated was extremely low. When
the user returned, he/she was presented with two sets of recommendations.

Recommendation Set 1 was the result of the algorithm with the Expert
Parameter, the parameter that determines the size of the "head" portion of
the expert, set to 5. A value of 5 for the Expert Parameter means that the
algorithm is being very strict about which users are qualified to be
experts. This produces a high density of novel items. Recommendation Set 2
was the result with the Expert Parameter set to 10. A value of 10 tends to
mix novel recommendations and well-known recommendations, so it is more of a
general setting that aims to resemble recommendations from Last.fm. After
the user viewed the recommendations, a survey page was available to provide
explicit feedback on the quality of the recommendations given to them.

Figure 5. Screenshot of the recommended items at the user-test website. Each
facet of the recommended items is linked to pages at Last.fm for
supplementary information.

Since the goal of the algorithm is to provide novel recommendations, there
had to be an easy way for the user to evaluate the recommended items, since
it is assumed that if the recommended items are indeed novel, then the user
has no knowledge about the item. Thus, each recommended item was hyperlinked
to the corresponding page in Last.fm, as shown in Figure 5. Through these
links, users were able to evaluate the recommended items that were novel to
them by visiting the linked pages. Last.fm provides related information
regarding specific songs, which includes music videos, song previews, and
even a radio for the song's artist. By utilizing these pages, users were
able to listen to the songs that were recommended to them.

5.3 Survey
On the survey page, a set of five questions was given to the user for each
of the two sets of recommendation results that were produced by the
algorithm. The questions were answered on a five-point Likert item. The
final question was a subjective question, asking for any comments or
feedback on the recommendations. The questions used in the survey are shown
in Table 2.
          Table 2. Questions used in the user survey.
Q. 1   How would you rate the relevance of items?
Q. 2   How would you rate the novelty of the recommended items?
Q. 3   How would you rate the serendipity of the recommended items?
Q. 4   How would you rate the recommendations overall?
Q. 5   Provide any comments/feedback about the recommendations that were
       given to you.

6. RESULTS & DISCUSSION
A user survey was carried out online accompanying the online music
recommendation service because of the difficulties in measuring novelty. A
total of 24 users tested the recommendations offered to them on the website.
These users were random Last.fm users that had received private messages
(advertising the user test) through the Last.fm messaging system. The new
recommendation system was also advertised on various Last.fm groups whose
interests were in finding new music or whose members were unsatisfied with
current recommender systems and their quite obvious recommendations.
However, because the users had to answer two surveys for two different sets,
some appeared to have quit abruptly after finishing the first set. As a
result, only 11 users out of 24 completed the second survey.

The private messages were sent to random Last.fm users who satisfied two
conditions: 1) the user used the “loved” function with his/her playlist, and
2) the last time the user logged in was not more than two weeks before the
day the private messages were sent. Despite the advertisements and private
messages, the response rate was extremely low (< 10 %). The results are
shown in Figures 6-8.

Figure 8. Comparison of the overall ratings for the two sets.

The results of the user test on the recommendations produced by the proposed
algorithm are generally positive. The mean value for the relevance of the
items was 3.417 (on a 5-point scale) with a confidence interval of 0.390
(alpha = 0.05). The mean values of novelty and serendipity were also on the
positive side with 3.667 and 3.625, respectively. The confidence intervals
were 0.436 (alpha = 0.05) for novelty and 0.350 (alpha = 0.05) for
serendipity. The overall rating of the recommender system had a mean value
of 3.458 with a confidence interval of 0.263 (alpha = 0.05). In general, the
results show that the proposed system has positive ratings and could be
refined to produce better results.

The proposed system was rated higher in both novelty and serendipity,
compared to the second set of recommendations, which was intended to imitate
existing systems such as Last.fm.

For this study, the parameters of the system were set with values that we
thought produced the desired results after several iterations of the
algorithm. However, a full study focused on finding the optimal values for
the parameters would be an excellent follow-up study and would greatly
enhance the recommendations of the system.
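As an illustration of how a mean rating and its confidence interval at
alpha = 0.05 can be obtained from Likert responses, consider the sketch
below. The paper does not state how its intervals were computed, so the
sketch assumes that a reported value is the half-width of a two-sided
interval around the mean under a normal approximation. The function name and
the ratings are hypothetical and are not the survey data.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_with_ci(ratings, alpha=0.05):
    """Return (mean, half_width) of a two-sided confidence interval.

    Assumption: normal approximation, half_width = z * s / sqrt(n),
    where s is the sample standard deviation.
    """
    n = len(ratings)
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    half_width = z * stdev(ratings) / sqrt(n)
    return mean(ratings), half_width


# Hypothetical five-point ratings from eight respondents.
ratings = [4, 3, 5, 2, 4, 3, 4, 4]
m, h = mean_with_ci(ratings)
print(f"mean = {m:.3f}, interval = [{m - h:.3f}, {m + h:.3f}]")
```

With as few as 11 to 24 respondents, a Student t quantile in place of `z`
would give a somewhat wider interval; the normal quantile is used here only
to keep the sketch within the standard library.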
Figure 6. Comparison of the relevance ratings for the two sets.

The score for the novelty of recommended items could have been higher,
because the algorithm did not check whether the recommended songs existed in
the user's library before being offered. Thus, the user would see some
artists that they were aware of. As implied above, it is quite easy to
increase the percentage of novel items in the entire recommendation list:
simply check whether the artist exists in the user's library and, if it
does, exclude it from the recommendations. However, this step was excluded
from the algorithm deliberately to increase the confidence of the users in
the proposed system. The basis for this was [20], in which the authors found
that users liked to see familiar items in the recommendations, which
ultimately led to an increase of user confidence in the system. Checking to
see if the user is familiar with the recommended item would produce more
"dense" novel recommendations.
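The library check described above, which was deliberately left out of the
tested system, could look like the following sketch. The function name, the
(artist, song) tuple representation, and the sample data are hypothetical;
case-insensitive matching of artist names is likewise an assumption.

```python
def exclude_known_artists(recommendations, library_artists):
    """Drop recommended (artist, song) pairs whose artist is in the library."""
    known = {artist.casefold() for artist in library_artists}
    return [(artist, song) for artist, song in recommendations
            if artist.casefold() not in known]


# Hypothetical recommendation list and user library.
recommendations = [
    ("Grizzly Bear", "Song A"),
    ("Artist X", "Song B"),
    ("Artist Y", "Song C"),
]
library = ["grizzly bear", "Artist Y"]

novel_only = exclude_known_artists(recommendations, library)
print(novel_only)
```

Applying such a filter trades the familiar items that support user
confidence for a higher share of novel items, so it is best treated as an
optional step.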
Figure 7. Comparison of the novelty ratings for the two sets.

Regarding the novelty of items, an unforeseen problem was revealed after the
user test. One user commented, "I have most of the bands recommended on my
computer, I just haven't given them much of a listen to. Grizzly Bear in
particular..." The problem here is whether, in this user's case, Grizzly
Bear is a novel recommendation. The user states that s/he did not listen to
many of the recommended artists, although those artists were in his/her
library. Because the algorithm depends on the playcount of the songs in a
user's library, it is totally blind to tracks that reside in the library but
have a playcount of 0. Thus, it recommends songs that it believes to be
novel to the user, when they could in fact
exist in the library already. Unsurprisingly, the novelty and serendipity
ratings from this user were low (a score of 2 for each), but the rating on
the overall system was positive (a score of 4). Clarifying such issues on
what a novel item is would help improve the algorithm and the user's
perception of the system.

7. FUTURE RESEARCH
The most urgent and important future work on this particular study would be
to find the ideal parameter settings to produce the desired recommendations.
Due to the available time frame for this study, much of the algorithm
analysis, including the setting of the parameters, was done manually, simply
by iterating through different settings and observing the results. By
finding the optimized values for parameters such as Expert Head Size, User
Tail Size, and Item Weights, the quality of the recommendations in novelty
and relevance would be greatly enhanced.

Work on expanding the flexibility of the algorithm can also be done,
creating additional parameters that bring changes to the recommendations.
More parameters would mean that the algorithm could be suited for each
user's needs, bringing the possibility of creating an ever more personalized
set of recommendations.

The overall system itself could be further developed to integrate
content-based analysis for better results. Although the proposed method is
in its infancy, we believe that the only way to improve it further (after it
has fully developed independently) will be to incorporate a content-based
algorithm to improve on its remaining weaknesses as an algorithm that is
based on user profiles.

8. CONCLUSION
In this paper, a novel approach to recommending unfamiliar artists relative
to each user was proposed in order to tackle the problem of the high density
of obvious items in the list of recommendations found in today's recommender
systems. The key concept in this approach was that novel items did not
always have to be items that reside in the long tail of the popularity
distribution. Although novel or unfamiliar items, more often than not, do
indeed reside in the long tail of the popularity distribution, it is
important to acknowledge that even well-known artists could be unknown to
users who (a) are interested in different genres and (b) are in different
cultures and/or countries.

A system that produced recommendations was implemented and was available
online for users to use and rate. The recommendations were produced using
data collected from Last.fm. Results of the user surveys show that the
proposed system succeeds in providing novel recommendations to users, while
keeping those items also relevant. This study shows the potential of such an
approach to recommending novel items, while maintaining a collaborative
filtering algorithm without the support from content-based algorithms.

9. ACKNOWLEDGEMENTS
The authors would like to thank Professor Sangki ‘Steve’ Han at the Graduate
School of Culture Technology, KAIST and Sheayun Lee for their valuable
comments and feedback.

10. REFERENCES
[1] Anderson, C. The Long Tail: Why the Future of Business Is Selling Less
    of More. Hyperion, 2006. ISBN 1401302378.
[2] Nielsen Soundscan. State of the Industry. National Association of
    Recording Merchandisers, 2008.
[3] Goldberg, D., Nichols, D., Oki, B. M., and Terry, D. Using Collaborative
    Filtering to Weave an Information Tapestry. Commun. ACM, 35(12):61-70,
    1992. ISSN 0001-0782.
[4] Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., and Riedl, J.
    Grouplens: An Open Architecture for Collaborative Filtering of Netnews.
    In CSCW 1994, pages 175-186.
[5] Shardanand, U. and Maes, P. Social Information Filtering: Algorithms for
    Automating “word of mouth”. In CHI '95, pages 210-217.
[6] Hill, W., Stead, L., Rosenstein, M., and Furnas, G. Recommending and
    Evaluating Choices in Virtual Community of Use. In CHI '95, pages
    194-201.
[7] Celma, O. and Lamere, P. If you like the Beatles you might like…: a
    tutorial on music recommendation. ACM Multimedia, pages 1157-1158, ACM,
    2008.
[8] Hu, X. and Downie, J. S. Exploring mood metadata: Relationship with
    genre, artist, and usage metadata. September 2007.
[9] Eck, D., Lamere, P., Bertin-Mahieux, T., and Green, S. Automatic
    generation of social tags for music recommendation. In Advances in
    Neural Information Processing Systems 20. MIT Press, 2008.
[10] Symeonidis, P., Ruxanda, M. M., Nanopoulos, A., and Manolopoulos, Y.
     Ternary semantic analysis of social tags for personalized music
     recommendation. ISMIR, pages 219.
[11] Celma, O. Foafing the music: Bridging the semantic gap in music
     recommendation. In Proceedings of the 5th International Semantic Web
     Conference, pages 927-934, Springer, 2006.
[12] Celma, O. and Herrera, P. A new approach to evaluating novel
     recommendations. In RecSys '08, pages 179-186, New York, 2008.
[13] Celma, O. and Cano, P. From hits to niches?: or how popular artists can
     bias music recommendation and discovery. In NETFLIX '08: Proceedings of
     the 2nd KDD Workshop on Large-Scale Recommender Systems and the Netflix
     Prize Competition, pages 1-8, New York, NY, USA, 2008.
[14] Celma, O. Music Recommendation and Discovery in the Long Tail. PhD
     thesis.
[15] Pampalk, E. and Goto, M. Musicrainbow: A new user interface to discover
     artists using audio-based similarity and web-based labeling. ISMIR,
     pages 367-370, 2006.
[16] Duda, R. O. and Hart, P. E. Pattern classification and scene analysis.
     New York, 1973.
[17] Stone, M. Cross-validatory choice and assessment of statistical
     predictions. Roy. Stat. Soc., 36:111-147, 1974.
[18] Herlocker, J. L., Konstan, J. A., and Riedl, J. T. Evaluating
     collaborative filtering recommendations. In Computer Supported
     Cooperative Work, pages 241-250, 2000.
[19] Schafer, J. B., Frankowski, D., Herlocker, J., and Sen, S. Collaborative
     filtering recommender systems, 2007.
[20] Singha, S., Rashmi, K. S., and Sinha, R. Beyond algorithms: An HCI
     perspective on recommender systems, 2001.