30Music listening and playlists dataset

Roberto Turrin, ContentWise R&D, roberto.turrin@contentwise.tv
Massimo Quadrana, DEIB, Politecnico di Milano, massimo.quadrana@polimi.it
Andrea Condorelli, ContentWise R&D, andrea.condorelli@contentwise.tv
Roberto Pagano, DEIB, Politecnico di Milano, roberto.pagano@polimi.it
Paolo Cremonesi, DEIB, Politecnico di Milano, paolo.cremonesi@polimi.it


ABSTRACT
We introduce the 30Music dataset¹, a collection of listening and playlist data retrieved from Internet radio stations through the Last.fm API. In this paper we describe its creation process, its content, and its possible uses. Attractive features of the 30Music dataset that differentiate it from existing public datasets include, among others, (i) user listening sessions complete with contextual time information, (ii) user playlists, and (iii) positive user ratings, all key information for experimenting with the tasks of modeling user taste and recommending playlists.

1. INTRODUCTION
Several challenges in the music domain have been only partially explored due to the scarcity of data available to researchers for experiments. For instance, tasks such as user modeling and playlist recommendation require implicit contextual information about listening events (e.g., user, track, time, duration), explicit information about user preferences (e.g., loved tracks, playlists), and user listening sessions.

In this paper we introduce the 30Music dataset, a freely available music dataset designed to overcome these problems. The main innovative aspects of the 30Music dataset with respect to the existing public datasets are:

     • the dataset contains both implicit play events and explicit user ratings (i.e., preferred tracks);

     • the dataset contains user-generated playlists;

     • play events are organized into listening sessions;

     • whenever a user plays a track from a playlist, the play event is tagged.

The rest of the paper is organized as follows. Section 2 discusses the existing music datasets. Section 3 presents the process implemented to crawl the data from Last.fm and create the dataset, whose main characteristics are explored in Section 4. Finally, Section 5 draws the conclusions and discusses future work.

2. RELATED DATASETS
A number of publicly available music datasets exist and have been used in several music experiments. Most datasets provide content information (e.g., metadata, tags, acoustic features), but only a few report user-system interactions (e.g., ratings, play events) useful for profiling users and experimenting with personalization tasks.

The Million Song Dataset (MSD) [1] is a public collection well known for its size. It contains audio features (e.g., pitches, timbre, loudness, as provided by the Echo Nest Analyze API²) and textual metadata (e.g., Musicbrainz³ tags, Echo Nest tags, Last.fm tags) about 1M songs (related to 44K artists).

Celma [2] has published two music datasets collected from the Last.fm API: 1K-user and 360K-user. The smaller one, the 1K-user dataset, contains the listening habits (20M play events) of fewer than 1K users. The larger one, the 360K-user dataset, collects information about 360K users, but it does not have any listening data other than the number of times a user has listened to an artist. Data are provided as downloaded from the Last.fm API.

Yahoo! Labs have released several music datasets⁴. For instance, the R1 and the R2 datasets provide ratings on artists and songs, respectively, but not user play events.

Some datasets have been extracted from microblogs, such as the Million Musical Tweets Dataset [3]. Finally, the Art of the Mix Playlist dataset⁵ consists of 29K user-contributed playlists, containing 218K distinct songs by 60K distinct artists. However, it contains no user listening events.

3. DATASET CREATION
The 30Music dataset has been obtained via the Last.fm public API⁶. Last.fm provides a free API to track the details of user listening sessions. When a user has connected a supported player to their Last.fm account, the player “scrobbles” the user listening activity, i.e., it transfers each play event to Last.fm, which records the interaction. It is worth noting that only listening events are recorded: pause/skip events are not scrobbled from the user's player to Last.fm, nor is any playlist or explicit preference defined or expressed in the player. The main way for a user to create a playlist in Last.fm is through the website; similarly, the user can express explicit preferences (‘love’) about tracks directly on the website. As a consequence, explicit ratings and playlists stored in Last.fm are not biased by external systems (e.g., the recommendations proposed in the player).

¹ http://recsys.deib.polimi.it/?page_id=54
² http://the.echonest.com/
³ https://musicbrainz.org/
⁴ http://webscope.sandbox.yahoo.com/catalog.php
⁵ http://labrosa.ee.columbia.edu/projects/musicsim/aotm.html
⁶ http://www.last.fm/api

Copyright is held by the author(s).
RecSys 2015 Poster Proceedings, September 16-20, 2015, Vienna, Austria.
Unfortunately, Last.fm requires players to scrobble only tracks played for at least half their duration (or for at least 4 minutes), so events not matching these conditions, such as skip events, are not reliable (although many events shorter than 5 seconds have been found in the collected data).

To build the 30Music dataset, we started from a list of 2M Last.fm usernames from the Chris Meller dataset⁷. Given the list of users, we retrieved their playlists (User.getPlaylists) together with the single tracks composing the playlists (Playlist.fetch). Starting from users with at least one playlist (about 90K users), we retrieved (User.getRecentTracks) the user listening events over a 1-year time window (from Mon, 20 Jan 2014 09:24:19). The raw playlists and user listening events have been enriched with additional information about both users (User.getInfo) and tracks (Track.getInfo).

Furthermore, the data downloaded with the Last.fm API has been processed using Python scripts that exploit some Apache Spark functions for distributed processing of the massive amount of data. In order to keep only complete and reliable data, we discarded users with any missing data (e.g., if a track scrobbled by the user has wrong metadata and is not recognized by Last.fm, the whole user is discarded). In this way, we retained only the users with complete information, roughly half of the total.

Finally, we defined a new entity, the user play session, as an ordered list of play events that are assumed to have been listened to consecutively by the user with no interruptions. We define a play event to be part of a session if it occurs no later than 800 seconds after the previous play event of the same user. This processing required computing, for each user listening event, the play time together with the fraction of the track actually listened to by the user.

30Music format.
The 30Music dataset is released in accordance with the [anonymized for double blind review] data format, a multigraph representation oriented to recommender system evaluation that explicitly represents entities (i.e., nodes) and relations (i.e., edges).

Entities model any object that can be recommended. The dataset is formed by 45K users, 5.6M tracks, 50K playlists, 600K artists, 200K albums, and 280K tags. Relations model links between two (or more) entities. We defined 31M user play events, 2.7M user play sessions, and 4.1M user love preferences.

4. DATASET ANALYSIS
The dataset contains 31,351,954 play events organized into 2,764,474 sessions (an average of 11 play events per session). The dataset also contains 4,106,341 explicit ratings (loved tracks), with an average of 33 ratings per user, and 57,561 user-created playlists. The number of events without track duration is 1,277,893 (4.08%).

We can observe that play events present a moderate long-tail distribution: the 20% most popular tracks collect 80% of the play events. This long-tail effect is mitigated by focusing on preferred tracks (i.e., loved tracks and tracks in the playlists). We observe that the same percentage of play events (80%) involves twice as many tracks (40%) when considering tracks in the playlists. We can deduce that users have preferences spanning many different tracks, but their listening behaviour is biased toward the most popular tracks.

A similar analysis has been performed by aggregating the tracks of the same artist. Unlike the track-level events, these play events present a strong long-tail distribution: the 20% most popular artists collect more than 95% of the play events. This long-tail effect is strongly mitigated when analyzing preferred tracks. The same percentage of play events (95%) involves 50% of the artists when considering tracks in the playlists. We can deduce that users have preferences spanning many different artists, but their listening behaviour is strongly biased toward the most popular artists.

An analysis of the empirical cumulative distribution of explicit likes as a function of the percentage of tracks and artists shows that only 14.73% of the tracks and 19.93% of the artists have received at least one explicit preference. We observed that the distribution of the explicit ratings over tracks does not exhibit a strong long-tail behavior. The 5% most popular tracks collect 75% of the explicit ratings. On the other hand, the number of explicit ratings is strongly skewed toward a few very popular artists. The 5% most popular artists collect more than 90% of the explicit ratings. These results confirm our previous intuitions about users’ listening behaviour. Users tend to love (and to listen to) a few very popular artists. However, their preferences span several tracks of these very popular artists. Still, they tend to provide explicit ratings for only a few of the tracks they have listened to. This may be due to the mechanism adopted by Last.fm to collect explicit feedback, which forces users to leave their usual music player, access the Last.fm online service, and give their “love” to a track there. This clearly imposes a heavy burden on users, but on the other hand it enhances the strength of each explicit rating, because it is a clear expression of the willingness of the specific user to provide that feedback.

5. CONCLUSIONS AND FUTURE WORK
In this paper, we presented the 30Music dataset, a music dataset consisting of both user interactions (i.e., user play sessions) and explicit user preferences (i.e., playlists, preferred tracks). The dataset is made available to the research community and we expect it will foster the exploration of the several challenges still open in the setting of online music applications.

Acknowledgements.
The research leading to these results was performed in the CrowdRec project, which has received funding from the European Union Seventh Framework Programme FP7/2007-2013 under grant agreement n. 610594.

6. REFERENCES
[1] T. Bertin-Mahieux, D. P. Ellis, B. Whitman, and P. Lamere. The million song dataset. 12th Int. Conf. on Music Information Retrieval (ISMIR), 2011.
[2] O. Celma. Music Recommendation and Discovery in the Long Tail. Springer, 2010.
[3] D. Hauger, M. Schedl, A. Kosir, and M. Tkalcic. The million musical tweet dataset - what we can learn from microblogs. In Proc. of the 14th Int. Society for Music Information Retrieval Conference, Nov 4-8 2013.

⁷ https://opendata.socrata.com/Business/Two-Million-LastFM-User-Profiles/5vvd-truf
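
As an illustration of the session definition given in Section 3 (a play event belongs to the current session if it occurs no later than 800 seconds after the previous play event of the same user), the following is a minimal sketch, not the scripts actually used to build the dataset. The function name `split_sessions`, the `(user, timestamp, track)` tuple layout, and the toy events are hypothetical. The sketch also assumes that the gap is measured between the start timestamps of consecutive events, which the text does not specify.

```python
from itertools import groupby
from operator import itemgetter

SESSION_GAP_SECONDS = 800  # threshold stated in Section 3


def split_sessions(events, max_gap=SESSION_GAP_SECONDS):
    """Group (user, timestamp, track) tuples into per-user play sessions.

    Assumption: the gap is measured between the start timestamps (in
    seconds) of consecutive play events of the same user.
    """
    sessions = []
    # Sort by user, then by timestamp, so each user's events are contiguous.
    ordered = sorted(events, key=itemgetter(0, 1))
    for user, user_events in groupby(ordered, key=itemgetter(0)):
        current = []
        previous_ts = None
        for _, ts, track in user_events:
            # A gap strictly larger than max_gap closes the current session.
            if previous_ts is not None and ts - previous_ts > max_gap:
                sessions.append((user, current))
                current = []
            current.append((ts, track))
            previous_ts = ts
        sessions.append((user, current))
    return sessions


# Toy data (hypothetical): timestamps are seconds from an arbitrary origin.
events = [
    ("u1", 0, "a"),
    ("u1", 200, "b"),
    ("u1", 1000, "c"),   # 800 s after the previous event: same session
    ("u1", 1801, "d"),   # 801 s after the previous event: new session
    ("u2", 50, "a"),
]
sessions = split_sessions(events)
```

With these toy events, the first three plays of `u1` form one session, the fourth play opens a second one, and `u2` has a single one-event session.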
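
The long-tail figures reported in Section 4 (e.g., the 20% most popular tracks collecting 80% of the play events) are shares of events collected by the most popular items. The sketch below shows one way such a share could be computed. It is an assumption about the procedure, not the authors' analysis code; the function name `top_share`, the rounding down of the number of top items, and the toy play list are hypothetical.

```python
from collections import Counter


def top_share(item_ids, top_fraction):
    """Return the fraction of events collected by the most popular items.

    `item_ids` holds one entry per event (e.g., the track id of each play).
    Assumption: the number of "top" items is rounded down, with a minimum
    of one item.
    """
    counts = sorted(Counter(item_ids).values(), reverse=True)
    if not counts:
        raise ValueError("no events")
    k = max(1, int(top_fraction * len(counts)))
    return sum(counts[:k]) / sum(counts)


# Toy data (hypothetical): 10 play events over 5 tracks.
plays = ["t1"] * 6 + ["t2", "t3", "t4", "t5"]
share = top_share(plays, 0.2)  # share of plays collected by the top 20% of tracks
```

The same function could be applied to artist ids instead of track ids, or to loved tracks instead of play events, to reproduce the other comparisons discussed in Section 4.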