30Music listening and playlists dataset

Roberto Turrin, ContentWise R&D, roberto.turrin@contentwise.tv
Massimo Quadrana, DEIB, Politecnico di Milano, massimo.quadrana@polimi.it
Andrea Condorelli, ContentWise R&D, andrea.condorelli@contentwise.tv
Roberto Pagano, DEIB, Politecnico di Milano, roberto.pagano@polimi.it
Paolo Cremonesi, DEIB, Politecnico di Milano, paolo.cremonesi@polimi.it

ABSTRACT
We introduce the 30Music dataset¹, a collection of listening and playlist data retrieved from Internet radio stations through the Last.fm API. In this paper we describe the creation process, its content, and its possible uses. Attractive features of the 30Music dataset that differentiate it from existing public datasets include, among others, (i) the user listening sessions complete with contextual time information, (ii) the user playlists, and (iii) the positive user ratings, key information to experiment with the task of modeling user taste and recommending playlists.

1. INTRODUCTION
Several challenges in the music domain have been only partially explored due to the scarcity of data available to researchers for experiments. For instance, tasks such as user modeling and playlist recommendation require implicit contextual information about listening events (e.g., user, track, time, duration), explicit information about user preferences (e.g., loved tracks, playlists), and user listening sessions.
In this paper we introduce the 30Music dataset, a freely available music dataset designed to overcome these problems. The main innovative aspects of the 30Music dataset with respect to the existing public datasets are:
• the dataset contains both implicit play events and explicit user ratings (i.e., preferred tracks);
• the dataset contains user-generated playlists;
• play events are organized into listening sessions;
• whenever a user plays a track from a playlist, the play event is tagged.
The rest of the paper is organized as follows. Section 2 discusses the existing music datasets. Section 3 presents the process implemented to crawl the data from Last.fm and create the dataset, whose main characteristics are explored in Section 4. Finally, Section 5 draws the conclusions and discusses future work.

2. RELATED DATASETS
A number of publicly available music datasets exist and have been used in several music experiments. Most datasets provide content information (e.g., metadata, tags, acoustic features), but only a few report user-system interactions (e.g., ratings, play events) useful to profile users and to experiment with personalization tasks.
The Million Song Dataset (MSD) [1] is a public collection well known for its size. It contains audio features (e.g., pitches, timbre, loudness, as provided by the Echo Nest Analyze API²) and textual metadata (e.g., MusicBrainz³ tags, Echo Nest tags, Last.fm tags) about 1M songs (related to 44K artists).
Celma [2] has published two music datasets collected from the Last.fm API: 1K-user and 360K-user. The smaller one, the 1K-user dataset, contains the listening habits (20M play events) of fewer than 1K users. The larger one, the 360K-user dataset, collects information about 360K users, but it does not have any listening data other than the number of times a user has listened to an artist. Data are provided as downloaded from the Last.fm API.
Yahoo! Labs have released several music datasets⁴. For instance, the R1 and R2 datasets provide ratings on artists and songs, respectively, but not user play events.
Some datasets have been extracted from microblogs, such as the Million Musical Tweets Dataset [3]. Finally, the Art of the Mix Playlist dataset⁵ consists of 29K user-contributed playlists, containing 218K distinct songs by 60K distinct artists. However, it contains no user listening events.

3. DATASET CREATION
The 30Music dataset has been obtained via the Last.fm public API⁶. Last.fm provides a free API to track the details of user listening sessions. When a user has connected a supported player to a Last.fm account, the player “scrobbles” the user listening activity, i.e., it transfers each play event to Last.fm, which records the interaction. It is worth noting that only listening events are recorded: pause/skip events are not scrobbled from the user player to Last.fm, nor is any playlist or explicit preference defined or expressed in the player. The main way for a user to create a playlist in Last.fm is to access the website; similarly, the user can express explicit preferences (‘love’) about tracks directly on the website. As a consequence, explicit ratings and playlists stored in Last.fm are not biased by external systems (e.g., the recommendations proposed in the player).
Unfortunately, Last.fm requires players to scrobble only tracks played for at least half their duration (or for at least 4 minutes), so events not matching these conditions, such as skip events, are not reliable (although many events shorter than 5 seconds have been found in the collected data).
To build the 30Music dataset, we started from a list of 2M Last.fm usernames from the Chris Meller dataset⁷. Given the list of users, we retrieved their playlists (User.getPlaylists) together with the single tracks composing the playlists (Playlist.fetch). Starting from users with at least one playlist (about 90K users), we retrieved (User.getRecentTracks) the user listening events over a 1-year time window (from Mon, 20 Jan 2014 09:24:19). The raw playlists and user listening events have been enriched with additional information about both users (User.getInfo) and tracks (Track.getInfo).
Furthermore, the data downloaded with the Last.fm API has been processed using Python scripts that exploit some Apache Spark functions for distributed processing of the massive amount of data. In order to keep only complete and reliable data, we discarded users with some missing data (e.g., if a track scrobbled by the user has the wrong metadata and is not recognized by Last.fm, the whole user is discarded). In this way, we retained only the half of the users that have complete information.
Finally, we defined a new entity, the user play session, as an ordered list of play events that are assumed to have been listened to consecutively by the user, with no interruptions. We define a play event to be part of a session if it occurs no later than 800 seconds after the previous play event of the same user. This processing required computing, for each user listening event, the play time together with the fraction of the track effectively listened to by the user.

30Music format.
The 30Music dataset is released in accordance with the [anonymized for double blind review] data format, a multigraph representation oriented to recommender system evaluation that explicitly represents entities (i.e., nodes) and relations (i.e., edges).
Entities model any object that can be recommended. The dataset is formed by 45K users, 5.6M tracks, 50K playlists, 600K artists, 200K albums, and 280K tags. Relations model the links between two (or more) entities. We defined 31M user play events, 2.7M user play sessions, and 4.1M user love preferences.

4. DATASET ANALYSIS
The dataset contains 31,351,954 play events organized into 2,764,474 sessions (an average of 11 play events per session). The dataset also contains 4,106,341 explicit ratings (loved tracks), with an average of 33 ratings per user, and 57,561 user-created playlists. The number of events without track duration is 1,277,893 (4.08%).
We can observe that play events present a moderate long-tail distribution: the 20% most popular tracks collect 80% of the play events. This long-tail effect is mitigated by focusing on preferred tracks (i.e., loved tracks and tracks in the playlists). We observe that the same percentage of play events (80%) involves twice the tracks (40%) when considering tracks in the playlists. We can deduce that users have preferences spanning many different tracks, but their listening behaviour is biased toward the most popular tracks.
A similar analysis has been performed by aggregating the tracks of the same artist. Differently from tracks, these play events present a strong long-tail distribution: the 20% most popular artists collect more than 95% of the play events. This long-tail effect is strongly mitigated when analyzing preferred tracks. The same percentage of play events (95%) involves 50% of the artists when considering tracks in the playlists. We can deduce that users have preferences spanning many different artists, but their listening behaviour is strongly biased toward the most popular artists.
An analysis of the empirical cumulative distribution of explicit likes as a function of the percentage of tracks and artists shows that only 14.73% of the tracks and 19.93% of the artists have received at least one explicit preference. We observed that the distribution of the explicit ratings over tracks does not exhibit a strong long-tail behavior: the 5% most popular tracks collect 75% of the explicit ratings. On the other hand, the number of explicit ratings is strongly skewed toward a few very popular artists: the 5% most popular artists collect more than 90% of the explicit ratings. These results confirm our previous intuitions about users' listening behaviour. Users tend to love (and to listen to) a few very popular artists. However, their preferences span several tracks of these very popular artists. Still, they tend to provide explicit ratings for few of the tracks they have listened to. This may be due to the mechanism adopted by Last.fm to collect explicit feedback, which forces users to leave their usual music player, access the Last.fm online service, and “love” the track there. This clearly imposes a heavy burden on users, but on the other hand it strengthens each explicit rating, because it is a clear expression of the willingness of the specific user to provide that feedback.

5. CONCLUSIONS AND FUTURE WORK
In this paper, we presented the 30Music dataset, a music dataset consisting of both user interactions (i.e., user play sessions) and explicit user preferences (i.e., playlists, preferred tracks). The dataset is made available to the research community and we expect it will foster the exploration of several challenges still open in the setting of online music applications.

Acknowledgements.
The research leading to these results was performed in the CrowdRec project, which has received funding from the European Union Seventh Framework Programme FP7/2007-2013 under grant agreement n. 610594.

6. REFERENCES
[1] T. Bertin-Mahieux, D. P. Ellis, B. Whitman, and P. Lamere. The million song dataset. 12th Int. Conf. on Music Information Retrieval (ISMIR), 2011.
[2] O. Celma. Music Recommendation and Discovery in the Long Tail. Springer, 2010.
[3] D. Hauger, M. Schedl, A. Kosir, and M. Tkalcic. The million musical tweet dataset - what we can learn from microblogs. In Proc. of the 14th Int. Society for Music Information Retrieval Conference, Nov 4-8 2013.

¹ http://recsys.deib.polimi.it/?page_id=54
² http://the.echonest.com/
³ https://musicbrainz.org/
⁴ http://webscope.sandbox.yahoo.com/catalog.php
⁵ http://labrosa.ee.columbia.edu/projects/musicsim/aotm.html
⁶ http://www.last.fm/api
⁷ https://opendata.socrata.com/Business/Two-Million-LastFM-User-Profiles/5vvd-truf

Copyright is held by the author(s).
RecSys 2015 Poster Proceedings, September 16-20, 2015, Austria, Vienna.
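
Illustrative sketch of the session definition. Section 3 defines a play event as part of a session if it occurs no later than 800 seconds after the previous play event of the same user. The Python fragment below is a minimal sketch of that rule, not the Spark-based processing actually used to build the dataset. The function name split_into_sessions, the representation of play events as Unix timestamps in seconds, and the choice to measure the gap between the start times of consecutive events are all assumptions made for illustration.

```python
from typing import Iterable, List

# Threshold stated in Section 3 of the paper.
SESSION_GAP_SECONDS = 800


def split_into_sessions(
    timestamps: Iterable[int], max_gap: int = SESSION_GAP_SECONDS
) -> List[List[int]]:
    """Group one user's play-event timestamps (Unix seconds) into sessions.

    Hypothetical helper: an event joins the current session when it starts
    no later than `max_gap` seconds after the previous event's start time
    (assumption: the gap is measured between start times).
    """
    sessions: List[List[int]] = []
    for ts in sorted(timestamps):
        # Open a new session for the first event or when the gap is too large.
        if not sessions or ts - sessions[-1][-1] > max_gap:
            sessions.append([ts])
        else:
            sessions[-1].append(ts)
    return sessions


if __name__ == "__main__":
    # Gaps are 300 s, 800 s, and 900 s: only the last one opens a new session.
    print(split_into_sessions([0, 300, 1100, 2000]))
```

An event exactly 800 seconds after the previous one stays in the same session, matching the "no later than" wording. Applying the function per user (for example after grouping events by user identifier) would yield the ordered play sessions described above.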