Towards Effective Exploration/Exploitation in Sequential Music Recommendation

Himan Abdollahpouri
DePaul University, USA
habdolla@depaul.edu

Steve Essinger
Pandora Media, Inc., USA
sessinger@pandora.com

RecSys 2017 Poster Proceedings, August 27-31, Como, Italy.

ABSTRACT

Music streaming companies collectively serve billions of songs per day. Radio-based music services may intersperse audio advertisements among the songs as a means to generate revenue, much like traditional FM radio. Regardless of the monetization approach, the recommender system should decide when to play content that the listener is known to enjoy (exploit) and content that is novel to the listener (explore). Recommender systems that rely on this explore/exploit type of framework have been deployed in a wide variety of applications such as movies, books, music, shopping and more. In this work, we investigate the impact of different ad/song sequences on listener behavior. In particular, we focus on the impact of exploring new song content for the listener given the previous sequence of ads and songs in the listener's session. Our results show that the prior sequence matters when considering song exploration and that this prior sequence has an impact on the listener's tendency to interrupt their current session.

1 INTRODUCTION

Recommender systems (RS) have been deployed in numerous domains including music, movies, e-commerce and books. In music recommendation, one of the overarching goals of the RS is to find the best song to play for each listener, personalized to their specific taste(s) in music. In general, companies offering music recommendation services provide two different types of subscriptions: (1) ad-supported membership, where the music is free but the listener is subject to advertisements, and (2) premium membership, where the listener pays a monthly membership fee in exchange for ad-free listening. This paper focuses on the former, ad-supported listening. Unsurprisingly, listeners prefer hearing songs over ads. However, the business depends on the revenue that it makes from the ads and cannot operate without serving them. Therefore, playing ads is crucial to keep the business alive, and ads should be considered as content served to the listener along with music.

One of the fundamental concepts in RS is the idea of exploration and exploitation [9]. This paradigm results in a balance between recommending content the system has high certainty the user would like (exploitation) and content for which there is less certainty (exploration). Without exploration, users would become stuck in a filter bubble and continue to see a narrow set of products. This is a missed opportunity to experience other products that could be of interest to them [5, 8]. Another reason for exploration is when the number of items matching a user's interest is limited and the system should not recommend the same item again to the user. For example, in online dating [7], it is possible that the system has already recommended all the available people who match the user's interest, so exploring a wider range of people is needed in order to be able to generate new recommendations. Therefore, providing exploratory content to a user is a key component for discovery. We conducted an experiment on a music recommendation application, and our results show that the previous sequence of events in a listener's session is important in deciding whether the RS should provide subsequent exploratory types of content.

2 SONG/AD SEQUENCE ANALYSIS

We have compiled data from a large-scale music recommendation service for our analysis. To find the effect of different sequences of songs and ads on the probability of a user switching the station after listening to an exploratory song, we looked at one million sessions on mobile devices where the ad placement had been made completely at random. Note that the randomness of ad placement is important in order to make sure our analysis is not biased toward any particular ad placement algorithm. We compare the impact of explore songs versus exploit songs in the context of the previous three events. For example, given the prior three events Ad, Song, Song, where each song is an exploit, what is the probability of the listener changing the station if the next song spun for them is an explore song versus the probability of station change given an exploit song? Station change is used as a proxy for discontent with the current stream of music.

We calculated the probabilities of users changing the station when they are exposed to different sequences of ads and songs as follows. There are a total of 8 possible event combinations for a set of three items, as shown in Figure 1. We denote an explore song by $S'$ and an exploit song by $S$. Station change is represented by $C$.

\[ \text{Percentage Difference} = \frac{P(C \mid S') - P(C \mid S)}{P(C \mid S)} \times 100 \tag{1} \]

$P(C \mid S')$ is the probability of a user changing the station given that the last played content is an explore song. $P(C \mid S)$ is the probability of a user changing the station when the last played content is an exploit song. The lower and upper confidence bounds for the computed percentage increases, shown in Figure 1 as vertical blue lines on top of the bars, are computed as follows,

\[ \left( \frac{P(C \mid S') - P(C \mid S)}{P(C \mid S)} \pm 1.96 \times SE \right) \times 100 \tag{2} \]

where $SE$ (i.e. the standard error) is calculated using Equation 3,

\[ SE = \sqrt{\frac{P(C \mid S') \, (1 - P(C \mid S'))}{N_{S'}} + \frac{P(C \mid S) \, (1 - P(C \mid S))}{N_{S}}} \tag{3} \]

where $N_{S'}$ is the total number of times an explore song has been played. The total number of times an exploit song has been played is denoted by $N_{S}$. Figure 1 shows the percent increase of station changes after playing an explore versus an exploit song when a user has observed the respective prior sequence of exploit songs and ads. Due to the company's data privacy policy, we have not included the individual probabilities of switching the station for explore and exploit songs, but have provided the percentage difference in the probability of station change.

[Figure 1 (bar chart). Title: "Percent Increase of Station Change: Explore versus Exploit Song, following Sequence". Y-axis: Percent Increase. X-axis: Prior Event Sequence (the eight three-event combinations of S and A). Bar labels (percent): 531, 208, 196, 138, 133, 104, 99, 64; the text identifies 531 with ASA and 64 with SAA.]

Figure 1: Percent increase of the probability of station change for an explore song vs. an exploit song, following different sequences of exploit songs and ads.

In Figure 1, an exploit song is denoted by S and an ad by A. As you can see, regardless of the previous sequence of songs and ads, the probability of a user switching the station when we show them an explore song is higher than the probability when we show an exploit song. This is true for all 8 different combinations of songs and ads. Moreover, some sequences are riskier than others for placing an explore song. For example, the ASA sequence (which means playing an ad, then a song and then another ad) has the highest probability increase (+531.13%) of a user switching the station when given an explore song after that sequence. Clearly, this is not the best opportunity to explore new content. On the other hand, the SAA sequence has the lowest probability increase (+64.42%), but it is still positive. While playing an explore song is still riskier than an exploit song in all cases, it is better to explore after particular sequences than after others. Different sequences of songs and ads have different effects on station switching behavior, and a recommender system should try to take these sequences into account when doing exploration and exploitation, as in our sequential music recommendation system. Overall, instead of a blind explore-exploit platform, we advise taking an intelligent approach that takes a listener's state of listening (whether they are happy with the past couple of songs/ads or not) into account when deciding whether to exploit or to explore.

3 RELATED WORK

The idea of explore-exploit has been studied in recommender systems by some researchers [9]. In particular, for single item recommendation, approaches like Multi-Armed Bandits have been used to strike a balance between exploration and exploitation [10]. Moreover, authors have previously proposed an approach for an effective balance between recommending popular and long-tail items [2]. An idea closer to our work appears in [6], where the authors investigated proper timing for delivering a recommendation. However, in our work, we are not looking for a perfect timing for the recommendation in general, as the user should always receive content (a song or an ad) as a recommendation. Our work is also novel in that we look at the previous sequence of recommendations as an indication of whether it is a good time for exploration or not.

4 CONCLUSION AND FUTURE WORK

In this work, we investigated the impact of different ad/song sequences on listener behavior. In particular, we focused on the impact of exploring new song content for the listener given the previous set of ads and songs in the listener's session. Our experimental results show that the previous sequence of ads/songs matters in deciding what the right time is for exploration versus exploitation. For our future work, we will launch an A/B experiment controlling for the placement of explore songs and see how different users behave when they observe different sequences of songs and ads. We will also investigate more sophisticated offline models, such as HMMs and RNNs in a reinforcement learning setting, that could learn superior personalized playlist sequencing. This work is a starting point for a larger project in which we aim to optimize the stream of recommendations of mixed types of content (i.e. content from different stakeholders) [1, 3, 4].

ACKNOWLEDGMENTS

We would like to thank Pandora Media, Inc. for access to their vastly rich dataset.

REFERENCES

[1] Himan Abdollahpouri, Robin Burke, and Bamshad Mobasher. 2017. Recommender systems as multi-stakeholder environments. In Proceedings of the 25th Conference on User Modeling, Adaptation and Personalization (UMAP2017). ACM.
[2] Himan Abdollahpouri, Robin Burke, and Bamshad Mobasher. 2017. Controlling Popularity Bias in Learning to Rank Recommendation. In Proceedings of the 11th ACM Conference on Recommender Systems. ACM, To appear.
[3] Himan Abdollahpouri and Steve Essinger. 2017. Multiple stakeholders in a music recommender system. In 1st International Workshop on Value-Aware and Multistakeholder Recommendation at RecSys 2017.
[4] Robin Burke and Himan Abdollahpouri. 2017. Patterns of Multistakeholder Recommendation. In 1st International Workshop on Value-Aware and Multistakeholder Recommendation at RecSys 2017.
[5] Oscar Celma. 2016. The Exploit-Explore Dilemma in Music Recommendation. In Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 377–377.
[6] Nofar Dali Betzalel, Bracha Shapira, and Lior Rokach. 2015. Please, not now!: A model for timing recommendations. In Proceedings of the 9th ACM Conference on Recommender Systems. ACM, 297–300.
[7] Luiz Pizzato, Tomek Rej, Thomas Chung, Irena Koprinska, and Judy Kay. 2010. RECON: a reciprocal recommender for online dating. In Proceedings of the Fourth ACM Conference on Recommender Systems. ACM, 207–214.
[8] Paul Resnick, R Kelly Garrett, Travis Kriplean, Sean A Munson, and Natalie Jomini Stroud. 2013. Bursting your (filter) bubble: strategies for promoting diverse exposure. In Proceedings of the 2013 Conference on Computer Supported Cooperative Work Companion. ACM, 95–100.
[9] Hastagiri P Vanchinathan, Isidor Nikolic, Fabio De Bona, and Andreas Krause. 2014. Explore-exploit in top-n recommender systems via gaussian processes. In Proceedings of the 8th ACM Conference on Recommender Systems. ACM, 225–232.
[10] Xinxi Wang, Yi Wang, David Hsu, and Ye Wang. 2014. Exploration in interactive personalized music recommendation: a reinforcement learning approach. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 11, 1 (2014), 7.
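
As a worked illustration of Equations (1) to (3) in Section 2, the following is a minimal sketch of how the percent increase and its confidence bounds might be computed from raw counts. The function name `percent_increase_with_bounds`, its parameters, and the example counts are hypothetical and are not taken from the paper, which does not disclose the individual probabilities. Equation (2) is read here as adding ±1.96 × SE to the relative difference before multiplying by 100; that reading is an assumption about the printed formula.

```python
import math


def percent_increase_with_bounds(changes_explore, plays_explore,
                                 changes_exploit, plays_exploit, z=1.96):
    """Return (percent_increase, lower_bound, upper_bound).

    All arguments are hypothetical counts for one prior event sequence:
    how often a song type was played (N_S', N_S) and how often the
    listener then changed the station.
    """
    p_explore = changes_explore / plays_explore  # P(C | S')
    p_exploit = changes_exploit / plays_exploit  # P(C | S)

    # Equation (1) before scaling: relative difference of the probabilities.
    relative_diff = (p_explore - p_exploit) / p_exploit

    # Equation (3): standard error of the difference of two proportions.
    se = math.sqrt(
        p_explore * (1 - p_explore) / plays_explore
        + p_exploit * (1 - p_exploit) / plays_exploit
    )

    # Equation (2), as read above: (relative_diff +/- z * SE) * 100.
    lower = (relative_diff - z * se) * 100
    upper = (relative_diff + z * se) * 100
    return relative_diff * 100, lower, upper


# Illustrative, made-up counts (not data from the paper).
pct, lower, upper = percent_increase_with_bounds(300, 10_000, 100, 10_000)
print(f"percent increase: {pct:.2f} (bounds {lower:.2f} to {upper:.2f})")
```

Working from counts rather than probabilities keeps $N_{S'}$ and $N_{S}$ explicit, since Equation (3) needs both sample sizes; the same function could be called once per prior sequence to obtain the eight bars and their error lines.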