=Paper=
{{Paper
|id=Vol-2450/paper1
|storemode=property
|title=How Playlist Evaluation Compares to Track Evaluations in Music Recommender Systems
|pdfUrl=https://ceur-ws.org/Vol-2450/paper1.pdf
|volume=Vol-2450
|authors=Sophia Hadash,Yu Liang,Martijn C. Willemsen
|dblpUrl=https://dblp.org/rec/conf/recsys/HadashLW19
}}
==How Playlist Evaluation Compares to Track Evaluations in Music Recommender Systems==
Sophia Hadash, Jheronimus Academy of Data Science, 5211 DA 's-Hertogenbosch, The Netherlands, s.hadash@tue.nl
Yu Liang, Jheronimus Academy of Data Science, 5211 DA 's-Hertogenbosch, The Netherlands, y.liang1@tue.nl
Martijn C. Willemsen, Eindhoven University of Technology, 5600 MB Eindhoven, The Netherlands, and Jheronimus Academy of Data Science, 5211 DA 's-Hertogenbosch, The Netherlands, m.c.willemsen@tue.nl

ABSTRACT
Most recommendation evaluations in the music domain focus on algorithmic performance: how well a recommendation algorithm can predict a user's liking of an individual track. However, individual track ratings might not fully reflect the user's liking of the whole recommendation list. Previous work has shown that subjective measures such as perceived diversity and familiarity of the recommendations, as well as the peak-end effect, can influence the user's overall (holistic) evaluation of the list. In this study, we investigate how individual track evaluation compares to holistic playlist evaluation in music recommender systems, in particular how playlist attractiveness is related to individual track ratings and to other subjective measures (perceived diversity) or objective measures (objective familiarity, the peak-end effect, and the occurrence of good recommendations in the list). We explore this relation using a within-subjects online user experiment, in which the recommendations for each condition are generated by different algorithms. We found that individual track ratings cannot fully predict playlist evaluations, as other factors such as perceived diversity and the recommendation approach can influence playlist attractiveness to a larger extent. In addition, including only the highest and the last track rating (peak-end) predicts playlist attractiveness as well as including all track evaluations. Our results imply that it is important to consider which evaluation metric to use when evaluating recommendation approaches.

CCS CONCEPTS
• Human-centered computing → Empirical studies in HCI; Heuristic evaluations; • Information systems → Recommender systems; Relevance assessment; Personalization.

KEYWORDS
User-centric evaluation, recommender systems, playlist and track evaluation

ACM Reference Format:
Sophia Hadash, Yu Liang, and Martijn C. Willemsen. 2019. How playlist evaluation compares to track evaluations in music recommender systems. In Proceedings of the Joint Workshop on Interfaces and Human Decision Making for Recommender Systems (IntRS '19). CEUR-WS.org, 9 pages.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). IntRS '19: Joint Workshop on Interfaces and Human Decision Making for Recommender Systems, 19 September 2019, Copenhagen, Denmark.

1 INTRODUCTION
In user-centric evaluation of personalized music recommendation, users are usually asked to indicate their degree of liking of individual tracks [2, 4, 5, 14] or to provide a holistic assessment of the entire playlist generated by the recommendation approach (e.g. playlist satisfaction or playlist attractiveness) [6, 9, 12, 16, 17]. Most recommender evaluations focus on the first type of evaluation to test algorithmic performance: can we accurately predict the liking of an individual track? Many user-centric studies in the field [15], however, focus on the second metric: does the list of recommendations provide a satisfactory experience? Often these studies find that playlist satisfaction is not just about the objective or subjective accuracy of the playlist, but also depends on the difficulty of choosing from the playlist or on playlist diversity [27]. For example, in the music domain the perceived diversity of a playlist has been shown to have a negative effect on overall playlist attractiveness [7]. Bollen et al. [3] showed that people were just as satisfied with a list of 20 movie recommendations that included the top-5 list and a set of lower-ranked items (the twentieth item being the 1500th best rank) as with a list of the best 20 recommendations (top-20 ranked).
Research in psychology also shows that people's memory of an overall experience is influenced by the largest peak and the end of the experience rather than by the average of the moment-to-moment experience [13]. Similar effects might occur when we ask users to evaluate a list holistically on attractiveness: they might be triggered more by particular items in the list (i.e. ones that they recognize as great (or bad), ones that are familiar rather than unknown, cf. the mere exposure effect [25]), and therefore their overall impression might not simply be the mean of the individual ratings.

These results from earlier recommender research and from psychological research suggest that overall (holistic) playlist evaluation is not just a reflection of the average liking or rating of the individual items. However, to the best of our knowledge, no previous work has explored the relation between users' evaluations of individual tracks and their overall playlist evaluation. To some extent this is because it is not common for both types of data to be collected in the same study. Therefore, in this work we investigate how individual item evaluations relate to holistic evaluations in sequential music recommender systems.

We explore these relations using a within-subjects online experiment, in which users are asked to give individual track ratings as well as their overall perception of playlist attractiveness and diversity in three conditions: (1) a track and artist similarity algorithm (base), (2) the track and artist similarity algorithm combined with a genre similarity algorithm (genre), and (3) the track and artist similarity algorithm combined with an audio feature algorithm (gmm). The track and artist similarity algorithm can be regarded as a low-spread strategy, since recommendations are generated from a small subset of the total pool of tracks relatively close to the user's tastes [8]. Both the genre approach and the gmm approach are high-spread strategies, which generate user-track ratings for a large proportion of the total pool of tracks.

In this study, we are interested in how the perceived attractiveness of the playlist is related to perceived playlist diversity and individual track ratings across the three conditions. In addition, we include a set of objective features of the playlist in the analysis. We test whether users' perceived attractiveness of the playlist is also affected by (1) the peak-end effect: the track they like most and the end track, (2) their familiarity with the recommendations in the playlist, and (3) the occurrence of good recommendations in the playlist: people might be satisfied with a playlist as long as at least some recommendations are good.

2 RELATED WORK

2.1 User-centric evaluation in music recommendation
User-centric evaluation of recommendation approaches is necessary in order to understand users' perception of the given recommendations [15], such as their acceptance or satisfaction [23, 24].

User-centric evaluation in music recommendation can take place at the individual track level or at the whole playlist level. Users' perception of whole playlists is often measured in the context of automatic playlist generation [20], smooth track transitions [9], or when the goal is to evaluate the whole recommender system [12, 17]. For example, users were asked to indicate their perception of the recommended playlists to investigate how different settings of control in the recommender system influence their cognitive load as well as their acceptance of the recommendations [12]. However, when it comes to the evaluation of recommendation algorithms, users are often asked to indicate a rating [2, 4] for each individual track rather than for the playlist as a whole, neglecting the fact that tracks are often listened to in succession or within a playlist.

Individual item ratings cannot fully reflect users' degree of liking of the recommendation list. Perceived diversity, for instance, is a factor that can only be measured at the list level. Willemsen et al. [27] have shown that the perceived diversity of a movie recommendation list has a positive effect on perceived list attractiveness and that higher perceived diversity makes it easier for users to make a choice from the recommendations. Ekstrand et al. [6] also show that perceived diversity has a positive effect on user satisfaction. In the music domain, Ferwerda et al. [7] found that perceived diversity has a negative effect on the perceived attractiveness of the recommendation list; however, this effect turns positive when the recommendation list helps users to discover new music and enrich their music tastes.

The novel contribution of this work is that we include both types of measurement in one study of personalized music recommendations, aiming to uncover the relation between individual track evaluation and holistic evaluation of music playlists.

2.2 Peak-end effect
Research in psychology has looked into the differences between the 'remembering self' and the 'experiencing self' [13], as reflected in the peak-end rule: the memory of the overall experience of a painful medical procedure is not simply the sum or average of the moment-to-moment experience, but the average of the largest peak and the end of the experience.

In the music domain, several studies have found that the remembered intensity of a music listening experience is highly correlated with the peak, peak-end, and average moment-to-moment experience [21, 22]. However, Wiechert [26] argues that these studies fail to consider users' personal musical preferences and that the peak-end value and the average value measured in these studies might be correlated with each other. Rather than giving participants the same stimuli, Wiechert gave participants a list of songs based on their current musical preferences and proposed a new metric, the pure peak-end value (the difference between peak-end and average). He found that while the average experience could explain a significant part of the variance in playlist experience, the pure peak-end value could explain a part of the variance that was not explained by the average.
3 METHOD
In this study three algorithms are used for generating playlists. These algorithms are designed to use user preferences in the form of (ordered) lists of tracks, artists, or genres a user is known to like. The advantage of this input form is that the algorithms can be used with user preferences obtained from commercial platforms. In this study Spotify user profiles are used, obtained via the Spotify Web API (https://developer.spotify.com/documentation/web-api/). These preferences take the form of ordered lists of top tracks and artists. The first algorithm is based on track and artist similarity. The second algorithm uses a genre similarity metric based on genre co-occurrence among artists. The third algorithm recommends tracks based on a Gaussian mixture model on track features derived from audio analyses (see [10] for details). All algorithms are described in detail in [8].
3.1 Track and artist similarity algorithm
The track and artist similarity algorithm is a combination of the same sub-algorithm applied to both a list of tracks and a list of artists the user is known to like. The input to this sub-algorithm is a list of items, potentially ordered by user likeability. The sub-algorithm uses Spotify's seed recommendation system to explore items that are similar to the input. Based on the occurrence of items in the results, an output list is generated with user likeability prediction scores. The algorithm is formulated in Algorithm 1, with an illustration by example in Figure 1.

[Figure 1: Illustration of the track and artist similarity algorithm using an example.]

Algorithm 1 Track and artist similarity
s_i: score of item i; x: temporary score of item i; recs: recommendation set; N: number of sibling nodes; pos_node: position of the node among its parent's children; s_min, s_max: scores assigned to the last and first sibling node at the current tree depth.

for each item i in recs do
    s_i = 0
    for each node j as current_node where node j is item i do
        x = current_node.score
        while current_node.parent not null do
            current_node = current_node.parent
            x = x * current_node.score
        end while
        s_i = s_i + x
    end for
end for
return (recs, s) ordered by s descending

def node.score:
    return (N - pos_node) / N * (s_max - s_min) + s_min
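For illustration, the scoring step of Algorithm 1 could be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: the tree representation of the seed-recommendation results, the Node class, the 0-based sibling indexing, and the score range (s_min, s_max) are assumptions made for the example.

# Sketch of Algorithm 1: each occurrence of a recommended item contributes the
# product of position-based scores along its path to the root; contributions
# are summed per item and items are returned in descending score order.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Node:
    item: str                      # track or artist id
    position: int                  # position among siblings (0 = first; assumption)
    n_siblings: int                # number of sibling nodes at this depth
    parent: "Node | None" = None
    s_min: float = 0.1             # score of the last sibling (assumed value)
    s_max: float = 1.0             # score of the first sibling (assumed value)

    def score(self) -> float:
        # Linear position-based score, as in the node.score definition of Algorithm 1.
        return (self.n_siblings - self.position) / self.n_siblings * (self.s_max - self.s_min) + self.s_min

def likeability_scores(nodes: list["Node"]) -> dict[str, float]:
    scores: dict[str, float] = defaultdict(float)
    for node in nodes:
        x = node.score()
        current = node.parent
        while current is not None:          # multiply scores up to the root
            x *= current.score()
            current = current.parent
        scores[node.item] += x
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))

# Toy usage: one seed with two result tracks.
seed = Node("seed_track", 0, 1)
results = [Node("track_x", 0, 2, parent=seed), Node("track_y", 1, 2, parent=seed)]
print(likeability_scores(results))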
3.2 Genre similarity algorithm
The genre similarity algorithm uses an ordered list of genres the user likes, S'_u (a column vector expressing the user's degree of liking of all genres, built from the user's top artists), and a similarity metric D to generate genre likeability scores for the other genres. The resulting extrapolated list S_u is then used to favor recommendations from genres with high likeability scores.

There are 1757 different genres available in our dataset; therefore both S'_u and S_u are column vectors of dimension 1757 and the matrix D is of dimension 1757 × 1757.

The similarity metric is based on a co-occurrence analysis of artists, similar to the methodology used in [19]. The co-occurrence analysis used a database of approximately 80,000 artists; for each artist it was known which genres he/she produced music in. The data were extracted from Spotify's developer API. The co-occurrence analysis generated a normalized symmetric similarity matrix D. The likeability scores of the user for the full list of genres are then computed as follows, where I is the identity matrix:

    S_u = (D + I) S'_u    (1)
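A minimal sketch of this extrapolation step follows, assuming a toy co-occurrence matrix and an initial genre-liking vector built from the user's top artists; the normalization and the random data are placeholders, not the authors' pipeline.

# Eq. (1): extrapolate genre liking to co-occurring genres while keeping the
# user's original genre scores via the identity matrix I.
import numpy as np

n_genres = 1757
rng = np.random.default_rng(0)

D = rng.random((n_genres, n_genres))
D = (D + D.T) / 2                    # make the toy matrix symmetric
D /= D.max()                         # toy normalization (the paper's exact scheme is in [8])

S_u_prime = np.zeros(n_genres)       # initial degree of liking per genre from top artists
liked = rng.choice(n_genres, size=10, replace=False)
S_u_prime[liked] = np.linspace(1.0, 0.1, 10)

S_u = (D + np.eye(n_genres)) @ S_u_prime
favored_genres = np.argsort(S_u)[::-1][:20]   # genres to favor when recommending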
3.3 Audio feature algorithm
The audio feature algorithm clusters tracks with similar audio features using a Gaussian mixture model (GMM). A database of approximately 500,000 tracks with 11 audio analysis features was used to train the model. The audio features consisted of measures for danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, and tempo. Multiple GMMs were fitted using the expectation-maximization (EM) algorithm for varying numbers of components. The model with 21 components had the lowest BIC and was therefore selected. Cluster likeability was then computed as follows (see [8]), where N_top is the number of top tracks of the user:

    p(user likes cluster i) = (1 / N_top) Σ_{j=1}^{N_top} p(track j belongs to cluster i)    (2)

Finally, the output recommendations favored tracks corresponding to clusters with high user likeability probabilities.
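A hedged sketch of this approach using scikit-learn is given below; the feature matrices are random stand-ins, and the candidate component counts and the final track scoring are assumptions made for the example.

# Fit GMMs for several component counts, select the one with the lowest BIC,
# and compute the cluster likeability of Eq. (2) from the user's top tracks.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.random((5000, 11))           # stand-in for ~500,000 tracks x 11 audio features
X_top = rng.random((50, 11))         # stand-in for the user's top tracks

candidates = [GaussianMixture(n_components=k, random_state=0).fit(X) for k in range(5, 31, 5)]
gmm = min(candidates, key=lambda m: m.bic(X))     # the paper ended up with 21 components

# Eq. (2): mean over the top tracks of the posterior cluster memberships.
cluster_likeability = gmm.predict_proba(X_top).mean(axis=0)

# One way to favor tracks from well-liked clusters (assumption): weight each
# candidate track's cluster memberships by the cluster likeability.
candidate_tracks = rng.random((1000, 11))
track_scores = gmm.predict_proba(candidate_tracks) @ cluster_likeability
recommended = np.argsort(track_scores)[::-1][:10]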
3.4 Familiarity of the recommendations to the users
Both the track and artist similarity algorithm and the genre similarity algorithm generate recommendations close to the users' known preferences: recommendations are based on artists and genres that are familiar to the user. The audio feature algorithm, on the other hand, recommends tracks based on audio feature similarity. As a result, its recommended tracks are more likely to have genres and artists that are less familiar to the users.

4 EXPERIMENTAL DESIGN
To evaluate the relation between track evaluations and playlist evaluations, a within-subjects online experiment was conducted. The study included three conditions in randomized order: the track and artist algorithm (base), the track and artist algorithm combined with the genre similarity algorithm (genre), and the track and artist algorithm combined with the audio feature algorithm (gmm). In each condition participants were presented with a playlist of 10 tracks generated by the corresponding algorithm and evaluated the individual tracks on likeability and personalization and the playlist as a whole on attractiveness and diversity. The playlist included the top 3 recommendations and the 20th, 40th, 60th, 80th, 100th, 200th, and 300th recommendation, in random order (see the sketch at the end of this section). Lower-ranked recommendations were included so that algorithm performance could be evaluated more easily, as lower-ranked recommendations should result in lower user evaluations.

4.1 Participant Recruitment
Participants were primarily recruited using the JF Schouten participant database of Eindhoven University of Technology. Some participants were recruited by invitation. Participants were required to have a Spotify account (free or Premium) and to have used this account prior to taking part in the study.

4.2 Materials
The track evaluations included likeability and personalization measures. A single item was used for each measure, given the repetitive nature of individual track evaluations. The question for measuring track likeability was: "Rate how much you like the song". For measuring perceived track personalization we used the following item: "Rate how well the song fits your personal music preferences". Both questions were answered on a 5-point visual scale with halves (thus 10 actual options) containing star and heart icons, as shown in Figure 2.

The playlist evaluation included playlist attractiveness and playlist diversity; the items are presented in Table 1.

Table 1: The playlist evaluation scale (adapted from [27]).
Perceived attractiveness (alpha = .94):
- The playlist was attractive
- The playlist showed too many bad items
- The playlist matched my preferences
Perceived diversity (alpha = .85):
- The playlist was varied
- The tracks differed a lot from each other on different aspects
- All the tracks were similar to each other

Additional scales used in the study were a demographics scale and the Goldsmiths Musical Sophistication Index (MSI) [18]. The demographics scale measured gender, age, and Spotify usage. Spotify usage was measured using a single item, "I listen to Spotify for __ hours a week", with 7 range options.

4.3 Study Procedure
After consenting, participants were prompted with a login screen where they could connect their Spotify account to the study. Participants who did not have a Spotify account, or whose Spotify account contained no user preference data, could not continue with the study. After the Spotify login, participants completed a background survey in which they reported their Spotify usage and music sophistication.

Following the background survey, participants entered the track evaluation phase, in which a playlist generated by one of the algorithms was presented. The interface (see Figure 2) contained an interactive panel showing the tracks of the playlist, a survey panel in which they had to rate the tracks, and a music control bar. Participants could freely browse through the playlist while providing the ratings. After all ratings were provided, participants entered the playlist evaluation phase, in which they answered the playlist evaluation questions (Table 1). The track evaluation phase and playlist evaluation phase were then repeated for the remaining conditions.

[Figure 2: Preview of the track rating screen as displayed to the participants during the study.]

Finally, participants were thanked for their time and were entered into a reward raffle: among every 5 participants, one participant received 15 euro compensation. In total the study lasted approximately 15 minutes.
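The composition of each 10-track evaluation playlist can be illustrated with a short sketch; the ranked recommendation list is a toy stand-in, and only the rank positions named above are taken from the study design.

# Build the evaluation playlist: top-3 recommendations plus the 20th, 40th,
# 60th, 80th, 100th, 200th and 300th recommendation, shuffled into random order.
import random

def build_evaluation_playlist(ranked_tracks: list[str], seed: int | None = None) -> list[str]:
    picks = [1, 2, 3, 20, 40, 60, 80, 100, 200, 300]   # 1-based ranks
    playlist = [ranked_tracks[rank - 1] for rank in picks]
    random.Random(seed).shuffle(playlist)
    return playlist

ranked = [f"track_{i}" for i in range(1, 501)]          # toy ranked recommendations
print(build_evaluation_playlist(ranked, seed=42))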
5 RESULTS
Participants in this study included 59 people, of whom 54 were recruited through the JF Schouten database. The sample consisted of 31 males and 28 females. The age of the participants ranged from 19 to 64 (M = 25.6, SD = 8.8). On average participants listened to Spotify for 7 to 10 hours per week. MSI scores ranged between 0 and 5 (M = 2.18, SD = 1.0). The study took place between the 9th of January and the 1st of February 2019.

We found no effect of the personalization ratings on perceived attractiveness, while the likeability ratings could partially predict perceived attractiveness. Furthermore, playlist attractiveness was more strongly related to the recommendation algorithm. Playlists in the gmm condition were evaluated less positively than playlists in the other conditions, even though the track evaluations were similar on average. In other words, while participants evaluated tracks across conditions similarly, the playlist evaluations differed substantially (see Figure 3).

[Figure 3: Participants' subjective evaluations of individual tracks (left) and playlists (right). The error bars indicate the standard error.]

5.1 Overview of statistical methods
The results are analyzed using three methodologies. The first methodology concerns the performance of the recommendation algorithms. This was analyzed using descriptive statistics on the relation between the recommendation scores predicted by the algorithms and the user ratings.

In the second methodology, the relation between playlist evaluations and track ratings was analyzed at the playlist level (i.e. 3 observations per user). Here an aggregate measure of the track evaluations was used; more specifically, three aggregation measures: the mean (Model 1), the peak-end value (Model 2), and the occurrence of at least a 3-star rating (Model 3). Using these aggregates, a linear mixed-effects model was fitted so that variation in participants' answering styles could be included as a random effect. Playlist diversity and the recommendation approach were included as fixed effects in Models 1a, 2a, and 3a, and interaction effects were added in Models 1b, 2b, and 3b.

Finally, the last methodology explores how variation at the track level may explain playlist attractiveness. This analysis used a linear mixed-effects model at the track level (i.e. 3 × 10 observations per user; see Table 3, Model 4) with participants modelled as a random effect, similar to the playlist-level analysis. For the track-level variables, four indicators were included in addition to the rating, condition, and diversity. The first indicates whether the track was high-ranked (top-3 recommendation) or low-ranked (rank 20 to 300). The second indicates for each track whether it received the highest rating in the playlist; thus, if a user gave two 4-star ratings and 8 lower ratings, the variable would mark those two tracks with a 1 and the others with 0. The third indicator is familiarity, which shows whether a track was predicted to be familiar to the user based on their top tracks and artists. The last indicator is the playlist order: it indicates whether the track was part of the list that the user evaluated first, second, or third.
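The three aggregation measures used in Models 1-3 can be written down compactly; the sketch below is illustrative (the example ratings are made up, and the ratings are assumed to be in presentation order so that the last element is the end track).

# Playlist-level aggregation measures of the ten track ratings.
def aggregate_ratings(ratings: list[float]) -> dict[str, float]:
    return {
        "mean": sum(ratings) / len(ratings),              # Model 1
        "peak_end": (max(ratings) + ratings[-1]) / 2,     # Model 2: highest and last rating
        "positive": float(max(ratings) >= 3.0),           # Model 3: at least one 3-star rating
    }

playlist_ratings = [2.0, 3.5, 4.0, 1.5, 2.5, 3.0, 2.0, 4.5, 3.0, 2.5]
print(aggregate_ratings(playlist_ratings))
# {'mean': 2.85, 'peak_end': 3.5, 'positive': 1.0}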
5.2 Model results

5.2.1 Algorithm performance. The relation between recommendation scores and user evaluations of tracks is depicted in Figure 4. The figure indicates that the algorithms differ in their performance on track evaluations, which is supported by an analysis of variance (ANOVA), F(2, 171) = 36.8, p < .001. The graph shows that for all algorithms higher recommendation scores result in higher user ratings, showing that tracks that are predicted to be liked better indeed get higher ratings. However, consistent with Figure 3, the scores for the base condition are consistently higher than for the other two algorithms. For the genre condition the slope seems to be steeper than for the other two conditions, showing that in this condition user ratings are more sensitive to the predicted recommendation scores.

[Figure 4: The relation between subjective user ratings and recommendation scores predicted by the algorithms. The user ratings are slightly jittered for the scatterplot only. The shaded area represents the 95% confidence interval.]

5.2.2 Playlist-level relation between track evaluations and playlist evaluations. In this analysis, the effect of track evaluations on playlist evaluations is explored at the playlist level, using the three aggregation measures (Models 1-3).

The effect of track evaluations on playlist attractiveness is illustrated in Figure 5. All three aggregation measures are very similar in predicting playlist attractiveness (see Table 2). We see a positive effect of the aggregation measure, indicating that if a user scores higher on that measure, she also finds the playlist more attractive, together with negative effects of the genre and gmm conditions, consistent with the pattern in Figure 3 that gmm and genre score lower than the base condition. The aggregate indicating the occurrence of at least a 3-star rating (Model 3) is a slightly worse predictor of playlist attractiveness than the mean and peak-end measures.

[Figure 5: Playlist attractiveness by track rating (mean). The dot size indicates the number of duplicate items of the playlist in the playlists of the other conditions.]

Table 2: Playlist attractiveness by aggregated track evaluations (playlist-level). Standard errors in parentheses. The models are grouped by the method used for aggregating track evaluations: 'Mean' = mean value, 'Peak-end' = average of the highest rating and the last rating, 'Positive' = indicator for the occurrence of at least a 3-star evaluation. *** p < .001; ** p < .01; * p < .05.

                            Mean                      Peak-end                  Positive
                            Model 1a     Model 1b     Model 2a     Model 2b     Model 3a     Model 3b
rating (aggregate)          0.319***     0.071        0.274**      0.098        0.104*       0.022
                            (0.095)      (0.165)      (0.091)      (0.151)      (0.043)      (0.071)
genre                       -0.090*      -0.665***    -0.081*      -0.643***    -0.095*      -0.503***
                            (0.039)      (0.174)      (0.039)      (0.188)      (0.038)      (0.139)
gmm                         -0.364***    -0.741***    -0.351***    -0.840***    -0.356***    -0.730***
                            (0.039)      (0.175)      (0.038)      (0.194)      (0.038)      (0.129)
diversity                   -0.059       -0.416**     -0.067       -0.424**     -0.078       -0.419**
                            (0.074)      (0.133)      (0.074)      (0.132)      (0.074)      (0.134)
rating (aggregate):genre                 0.581*                    0.422*                    0.162
                                         (0.228)                   (0.215)                   (0.101)
rating (aggregate):gmm                   0.127                     0.230                     0.118
                                         (0.224)                   (0.217)                   (0.101)
genre:diversity                          0.428*                    0.444*                    0.481**
                                         (0.183)                   (0.183)                   (0.185)
gmm:diversity                            0.526**                   0.546**                   0.475**
                                         (0.174)                   (0.174)                   (0.178)
Constant                    0.512***     0.865***     0.492***     0.836***     0.620***     0.889***
                            (0.076)      (0.135)      (0.086)      (0.141)      (0.062)      (0.102)
N                           176          176          176          176          176          176
Log Likelihood              17.052       24.794       17.007       23.711       15.602       21.237
AIC                         -20.105      -27.588      -20.013      -25.421      -17.204      -20.475
BIC                         2.089        7.287        2.180        9.454        4.990        14.401
R²_GLMM(m)                  .351         .409         .342         .401         .330         .383
# of Participants           59           59           59           59           59           59
Participant SD              0.063        0.053        0.08         0.054        0.083        0.064

When the interaction effects are included, the main effect of the ratings is no longer significant (Models 1b, 2b, and 3b), but we obtain several interactions of rating with condition and of condition with diversity. The interaction effects of condition with perceived diversity and with track evaluations are visualized in Figure 6 by separating the resulting effects by condition; we discuss each condition and its interactions separately.

[Figure 6: Linear model of playlist attractiveness by track ratings and condition, for each condition.]

The track evaluations had no effect on the playlist evaluation in the base condition (they do for the other two conditions, as we will see below). Moreover, in the base condition perceived diversity has a negative effect, indicating that playlists with high perceived diversity were less attractive than playlists with low perceived diversity. One potential explanation could be that, since these playlists were constructed using a low-spread approach, the recommendations were closely related to the users' known preferences (i.e. their top tracks that feed our algorithms). Therefore, the diversity in these users' preferences may have influenced the diversity of the recommended playlist. For instance, a person may listen to different genres during varying activities such as working and sporting. The recommendations could then include music based on all these genres. While all recommendations are then closely related to the user's preferences and could potentially receive high evaluations, the playlist may not be very attractive due to the diversity in genres.

In the genre condition, perceived diversity had no effect on playlist attractiveness. In this condition the track evaluations strongly predicted playlist attractiveness, regardless of diversity. The results show that although the genre playlists on average receive a lower attractiveness score than the base playlists, this effect is reduced when the aggregate ratings of the list are higher: in other words, only if users like the genre tracks do they like the playlist as much as the base playlist, which contains more low-spread, familiar tracks.

The gmm condition showed results similar to the genre condition. Perceived diversity predicted attractiveness only marginally. However, while the track evaluations strongly predict attractiveness in the genre condition, they are only a weak predictor in the gmm condition. In other words, high aggregate ratings cannot really make up for the fact that the gmm list is in general evaluated worse than the base list. As in the genre condition, this recommendation algorithm uses a high-spread approach and includes novel track recommendations. However, the gmm approach recommends tracks based on audio feature similarity, in contrast to genre similarity. Regardless of diversity or individual track evaluations, playlists generated with this approach were less attractive to participants.

Overall we find that the overall attractiveness of a playlist is not always directly related to the liking of the individual tracks as reflected by the aggregate ratings, whether this is the mean rating, the peak-end value, or the fact that at least one track is highly rated. Some conditions are more sensitive to these aggregate ratings (genre) than others. We also see an important (negative) role of diversity for the base condition in predicting overall attractiveness, but no effect in the other two conditions. In other words, different aspects affect playlist evaluation, as recognized in the literature, but this strongly depends on the nature of the underlying algorithm generating the recommendations.
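For concreteness, a self-contained toy sketch of a playlist-level interaction model in the spirit of Models 1b-3b is given below; the simulated data, the column names, and the use of statsmodels are illustrative assumptions, not the authors' analysis code.

# Mixed-effects model of playlist attractiveness with rating x condition and
# diversity x condition interactions and a random intercept per participant.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
rows = []
for participant in range(59):
    for condition in ["base", "genre", "gmm"]:
        rating = rng.uniform(0, 1)          # aggregate track rating (e.g. mean), rescaled 0-1
        diversity = rng.uniform(0, 1)       # perceived diversity, rescaled 0-1
        attractiveness = (0.5 + 0.3 * rating - 0.2 * (condition != "base")
                          - 0.1 * diversity + rng.normal(0, 0.1))
        rows.append(dict(participant=participant, condition=condition,
                         rating=rating, diversity=diversity,
                         attractiveness=attractiveness))
data = pd.DataFrame(rows)

model = smf.mixedlm("attractiveness ~ rating * C(condition) + diversity * C(condition)",
                    data, groups=data["participant"]).fit()
print(model.summary())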
5.2.3 Track-level relation between track evaluations and playlist evaluations. In this analysis, the effect of track evaluations on playlist evaluations is explored at the track level, trying to predict the overall attractiveness of each list from the individual track ratings rather than from the aggregate ratings. The results are shown in Table 3. The four types of track-level variables described in Section 5.1 are included in the analysis.

Table 3: Playlist attractiveness by track evaluations (track-level). Standard errors in parentheses. 'High-ranked' indicates the track was one of the top-3 recommendations, 'highest rating' indicates the track received the highest rating within that playlist for the participant, 'familiar' indicates whether the track was known to be familiar to the participant, and 'playlist order' indicates whether the playlist was the first (=1), second (=2), or third (=3) list that the participant evaluated. Interaction terms as in Models 1-3 were omitted due to similarity to these models. *** p < .001; ** p < .01; * p < .05.

                      Model 4
rating                -0.009 (0.020)
genre                 -0.095*** (0.009)
gmm                   -0.352*** (0.009)
diversity             -0.027*** (0.006)
high-ranked           0.003 (0.009)
highest rating        0.002 (0.012)
familiar              -0.011 (0.012)
playlist order        0.012* (0.005)
Constant              0.704*** (0.029)
N                     1850
Log Likelihood        630.272
AIC                   -1238.544
BIC                   -1177.791
R²_GLMM(m)            .307
# of Participants     58
Participant SD        0.156

Whether a track was high-ranked or received the highest rating shows no significant effect on the perceived attractiveness of the playlist. The track-level objective familiarity measures whether the user is familiar with the artists of a track: the user is considered familiar with a track if at least one artist of the track also appears among the artists related to the user's top listened tracks. Although we expected a positive effect of familiarity on playlist attractiveness (as also shown in [7]), no significant effect was observed in Model 4. A possible reason is that the objective familiarity measure does not cover all tracks that the user is familiar with, since it is only computed from the user's top tracks (at most 50 per user). In future work, we plan to directly ask for (self-reported) familiarity rather than deriving it from the data. We also calculated a graded familiarity score for each track (how familiar the user is with the track) and found a positive correlation between objective familiarity and track ratings (r_s(1770) = 0.326, p < .001): users give higher ratings to tracks they are more familiar with, which is in line with previous work on the mere exposure effect [1].
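The objective familiarity indicator can be illustrated with a short sketch; the data structures, the toy artist sets, and the correlation check are assumptions for the example (the correlation reported above uses the graded familiarity score rather than this binary indicator).

# A track counts as familiar if at least one of its artists appears among the
# artists related to the user's top listened tracks (at most 50 top tracks).
from scipy.stats import spearmanr

def is_familiar(track_artists: set[str], top_related_artists: set[str]) -> int:
    return int(bool(track_artists & top_related_artists))

top_related = {"artist_a", "artist_b", "artist_c"}
rated_tracks = [({"artist_a"}, 4.5), ({"artist_d"}, 2.0),
                ({"artist_b", "artist_e"}, 4.0), ({"artist_f"}, 2.5)]
familiarity = [is_familiar(artists, top_related) for artists, _ in rated_tracks]
ratings = [rating for _, rating in rated_tracks]
rho, p = spearmanr(familiarity, ratings)   # rank correlation between familiarity and rating
print(familiarity, round(rho, 3))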
Playlist order is also a weak predictor of playlist attractiveness: participants perceived the last playlist as the most attractive and the first as the least attractive. However, when interaction terms as in Models 1-3 are included, this effect is no longer significant. We also checked the condition orders generated by the random generator and found that each condition order occurred approximately equally often. In other words, the effect of condition order cannot explain the differences across conditions.

6 DISCUSSION OF RESULTS
We found that participants evaluate playlists on more aspects than merely the likeability of their tracks. Even though the tracks in recommended playlists may be accurate and receive positive user evaluations, the playlists can still be evaluated negatively. In particular, the recommendation approach itself plays a role in the overall perceived playlist attractiveness.

One explanation may be that users have several distinct musical styles. Playlists that contain music from more than one of a user's styles may be less attractive to that user even though the track recommendations are accurate. Playlists in the base condition are the most attractive, but suffer most from diversity. Users with multiple musical styles may have received playlists with music from multiple styles, which could have been reflected in the perceived diversity of the playlist. Playlists in the genre condition were also based on genre similarity, in addition to track and artist similarity. Therefore, if multiple musical styles are present in the user preferences, it is more likely in the genre condition that the musical style with the highest overall contribution overrules the music from the other styles. Furthermore, the gmm condition is the least attractive. The recommendation algorithm used in this condition is based on audio feature similarity. Although the tracks recommended in this condition were similar to the user preferences in terms of audio features, they could be dissimilar in terms of more comprehensible attributes such as genre and artists. It is likely that music from multiple musical styles was present in these playlists.

Another explanation may be the methodology of evaluation. While tracks are evaluated at the moment they are experienced, playlist evaluation occurs only after the tracks are experienced. Therefore, playlist evaluations are based on what users remember from the list. This difference may lead to differences in evaluation styles. Although this may explain why differences occur between track and playlist evaluations, it cannot explain why the different recommendation approaches lead to different playlist attractiveness evaluations. Furthermore, under this explanation we would have expected a model improvement from the inclusion of the peak-end measure, which specifically models how users remember different moments of their overall experience while listening to a playlist [26]. However, the peak-end measure resulted in effects similar to the standard mean-aggregated rating.

Regardless of the explanation, the results show that playlist attractiveness is not primarily related to the likeability of its tracks and that other factors such as diversity can play a role.

7 CONCLUSION AND FUTURE WORK
While playlist evaluations can be partly predicted by the evaluations of their tracks, other factors of the playlist are more predictive. People seem to evaluate playlists on aspects other than merely their tracks: even when individual tracks were rated positively, playlist attractiveness could be low.

We found that both diversity and the recommendation approach affected playlist attractiveness. Diversity had a negative effect on playlist attractiveness in the recommender using a low-spread methodology. The track ratings were the most predictive of playlist attractiveness in the recommendation approach based on genre similarity. Furthermore, including only the highest and the last track evaluation score (peak-end) was sufficient to predict playlist attractiveness, performing just as well as the mean of the ratings.

When evaluating recommendation approaches in music recommenders, it is important to consider which evaluation metric to use. Music is often consumed in succession, so many factors other than track likeability may influence whether people have a satisfactory experience. Although individual track evaluations are often used in recommender evaluation, track evaluations do not seem to predict playlist attractiveness very consistently.

While we showed that playlist attractiveness is not primarily related to track evaluations, we were unable to effectively measure why certain algorithms generated more attractive playlists than others. This question will be addressed in future work. We intend to include a subjective measure of track familiarity. Furthermore, we will identify and attempt to separate distinct musical styles within user preferences. For example, we could give users control over which top artists or top tracks they would like to use to generate recommendations, as in [11], to separate the tracks and artists they like in different contexts.
REFERENCES
[1] Luke Barrington, Reid Oda, and Gert R. G. Lanckriet. 2009. Smarter than Genius? Human Evaluation of Music Recommender Systems. In ISMIR, Vol. 9. Citeseer, 357-362.
[2] Dmitry Bogdanov, Martín Haro, Ferdinand Fuhrmann, Anna Xambó, Emilia Gómez, and Perfecto Herrera. 2013. Semantic audio content-based music recommendation and visualization based on user preference examples. Information Processing & Management 49, 1 (2013), 13-33.
[3] Dirk Bollen, Bart P. Knijnenburg, Martijn C. Willemsen, and Mark Graus. 2010. Understanding choice overload in recommender systems. In Proceedings of the Fourth ACM Conference on Recommender Systems. ACM, 63-70.
[4] Òscar Celma and Perfecto Herrera. 2008. A new approach to evaluating novel recommendations. In Proceedings of the 2008 ACM Conference on Recommender Systems. ACM, 179-186.
[5] Zhiyong Cheng. 2011. Just-for-Me: An Adaptive Personalization System for Location-Aware Social Music Recommendation. (2011).
[6] Michael D. Ekstrand, F. Maxwell Harper, Martijn C. Willemsen, and Joseph A. Konstan. 2014. User perception of differences in recommender algorithms. In Proceedings of the 8th ACM Conference on Recommender Systems (RecSys '14), 161-168. https://doi.org/10.1145/2645710.2645737
[7] Bruce Ferwerda, Mark P. Graus, Andreu Vall, Marko Tkalcic, and Markus Schedl. 2017. How item discovery enabled by diversity leads to increased recommendation list attractiveness. In Proceedings of the Symposium on Applied Computing. ACM, 1693-1696.
[8] Sophia Hadash. 2019. Evaluating a framework for sequential group music recommendations: A modular framework for dynamic fairness and coherence control. Master's thesis. Eindhoven University of Technology. https://pure.tue.nl/ws/portalfiles/portal/122439578/Master_thesis_shadash_v1.0.1_1_.pdf
[9] Shobu Ikeda, Kenta Oku, and Kyoji Kawagoe. 2018. Music Playlist Recommendation Using Acoustic-Feature Transition Inside the Songs. (2018), 216-219. https://doi.org/10.1145/3151848.3151880
[10] Tristan Jehan and David Desroches. 2004. Analyzer Documentation [version 3.2]. Technical Report. The Echo Nest Corporation, Somerville, MA. http://docs.echonest.com.s3-website-us-east-1.amazonaws.com/_static/AnalyzeDocumentation.pdf
[11] Yucheng Jin, Bruno Cardoso, and Katrien Verbert. 2017. How do different levels of user control affect cognitive load and acceptance of recommendations? In CEUR Workshop Proceedings, Vol. 1884. 35-42.
[12] Yucheng Jin, Nava Tintarev, and Katrien Verbert. 2018. Effects of personal characteristics on music recommender systems with different levels of controllability. (2018), 13-21. https://doi.org/10.1145/3240323.3240358
[13] Daniel Kahneman. 2011. Thinking, Fast and Slow. Macmillan.
[14] Iman Kamehkhosh and Dietmar Jannach. 2017. User Perception of Next-Track Music Recommendations. (2017), 113-121. https://doi.org/10.1145/3079628.3079668
[15] Bart P. Knijnenburg, Martijn C. Willemsen, Zeno Gantner, Hakan Soncu, and Chris Newell. 2012. Explaining the user experience of recommender systems. User Modeling and User-Adapted Interaction 22, 4-5 (2012), 441-504.
[16] Arto Lehtiniemi and Jukka Holm. 2011. Easy Access to Recommendation Playlists: Selecting Music by Exploring Preview Clips in Album Cover Space. In Proceedings of the 10th International Conference on Mobile and Ubiquitous Multimedia, 94-99. https://doi.org/10.1145/2107596.2107607
[17] Martijn Millecamp, Nyi Nyi Htun, Yucheng Jin, and Katrien Verbert. 2018. Controlling Spotify Recommendations. (2018), 101-109. https://doi.org/10.1145/3209219.3209223
[18] Daniel Müllensiefen, Bruno Gingras, Lauren Stewart, and Jason Ji. 2013. Goldsmiths Musical Sophistication Index (Gold-MSI) v1.0: Technical Report and Documentation Revision 0.3. Technical Report. Goldsmiths, University of London, London. https://www.gold.ac.uk/music-mind-brain/gold-msi/
[19] F. Pachet, G. Westermann, and D. Laigre. 2001. Musical data mining for electronic music distribution. In Proceedings of the 1st International Conference on WEB Delivering of Music (WEDELMUSIC 2001), 101-106. https://doi.org/10.1109/WDM.2001.990164
[20] Steffen Pauws and Berry Eggen. 2003. Realization and user evaluation of an automatic playlist generator. Journal of New Music Research 32, 2 (2003), 179-192.
[21] Alexander Rozin, Paul Rozin, and Emily Goldberg. 2004. The feeling of music past: How listeners remember musical affect. Music Perception: An Interdisciplinary Journal 22, 1 (2004), 15-39.
[22] Thomas Schäfer, Doreen Zimmermann, and Peter Sedlmeier. 2014. How we remember the emotional intensity of past musical experiences. Frontiers in Psychology 5 (2014), 911.
[23] Markus Schedl, Arthur Flexer, and Julián Urbano. 2013. The neglected user in music information retrieval research. Journal of Intelligent Information Systems 41, 3 (2013), 523-539.
[24] Markus Schedl, Hamed Zamani, Ching-Wei Chen, Yashar Deldjoo, and Mehdi Elahi. 2018. Current challenges and visions in music recommender systems research. International Journal of Multimedia Information Retrieval 7, 2 (2018), 95-116.
[25] Morgan K. Ward, Joseph K. Goodman, and Julie R. Irwin. 2014. The same old song: The power of familiarity in music choice. Marketing Letters 25, 1 (2014), 1-11.
[26] Eelco C. E. J. Wiechert. 2018. The peak-end effect in musical playlist experiences. Master's thesis. Eindhoven University of Technology.
[27] Martijn C. Willemsen, Mark P. Graus, and Bart P. Knijnenburg. 2016. Understanding the role of latent feature diversification on choice difficulty and satisfaction. User Modeling and User-Adapted Interaction 26, 4 (2016), 347-389. https://doi.org/10.1007/s11257-016-9178-6