How playlist evaluation compares to track evaluations in music recommender systems

Sophia Hadash
Jheronimus Academy of Data Science
5211 DA ’s-Hertogenbosch, The Netherlands
s.hadash@tue.nl

Yu Liang
Jheronimus Academy of Data Science
5211 DA ’s-Hertogenbosch, The Netherlands
y.liang1@tue.nl

Martijn C. Willemsen
Eindhoven University of Technology, 5600 MB Eindhoven, The Netherlands
Jheronimus Academy of Data Science, 5211 DA ’s-Hertogenbosch, The Netherlands
m.c.willemsen@tue.nl
ABSTRACT
Most recommendation evaluations in the music domain focus on algorithmic performance: how well a recommendation algorithm can predict a user's liking of an individual track. However, individual track ratings might not fully reflect the user's liking of the whole recommendation list. Previous work has shown that subjective measures such as the perceived diversity and familiarity of the recommendations, as well as the peak-end effect, can influence the user's overall (holistic) evaluation of the list. In this study, we investigate how individual track evaluation compares to holistic playlist evaluation in music recommender systems, especially how playlist attractiveness is related to individual track ratings and to other subjective measures (perceived diversity) or objective measures (objective familiarity, the peak-end effect, and the occurrence of good recommendations in the list). We explore this relation using a within-subjects online user experiment, in which the recommendations for each condition are generated by different algorithms. We found that individual track ratings cannot fully predict playlist evaluations, as other factors such as perceived diversity and the recommendation approach can influence playlist attractiveness to a larger extent. In addition, including only the highest and last track rating (peak-end) predicts playlist attractiveness as well as including all track evaluations. Our results imply that it is important to consider which evaluation metric to use when evaluating recommendation approaches.

CCS CONCEPTS
• Human-centered computing → Empirical studies in HCI; Heuristic evaluations; • Information systems → Recommender systems; Relevance assessment; Personalization.

KEYWORDS
User-centric evaluation, recommender systems, playlist and track evaluation

ACM Reference Format:
Sophia Hadash, Yu Liang, and Martijn C. Willemsen. 2019. How playlist evaluation compares to track evaluations in music recommender systems. In Proceedings of the Joint Workshop on Interfaces and Human Decision Making for Recommender Systems (IntRS ’19). CEUR-WS.org, 9 pages.

IntRS ’19, September 19, 2019, Copenhagen, Denmark. Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 INTRODUCTION
In user-centric evaluation of personalized music recommendation, users are usually asked to indicate their degree of liking of individual tracks [2, 4, 5, 14] or to provide a holistic assessment of the entire playlist (e.g. playlist satisfaction or playlist attractiveness) generated by the recommendation approach [6, 9, 12, 16, 17]. Most recommender evaluations focus on the first type of evaluation to test algorithmic performance: can we accurately predict the liking of an individual track? Many user-centric studies in the field [15], however, focus on the second metric: does the list of recommendations provide a satisfactory experience? Often these studies find that playlist satisfaction is not just about the objective or subjective accuracy of the playlist, but also depends on the difficulty of choosing from the playlist or on playlist diversity [27]. For example, in the music domain the perceived diversity of a playlist has been shown to have a negative effect on overall playlist attractiveness [7]. Bollen et al. [3] showed that people were just as satisfied with a list of 20 movie recommendations that included the top-5 items plus a set of lower-ranked items (the twentieth item being the 1500th best rank) as with a list of the best 20 recommendations (top-20 ranked).
   Research in psychology also shows that people's memory of an overall experience is influenced by the largest peak and the end of the experience rather than by the average of the moment-to-moment experience [13]. Similar effects might occur when we ask users to holistically evaluate a list on attractiveness: they might be triggered more by particular items in the list (i.e. ones that they recognize as great (or bad), ones that are familiar rather than ones that are unknown, cf. the mere exposure effect [25]), and therefore their overall impression might not simply be the mean of the individual ratings.
   These results from earlier recommender research and from psychological research suggest that overall (holistic) playlist evaluation is not just a reflection of the average liking or rating of the individual items. However, to the best of our knowledge, no previous work has explored the relation between users' evaluation of individual tracks and their overall playlist evaluation, partly because it is not common for both types of data to be collected in the same study. Therefore, in this work, we investigate how individual
item evaluations relate to holistic evaluations in sequential music recommender systems.
   We explore these relations using a within-subjects online experiment, in which users are asked to give individual track ratings as well as their overall perception of playlist attractiveness and diversity in three conditions: (1) a track and artist similarity algorithm (base), (2) the track and artist similarity algorithm combined with a genre similarity algorithm (genre), and (3) the track and artist similarity algorithm combined with an audio feature algorithm (gmm). The track and artist similarity algorithm can be regarded as a low-spread strategy, since recommendations are generated from a small subset of the total pool of tracks relatively close to the user's tastes [8]. Both the genre approach and the gmm approach are high-spread strategies, which generate user-track ratings for a large proportion of the total pool of tracks.
   In this study, we are interested in how the perceived attractiveness of the playlist is related to perceived playlist diversity and individual track ratings across the three conditions. In addition, we also include a set of objective features of the playlist in the analysis. We test whether users' perceived attractiveness of the playlist is also affected by (1) the peak-end effect: the track they like most and the end track, (2) their familiarity with the recommendations in the playlist, and (3) the occurrence of good recommendations in the playlist: people might be satisfied with a playlist as long as at least some recommendations are good.
   The novel contribution of this work is that we include both measurements in one study of personalized music recommendations, aiming to uncover the relation between individual track evaluation and holistic evaluation of music playlists.

2 RELATED WORK
2.1 User-centric evaluation in music recommendation
User-centric evaluation of recommendation approaches is necessary in order to understand users' perception of the given recommendations [15], such as their acceptance or satisfaction [23, 24].
   User-centric evaluation in music recommendation can be done at the individual track level or at the whole playlist level. Users' perception of whole playlists is often measured in the context of automatic playlist generation [20], smooth track transitions [9], or when the goal is to evaluate the whole recommender system [12, 17]. For example, users were asked to indicate their perception of the recommended playlists to investigate how different settings of control in the recommender system influence their cognitive load as well as their acceptance of the recommendations [12]. However, when it comes to the evaluation of recommendation algorithms, users are often asked to indicate their rating [2, 4] for each individual track rather than for the playlist as a whole, neglecting the fact that tracks are often listened to in succession or within a playlist.
   Individual item ratings cannot fully reflect users' degree of liking of a recommendation list. Perceived diversity is a factor that can only be measured at the list level. Willemsen et al. [27] have shown that the perceived diversity of a movie recommendation list has a positive effect on perceived list attractiveness and that a higher perceived diversity makes it easier for users to make a choice from the recommendations. Ekstrand et al. [6] also show that perceived diversity has a positive effect on user satisfaction. In the music domain, however, Ferwerda et al. [7] found that perceived diversity has a negative effect on the perceived attractiveness of the recommendation list, although this effect turns positive when the recommendation list helps users to discover new music and enrich their music tastes.

2.2 Peak-end effect
Research in psychology has looked into the differences between the ‘remembering self’ and the ‘experiencing self’ [13], as reflected in the peak-end rule: the memory of the overall experience of a painful medical procedure is not simply the sum or average of the moment-to-moment experience, but the average of the largest peak and the end of the experience.
   In the music domain, several studies have found that the remembered intensity of the music listening experience is highly correlated with the peak, peak-end, and average moment-to-moment experience [21, 22]. However, Wiechert [26] argues that these studies fail to consider users' personal musical preferences and that the peak-end value and the average value measured in these studies might be correlated with each other. Rather than giving participants the same stimuli, Wiechert gave participants a list of songs based on their current musical preference and proposed a new metric: the pure peak-end value (the difference between peak-end and average). He found that while the average experience could explain a significant part of the variance in playlist experience, the pure peak-end value explained a part of the variance that was not explained by the average.
3 METHOD
In this study three algorithms are used for generating playlists. These algorithms are designed to use user preferences in the form of (ordered) lists of tracks, artists, or genres a user is known to like. The advantage of this input form is that the algorithms can be used with user preferences obtained from commercial platforms. In this study Spotify¹ user profiles are used; these preferences are in the form of ordered lists of top tracks and artists. The first algorithm is based on track and artist similarity. The second algorithm uses a genre similarity metric based on genre co-occurrence among artists. The third algorithm recommends tracks based on a Gaussian mixture model on track features derived from audio analyses (see [10] for details). All algorithms are described in detail in [8].

¹ https://developer.spotify.com/documentation/web-api/

3.1 Track and artist similarity algorithm
The track and artist similarity algorithm is a combination of the same sub-algorithm applied to both a list of tracks and a list of artists a user is known to like. The input to this sub-algorithm is a list of items, potentially ordered on user likeability. The sub-algorithm uses Spotify's seed recommendation system to explore items that are similar to the input. Based on the occurrence of items in the results, an output list is generated with user likeability prediction scores. The algorithm is formulated in Algorithm 1, with an illustration by example in Figure 1.

Figure 1: Illustration of the track and artist similarity algorithm using an example.

Algorithm 1 Track and artist similarity
s_i: score of item i; x: temporary score of item i; recs: recommendation set; N: number of sibling nodes; pos_node: position of the node among its parent's children; s_min, s_max: scores assigned to the last and first sibling node at the current tree depth.
 1: for each item i in recs do
 2:   s_i = 0
 3:   for each node j as current_node where node j is item i do
 4:     x = current_node.score
 5:     while current_node.parent not null do
 6:       current_node = current_node.parent
 7:       x = x * current_node.score
 8:     end while
 9:     s_i = s_i + x
10:   end for
11: end for
12: return (recs, s) ordered by s descending
13:
14: def node.score:
15:   return ((N − pos_node) / N) · (s_max − s_min) + s_min
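Read procedurally, Algorithm 1 scores each recommended item by summing, over every tree node at which the item occurs, the product of the node scores along the path to the root, where a node's own score is a linear function of its rank among its siblings. The sketch below is our reading of that scoring step in Python; the tree construction itself (repeatedly querying Spotify's seed recommendations with the user's top tracks and artists) is assumed to have happened already, and the Node class and all names are ours, not the authors'.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass(eq=False)
class Node:
    item: str                        # track or artist id returned by the seed recommender
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)

    def score(self, s_min: float = 0.1, s_max: float = 1.0) -> float:
        # Linear score by position among siblings: first sibling gets s_max, last gets s_min.
        siblings = self.parent.children if self.parent else [self]
        n, pos = len(siblings), siblings.index(self)
        return (n - pos) / n * (s_max - s_min) + s_min

def likeability_scores(nodes: List[Node]) -> dict:
    """Sum, per item, the product of node scores along the path to the root (Algorithm 1)."""
    scores = defaultdict(float)
    for node in nodes:               # every occurrence of every recommended item
        x, current = node.score(), node
        while current.parent is not None:
            current = current.parent
            x *= current.score()
        scores[node.item] += x
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```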
3.2 Genre similarity algorithm
The genre similarity algorithm uses an ordered list of genres the user likes, S_u′ (a column vector expressing the user's degree of liking of all genres, built from the user's top artists), and a similarity metric D to generate genre likeability scores for the other genres. The resulting extrapolated list S_u is then used to favor recommendations from genres with high likeability scores.
   There are 1757 different genres available in our dataset; therefore both S_u′ and S_u are column vectors of dimension 1757 and the matrix D is of dimension 1757 × 1757.
   The similarity metric is based on a co-occurrence analysis of artists, similar to the methodology used in [19]. The co-occurrence analysis used a database of n ≈ 80,000 artists; for each artist it was known which genres he/she produced music in. The data was extracted from Spotify's developer API. The co-occurrence analysis generated a normalized symmetric similarity matrix D. The likeability scores of the user for the full list of genres are then computed as follows, where I is the identity matrix:

   S_u = (D + I) S_u′    (1)
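As a quick illustration of Equation (1), the snippet below extrapolates a sparse genre preference vector with a toy 4-genre similarity matrix; the matrix values are made up for the example, whereas in the study D would be the 1757 × 1757 normalized co-occurrence matrix.

```python
import numpy as np

# Toy normalized, symmetric genre-similarity matrix D (4 genres instead of 1757).
D = np.array([
    [0.0, 0.6, 0.1, 0.0],
    [0.6, 0.0, 0.2, 0.1],
    [0.1, 0.2, 0.0, 0.7],
    [0.0, 0.1, 0.7, 0.0],
])
s_u_prime = np.array([1.0, 0.0, 0.5, 0.0])  # degree of liking built from the user's top artists

# Equation (1): S_u = (D + I) S_u'
s_u = (D + np.eye(len(D))) @ s_u_prime
print(s_u)  # likeability scores now also cover genres the user never listened to directly
```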
3.3 Audio feature algorithm
The audio feature algorithm clusters tracks with similar audio features using a Gaussian mixture model (GMM). A database of n ≈ 500,000 tracks containing 11 audio analysis features was used to train the model. The audio features consist of measures for danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, and tempo. Multiple GMMs were fitted using the expectation-maximization (EM) algorithm for varying numbers of components. The model with 21 components had the lowest BIC and was therefore selected. Cluster likeability was then computed as follows (see [8]):

   p(user likes cluster i) = (1 / N_top) · Σ_{j=1}^{N_top} p(track j belongs to cluster i)    (2)

Finally, the output recommendations favored tracks corresponding to clusters with high user likeability probabilities.
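A minimal sketch of this pipeline with scikit-learn is shown below: fit GMMs for several component counts, keep the one with the lowest BIC, and score cluster likeability for a user by averaging the posterior cluster memberships of their top tracks, as in Equation (2). The feature matrices and variable names are placeholders standing in for the study's actual training set of roughly 500,000 Spotify tracks, and the final scoring line is just one simple way to favor high-likeability clusters, not necessarily the exact ranking used in [8].

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
catalog = rng.normal(size=(2000, 11))   # stand-in for the audio-feature database (11 features)
user_top = rng.normal(size=(50, 11))    # stand-in for the user's top tracks

# Fit GMMs for varying component counts and keep the one with the lowest BIC.
candidates = [GaussianMixture(n_components=k, random_state=0).fit(catalog) for k in (5, 10, 21)]
gmm = min(candidates, key=lambda m: m.bic(catalog))

# Equation (2): average posterior cluster membership over the user's top tracks.
cluster_likeability = gmm.predict_proba(user_top).mean(axis=0)

# Score catalog tracks by how much their cluster memberships overlap high-likeability clusters.
track_scores = gmm.predict_proba(catalog) @ cluster_likeability
recommended = np.argsort(track_scores)[::-1][:10]
```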
3.4 Familiarity of the recommendations to the users
Both the track and artist similarity algorithm and the genre similarity algorithm generate recommendations close to the users' known preferences: recommendations are based on artists and genres that are familiar to the user. The audio feature algorithm, on the other hand, recommends tracks based on audio feature similarity. As a result, its recommended tracks are more likely to have genres and artists that are less familiar to the users.
4 EXPERIMENTAL DESIGN
To evaluate the relation between track evaluations and playlist evaluations, a within-subjects online experiment was conducted. The study included three conditions in randomized order: the track and artist algorithm (base), the track and artist algorithm combined with the genre similarity algorithm (genre), and the track and artist algorithm combined with the audio feature algorithm (gmm). In each condition participants were presented with a playlist containing 10 tracks generated by the corresponding algorithm and evaluated the individual tracks on likeability and personalization and the playlist as a whole on attractiveness and diversity. The playlist included the top 3 recommendations and the 20th, 40th, 60th, 80th, 100th, 200th, and 300th recommendations, in random order. Lower-ranked recommendations were included so that algorithm performance could be evaluated more easily, as lower-ranked recommendations should result in lower user evaluations.
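A sketch of this playlist construction, assuming a ranked recommendation list produced by one of the algorithms (positions are 1-based as in the text above; the function name is ours):

```python
import random

def build_playlist(ranked_track_ids, seed=None):
    """Top 3 plus the 20th, 40th, 60th, 80th, 100th, 200th and 300th recommendation, shuffled."""
    positions = [1, 2, 3, 20, 40, 60, 80, 100, 200, 300]
    playlist = [ranked_track_ids[p - 1] for p in positions]
    random.Random(seed).shuffle(playlist)
    return playlist

# Usage: build_playlist(ranked_ids) where ranked_ids holds at least 300 track ids, best first.
```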
                                                                           playlist evaluations questions (Table 1). The track evaluation phase
4.1 Participant Recruitment
Participants were primarily recruited using the JF Schouten participant database of Eindhoven University of Technology. Some participants were recruited by invitation. Participants were required to have a Spotify account (free or Premium) and to have used this account prior to taking part in the study.

4.2 Materials
The track evaluations included likeability and personalization measures. A single question was used for each measure, for every track; this was decided based on the repetitive nature of individual track evaluations. The question measuring track likeability was: "Rate how much you like the song". For measuring perceived track personalization we used the item: "Rate how well the song fits your personal music preferences". Both questions were answered on a 5-point visual scale with halves (thus 10 actual options), shown as star and heart icons as in Figure 2.
   The playlist evaluation included playlist attractiveness and playlist diversity and is presented in Table 1.
   Additional scales used in the study were a demographics scale and the Goldsmiths Musical Sophistication Index (MSI) [18]. The demographics scale measured gender, age, and Spotify usage. Spotify usage was measured using a single item: "I listen to Spotify for __ hours a week" with 7 range options.
Table 1: The playlist evaluation scale

Concept                    Item
Perceived attractiveness   The playlist was attractive
(Alpha: .94)               The playlist showed too many bad items
                           The playlist matched my preferences
Perceived diversity        The playlist was varied
(Alpha: .85)               The tracks differed a lot from each other on different aspects
                           All the tracks were similar to each other

Note. The scale is adapted from [27].
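For analysis, the items within each concept are typically averaged into a single scale score, with negatively phrased items ("The playlist showed too many bad items", "All the tracks were similar to each other") reverse-coded first; the reported Cronbach's alphas (.94 and .85) indicate that such averaging is reasonable. The paper does not spell out this scoring step, so the snippet below is only a plausible sketch under the assumption of 5-point items and reverse-coding of the negatively phrased items.

```python
def scale_score(responses, reverse=(), points=5):
    """Average a set of Likert items into one scale score, reverse-coding where needed."""
    coded = [points + 1 - v if i in reverse else v
             for i, v in enumerate(responses)]
    return sum(coded) / len(coded)

# Perceived attractiveness: item index 1 ("too many bad items") assumed reverse-coded.
attractiveness = scale_score([4, 2, 5], reverse={1})
# Perceived diversity: item index 2 ("all tracks were similar") assumed reverse-coded.
diversity = scale_score([4, 4, 2], reverse={2})
```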
4.3 Study Procedure
After consenting, participants were prompted with a login screen where they could connect their Spotify account to the study. Participants who did not have a Spotify account, or whose Spotify account contained no user preference data, could not continue with the study. After the Spotify login, participants completed a background survey in which they reported their Spotify usage and music sophistication.
   Following the background survey, the user entered the track evaluation phase, in which a playlist generated by one of the algorithms was presented. The interface (see Figure 2) contained an interactive panel showing the tracks of the playlist, a survey panel in which they had to rate the tracks, and a music control bar. Participants could freely browse through the playlist while providing the ratings. After all ratings were provided, participants entered the playlist evaluation phase, in which they answered the playlist evaluation questions (Table 1). The track evaluation phase and playlist evaluation phase were then repeated for the remaining conditions.
   Finally, participants were thanked for their time and were entered into a reward raffle: among every 5 participants, one participant received 15 euro compensation. In total the study lasted approximately 15 minutes.

Figure 2: Preview of the track rating screen as displayed to the participants during the study.

5 RESULTS
Participants in this study included 59 people, of which 54 were recruited through the JF Schouten database. The sample consisted of 31 males and 28 females. The age of the participants ranged from 19 to 64 (M = 25.6, SD = 8.8). On average participants listened to Spotify for 7 to 10 hours per week. MSI scores ranged between 0 and 5 (M = 2.18, SD = 1.0). The study took place between the 9th of January and the 1st of February 2019.
   We found that there was no effect of the personalization rating on perceived attractiveness, while the likability rating can partially predict perceived attractiveness. Furthermore, playlist attractiveness was more strongly related to the recommendation algorithm. Playlists in the gmm condition were evaluated less positively than playlists in the other conditions, even though the track evaluations were similar on average. In other words, while participants evaluated tracks across conditions similarly, the playlist evaluations differed substantially (see Figure 3).

Figure 3: Participants' subjective evaluations of individual tracks (left) and playlists (right). The error bars indicate the standard error.
5.1 Overview of statistical methods
The results are analyzed using three methodologies. The first methodology concerns the performance of the recommendation algorithms. This was analyzed using descriptive statistics on the relation between the recommendation scores predicted by the algorithms and the user ratings.
   In the second methodology, the relation between playlist evaluations and track ratings was analyzed at the playlist level (i.e. 3 observations per user). Here, an aggregate measure of the track evaluations was used, with three aggregation measures: the mean (Model 1), the peak-end value (Model 2), and the occurrence of at least a 3-star rating (Model 3). Using these aggregates, a linear mixed-effects model was estimated so that variation in participants' answering style could be included as a random effect. Playlist diversity and the recommendation approaches were included as fixed effects in Models 1a, 2a and 3a, and interaction effects were added in Models 1b, 2b and 3b.
   Finally, the last methodology explores how variation at the track level may explain playlist attractiveness. This analysis used a linear mixed-effects model at the track level (i.e. 3 × 10 observations per user) (see Table 3: Model 4), with participants modelled as a random effect, similar to the playlist-level analysis. Four types of track-level indicators were included in addition to the rating, condition, and diversity. The first indicates whether the track was high-ranked (top 3 recommendation) or low-ranked (top 20 to 300). The second indicates for each track whether it received the highest rating in the playlist; thus, if a user gave two 4-star ratings and 8 lower ratings, the variable would mark those two tracks with a 1, and 0 otherwise. The third indicator is familiarity, which shows whether a track was predicted to be familiar to the user based on their top tracks and artists. The last indicator is the playlist order: whether the playlist containing the track was the one the user evaluated first, second, or third.
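A sketch of how the playlist-level models could be specified with statsmodels is given below; the data frame, column names and coding are illustrative assumptions (the paper does not publish its analysis code), but the structure matches the description above: an aggregated rating, condition and diversity as fixed effects, participant as a random intercept, and interaction terms in the "b" variants.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Assumed long format: one row per (participant, condition) playlist evaluation,
# with per-track ratings stored in columns rating_0 .. rating_9 (rating_9 = last track).
df = pd.read_csv("playlist_evaluations.csv")  # hypothetical file
rating_cols = [f"rating_{i}" for i in range(10)]
df["rating_mean"] = df[rating_cols].mean(axis=1)
df["rating_peakend"] = (df[rating_cols].max(axis=1) + df["rating_9"]) / 2
df["rating_positive"] = (df[rating_cols] >= 3).any(axis=1).astype(int)

# Model 1a: main effects only, random intercept per participant.
m1a = smf.mixedlm("attractiveness ~ rating_mean + C(condition) + diversity",
                  data=df, groups=df["participant"]).fit()

# Model 1b: adds rating x condition and diversity x condition interactions.
m1b = smf.mixedlm("attractiveness ~ rating_mean * C(condition) + diversity * C(condition)",
                  data=df, groups=df["participant"]).fit()
print(m1a.summary(), m1b.summary())
```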
5.2 Model results
5.2.1 Algorithm performance. The relation between recommendation scores and user evaluations of tracks is depicted in Figure 4. The figure indicates that the algorithms differ in their performance on track evaluations, which is supported by an analysis of variance (ANOVA), F(2, 171) = 36.8, p < .001. The graph shows that for all algorithms higher recommendation scores result in higher user ratings, indicating that tracks predicted to be liked better indeed receive higher ratings. However, consistent with Figure 3, the scores for the base condition are consistently higher than for the other two algorithms. For the genre condition the slope seems steeper than for the other two conditions, showing that in this condition user ratings are more sensitive to the predicted recommendation scores.

Figure 4: The relation between subjective user ratings and recommendation scores predicted by the algorithms. The user ratings are slightly jittered for the scatterplot only. The shaded area represents the 95% confidence interval.

5.2.2 Playlist-level relation between track evaluations and playlist evaluations. In this analysis, the effect of track evaluations on playlist evaluations is explored at the playlist level, using the three aggregation measures (Models 1-3).
   The effect of track evaluations on playlist attractiveness is illustrated in Figure 5. All three aggregation measures are very similar in predicting playlist attractiveness (see Table 2). We see a positive effect of the aggregate measure, indicating that a user who scores higher on that measure also finds the playlist more attractive, together with negative effects of the genre and gmm conditions, consistent with the pattern in Figure 3 that gmm and genre score lower than the base condition. The aggregate indicating the occurrence of at least a 3-star rating (Model 3) is a slightly worse predictor of playlist attractiveness than the mean and peak-end measures.
                              Table 2: Playlist attractiveness by aggregated track evaluations (playlist-level)


                                                                Mean                 Peak-end                          Positive
                                                 Model 1a          Model 1b    Model 2a     Model 2b       Model 3a         Model 3b
               rating (aggregate)                    0.319∗∗∗         0.071     0.274∗∗        0.098          0.104∗            0.022
                                                  (0.095)            (0.165)    (0.091)       (0.151)        (0.043)           (0.071)
               genre                              −0.090∗          −0.665∗∗∗    −0.081∗      −0.643∗∗∗       −0.095∗         −0.503∗∗∗
                                                  (0.039)            (0.174)    (0.039)       (0.188)        (0.038)           (0.139)
               gmm                               −0.364∗∗∗         −0.741∗∗∗   −0.351∗∗∗     −0.840∗∗∗      −0.356∗∗∗        −0.730∗∗∗
                                                  (0.039)            (0.175)    (0.038)       (0.194)        (0.038)           (0.129)
               diversity                          −0.059           −0.416∗∗     −0.067       −0.424∗∗        −0.078          −0.419∗∗
                                                  (0.074)            (0.133)    (0.074)       (0.132)        (0.074)           (0.134)
               rating (aggregate):genre                              0.581∗                    0.422∗                           0.162
                                                                     (0.228)                  (0.215)                          (0.101)
               rating (aggregate):gmm                                 0.127                    0.230                            0.118
                                                                     (0.224)                  (0.217)                          (0.101)
               genre:diversity                                       0.428∗                    0.444∗                         0.481∗∗
                                                                     (0.183)                  (0.183)                          (0.185)
               gmm:diversity                                        0.526∗∗                   0.546∗∗                         0.475∗∗
                                                                     (0.174)                  (0.174)                          (0.178)
               Constant                              0.512∗∗∗       0.865∗∗∗   0.492∗∗∗      0.836∗∗∗        0.620∗∗∗         0.889∗∗∗
                                                      (0.076)        (0.135)    (0.086)       (0.141)         (0.062)          (0.102)
               N                                        176            176        176           176             176              176
               Log Likelihood                         17.052         24.794     17.007        23.711          15.602           21.237
               AIC                                   −20.105        −27.588    −20.013        −25.421        −17.204          −20.475
               BIC                                     2.089          7.287      2.180         9.454           4.990           14.401
               R²GLMM(m)                                 .351            .409      .342           .401            .330             .383
               Random Effect
               # of Participants                        59              59        59             59            59                   59
               Participant SD                         0.063            0.053     0.08          0.054          0.083               0.064
                 Note. SD = standard deviation. The models are grouped by the method used for aggregating track evaluations.
                 ’Mean’ = mean value, ’peak-end’ = average of highest rating and the last rating, ’positive’ = indicator for
                 occurrence of at least a 3-star evaluation. ∗∗∗ p < .001; ∗∗ p < .01; ∗ p < .05.


   When the interaction effects are included, the main effect of the ratings is no longer significant (Models 1b, 2b and 3b), but we get several interactions of ratings with condition and of condition with diversity. The interaction effects of condition with perceived diversity and track evaluations are visualized in Figure 6 by separating the resulting effects by condition; we discuss each condition and its interactions separately.
   The track evaluations had no effect on playlist evaluation in the base condition (they do for the other two conditions, as we will see below). Moreover, in the base condition perceived diversity has a negative effect, indicating that playlists with high perceived diversity were less attractive than playlists with low perceived diversity. One potential explanation could be that since these playlists were constructed using a low-spread approach, the recommendations were closely related to the users' known preferences (i.e. their top tracks that feed our algorithms). Therefore, the diversity in these users' preferences may have influenced the diversity of the recommended playlist. For instance, a person may listen to different genres during varying activities like working and sporting. The recommendations could then include music based on all these genres. While all recommendations are then closely related to the user's preferences and could receive potentially high evaluations, the playlist may not be very attractive due to the diversity in genres.
   In the genre condition, perceived diversity had no effect on playlist attractiveness. In this condition track evaluations strongly predicted playlist attractiveness regardless of diversity. The results show that although the genre playlists on average get a lower attractiveness score than base playlists, this difference shrinks when the aggregate ratings of the list are higher: in other words, only if users like the genre tracks do they like the playlist as much as the base playlist, which contains more low-spread, familiar tracks.
   The gmm condition showed results similar to the genre condition. Perceived diversity predicted attractiveness only marginally. However, while the track evaluations strongly predict attractiveness in the genre condition, they are only a weak predictor in the gmm condition.
In other words, high aggregate ratings cannot really make up for the fact that the gmm list in general is evaluated worse than the base list. As in the genre condition, this recommendation algorithm uses a high-spread approach and includes novel track recommendations. However, the gmm algorithm recommends tracks based on audio feature similarity, in contrast to genre similarity. Regardless of diversity or individual track evaluations, playlists generated with this approach were less attractive to participants.
   Overall we find that the overall attractiveness of a playlist is not always directly related to the liking of the individual tracks, as reflected by the aggregate ratings of the tracks, whether this is the mean rating, the peak-end value, or the fact that at least one track is highly rated. Some conditions are more sensitive to these aggregate ratings (genre) than others. We also see an important (negative) role of diversity for the base condition in predicting overall attractiveness, but no effect in the other two conditions. In other words, different aspects affect playlist evaluation, as recognized in the literature, but this highly depends on the nature of the underlying algorithm generating the recommendations.

Figure 5: Playlist attractiveness by track rating (mean). The dot size indicates the number of duplicate items of the playlist in the playlists of the other conditions.

Figure 6: Linear model of playlist attractiveness by track ratings and condition, for each condition.

5.2.3 Track-level relation between track evaluations and playlist evaluations. In this analysis, the effect of track evaluations on playlist evaluations is explored at the track level, trying to predict the overall attractiveness of each list with the individual track ratings rather than with the aggregate ratings. The results are shown in Table 3. Four types of track-level variables are included in the analysis, as described in Section 5.1.

Table 3: Playlist attractiveness by track evaluations (track-level)

                        Model 4
rating                  −0.009 (0.020)
genre                   −0.095∗∗∗ (0.009)
gmm                     −0.352∗∗∗ (0.009)
diversity               −0.027∗∗∗ (0.006)
high-ranked              0.003 (0.009)
highest rating           0.002 (0.012)
familiar                −0.011 (0.012)
playlist order           0.012∗ (0.005)
Constant                 0.704∗∗∗ (0.029)
N                        1850
Log Likelihood           630.272
AIC                     −1238.544
BIC                     −1177.791
R²GLMM(m)                .307
Random Effect
# of Participants        58
Participant SD           0.156

Note. SD = standard deviation. 'High-ranked' indicates the track was one of the top-3 recommendations, 'highest rating' indicates the track received the highest rating within that playlist for the participant, 'familiar' indicates whether the track was known to be familiar to the participant, and 'playlist order' indicates whether the playlist was the first (=1), second (=2), or third (=3) list that the participant evaluated. Interaction terms as in Models 1-3 were omitted due to similarity to these models. ∗∗∗ p < .001; ∗∗ p < .01; ∗ p < .05.
   Whether a track is high-ranked or received the highest rating shows no significant effect on the perceived attractiveness of the playlist. The track-level objective familiarity measures whether the user is familiar with the artists of a track: the user is considered familiar with a track if at least one artist of the track also appears among the artists related to the user's top listened tracks. Although we expected a positive effect of familiarity on playlist attractiveness (as also shown in [7]), no significant effect was observed in Model 4. A possible reason could be that the objective familiarity measure was not sufficient to cover all tracks the user is familiar with, since it is only computed from the user's top tracks (at most 50 per user). In future work, we plan to directly ask for (self-reported) familiarity rather than calculating it from the data. We also calculated a familiarity score for each track (how much the user is familiar with the track). We found a positive correlation between objective familiarity and track ratings (r_s(1770) = 0.326, p < .001): users give higher ratings to tracks they are more familiar with, which is in line with previous work on the mere exposure effect [1].
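The sketch below shows one way to derive the objective familiarity indicator described here (artist overlap between a recommended track and the artists related to the user's top tracks) and to compute a Spearman correlation with track ratings; the data structures are purely illustrative, as the paper does not specify the exact implementation, and the toy example does not reproduce the reported r_s.

```python
from scipy.stats import spearmanr

def familiar(track_artists, user_related_artists):
    """1 if at least one of the track's artists is related to the user's top tracks, else 0."""
    return int(bool(set(track_artists) & set(user_related_artists)))

# Illustrative inputs: per-track artist lists, the user's related-artist set, and the ratings.
tracks = [{"artists": ["a1", "a2"], "rating": 4.5},
          {"artists": ["a9"], "rating": 2.0},
          {"artists": ["a2", "a7"], "rating": 4.0}]
user_related = {"a1", "a2", "a3"}

flags = [familiar(t["artists"], user_related) for t in tracks]
rho, p = spearmanr(flags, [t["rating"] for t in tracks])  # the paper reports r_s(1770) = .326
```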
                                                                            While playlist evaluations can be partly predicted by evaluations of
on mere exposure effect [1].
                                                                            its tracks, other factors of the playlist are more predictive. People
     Playlist order is also a weak predictor of playlist attractiveness.
                                                                            seem to evaluate playlists on other aspects than merely its tracks.
Participants perceive the last playlist as the most attractive and the
                                                                            Even when individual tracks were rated positively, the playlist
first as the least attractive. However, when interaction terms as in
                                                                            attractiveness could be low.
models 1-3 are included the effect is no longer significant. We also
                                                                                We found that both diversity and recommendation approach
checked the condition orders generated by the random generator
                                                                            affected playlist attractiveness. Diversity had a negative effect on
and found that each condition order occurred approximately equally
                                                                            playlist attractiveness in recommenders using a low-spread method-
often. In other words, the effect of condition order can not explain
                                                                            ology. The track ratings were the most predictive for the playlist
difference across conditions.
                                                                            attractiveness in the recommendation approach based on genre
                                                                            similarity. Furthermore, inclusion of the highest and last track eval-
6    DISCUSSION OF RESULTS                                                  uation score (peak-end) was sufficient to predict playlist attractive-
                                                                            ness, performing just as well as the mean of the ratings.
We found that participants evaluate playlists on more aspects than
                                                                                When evaluating recommendation approaches in music recom-
merely the likeability of its tracks. Even though the tracks in rec-
                                                                            menders, it is important to consider which evaluation metric to
ommended playlists may be accurate and receive positive user
                                                                            use. Music is often consumed in succession leading to many factors
evaluations, playlists can still be evaluated negatively. In particu-
                                                                            other than track likeability that may influence whether people have
lar, the recommendation approach itself plays a role in the overall
                                                                            satisfactory experiences. Although individual track evaluations are
perceived playlist attractiveness.
                                                                            often used in recommender evaluation, track evaluations do not
   One explanation may be that users have different distinct musi-
                                                                            seem to predict playlist attractiveness very consistently.
cal styles. Playlists that contain music from more than one of the
                                                                                While we showed that playlist attractiveness is not primarily
users’ styles may be less attractive to the user even though the
                                                                            related to track evaluations, we were unable to effectively measure
track recommendations are accurate. Playlists in the base condition
                                                                            why certain algorithms generated more attractive playlists com-
are most attractive, but suffer most from diversity. Users with mul-
                                                                            pared to others. This question will be addressed in future work.
tiple musical styles may have received playlists with music from
                                                                            We intent to include a subjective measure for track familiarity. Fur-
multiple styles which could have been reflected in the perceived
                                                                            thermore, we will identify and attempt to separate distinct musical
diversity of the playlist. Playlists from the дenre condition were
                                                                            styles within user preferences. For example, we could give users
also based on genre similarity, in addition to the track and artist
                                                                            control about which top artists or top tracks they would like to use
similarity. Therefore, if multiple musical styles are present in the
                                                                            to generate recommendations as in [11] to separate the tracks and
user preferences, it is more likely in the дenre condition that the
                                                                            artists they like under different context.
musical style with the highest overall contribution overrules the
music from the other musical styles. Furthermore, the дmm con-
REFERENCES
 [1] Luke Barrington, Reid Oda, and Gert RG Lanckriet. 2009. Smarter than Genius? Human Evaluation of Music Recommender Systems. In ISMIR, Vol. 9. Citeseer, 357–362.
 [2] Dmitry Bogdanov, Martín Haro, Ferdinand Fuhrmann, Anna Xambó, Emilia Gómez, and Perfecto Herrera. 2013. Semantic audio content-based music recommendation and visualization based on user preference examples. Information Processing & Management 49, 1 (2013), 13–33.
 [3] Dirk Bollen, Bart P Knijnenburg, Martijn C Willemsen, and Mark Graus. 2010. Understanding choice overload in recommender systems. In Proceedings of the fourth ACM conference on Recommender systems. ACM, 63–70.
 [4] Òscar Celma and Perfecto Herrera. 2008. A new approach to evaluating novel recommendations. In Proceedings of the 2008 ACM conference on Recommender systems. ACM, 179–186.
 [5] Zhiyong Cheng. 2011. Just-for-Me: An Adaptive Personalization System for Location-Aware Social Music Recommendation. (2011).
 [6] Michael D. Ekstrand, F. Maxwell Harper, Martijn C. Willemsen, and Joseph A. Konstan. 2014. User perception of differences in recommender algorithms. Proceedings of the 8th ACM Conference on Recommender Systems - RecSys '14 (2014), 161–168. https://doi.org/10.1145/2645710.2645737
 [7] Bruce Ferwerda, Mark P Graus, Andreu Vall, Marko Tkalcic, and Markus Schedl. 2017. How item discovery enabled by diversity leads to increased recommendation list attractiveness. In Proceedings of the Symposium on Applied Computing. ACM, 1693–1696.
 [8] Sophia Hadash. 2019. Evaluating a framework for sequential group music recommendations: A Modular Framework for Dynamic Fairness and Coherence control. Master. Eindhoven University of Technology. https://pure.tue.nl/ws/portalfiles/portal/122439578/Master_thesis_shadash_v1.0.1_1_.pdf
 [9] Shobu Ikeda, Kenta Oku, and Kyoji Kawagoe. 2018. Music Playlist Recommendation Using Acoustic-Feature Transition Inside the Songs. (2018), 216–219. https://doi.org/10.1145/3151848.3151880
[10] Tristan Jehan and David Desroches. 2004. Analyzer Documentation [version 3.2]. Technical Report. The Echo Nest Corporation, Somerville, MA. http://docs.echonest.com.s3-website-us-east-1.amazonaws.com/_static/AnalyzeDocumentation.pdf
[11] Yucheng Jin, Bruno Cardoso, and Katrien Verbert. 2017. How do different levels of user control affect cognitive load and acceptance of recommendations? In CEUR Workshop Proceedings, Vol. 1884. CEUR Workshop Proceedings, 35–42.
[12] Yucheng Jin, Nava Tintarev, and Katrien Verbert. 2018. Effects of personal characteristics on music recommender systems with different levels of controllability. (2018), 13–21. https://doi.org/10.1145/3240323.3240358
[13] Daniel Kahneman. 2011. Thinking, fast and slow. Macmillan.
[14] Iman Kamehkhosh and Dietmar Jannach. 2017. User Perception of Next-Track Music Recommendations. (2017), 113–121. https://doi.org/10.1145/3079628.3079668
[15] Bart P Knijnenburg, Martijn C Willemsen, Zeno Gantner, Hakan Soncu, and Chris Newell. 2012. Explaining the user experience of recommender systems. User Modeling and User-Adapted Interaction 22, 4-5 (2012), 441–504.
[16] Arto Lehtiniemi and Jukka Holm. 2011. Easy Access to Recommendation Playlists: Selecting Music by Exploring Preview Clips in Album Cover Space. Proceedings of the 10th International Conference on Mobile and Ubiquitous Multimedia (2011), 94–99. https://doi.org/10.1145/2107596.2107607
[17] Martijn Millecamp, Nyi Nyi Htun, Yucheng Jin, and Katrien Verbert. 2018. Controlling Spotify Recommendations. (2018), 101–109. https://doi.org/10.1145/3209219.3209223
[18] Daniel Müllensiefen, Bruno Gingras, Lauren Stewart, and Jason Ji. 2013. Goldsmiths Musical Sophistication Index (Gold-MSI) v1.0: Technical Report and Documentation Revision 0.3. Technical Report. Goldsmiths University of London, London. https://www.gold.ac.uk/music-mind-brain/gold-msi/
[19] F. Pachet, G. Westermann, and D. Laigre. 2001. Musical data mining for electronic music distribution. Proceedings - 1st International Conference on WEB Delivering of Music, WEDELMUSIC 2001 (2001), 101–106. https://doi.org/10.1109/WDM.2001.990164
[20] Steffen Pauws and Berry Eggen. 2003. Realization and user evaluation of an automatic playlist generator. Journal of New Music Research 32, 2 (2003), 179–192.
[21] Alexander Rozin, Paul Rozin, and Emily Goldberg. 2004. The feeling of music past: How listeners remember musical affect. Music Perception: An Interdisciplinary Journal 22, 1 (2004), 15–39.
[22] Thomas Schäfer, Doreen Zimmermann, and Peter Sedlmeier. 2014. How we remember the emotional intensity of past musical experiences. Frontiers in Psychology 5 (2014), 911.
[23] Markus Schedl, Arthur Flexer, and Julián Urbano. 2013. The neglected user in music information retrieval research. Journal of Intelligent Information Systems 41, 3 (2013), 523–539.
[24] Markus Schedl, Hamed Zamani, Ching-Wei Chen, Yashar Deldjoo, and Mehdi Elahi. 2018. Current challenges and visions in music recommender systems research. International Journal of Multimedia Information Retrieval 7, 2 (2018), 95–116.
[25] Morgan K Ward, Joseph K Goodman, and Julie R Irwin. 2014. The same old song: The power of familiarity in music choice. Marketing Letters 25, 1 (2014), 1–11.
[26] Eelco C. E. J. Wiechert. 2018. The peak-end effect in musical playlist experiences. Master. Eindhoven University of Technology.
[27] Martijn C. Willemsen, Mark P. Graus, and Bart P. Knijnenburg. 2016. Understanding the role of latent feature diversification on choice difficulty and satisfaction. User Modelling and User-Adapted Interaction 26, 4 (2016), 347–389. https://doi.org/10.1007/s11257-016-9178-6