How playlist evaluation compares to track evaluations in music recommender systems

Sophia Hadash
Jheronimus Academy of Data Science
5211 DA ’s-Hertogenbosch, The Netherlands
s.hadash@tue.nl

Yu Liang
Jheronimus Academy of Data Science
5211 DA ’s-Hertogenbosch, The Netherlands
y.liang1@tue.nl

Martijn C. Willemsen
Eindhoven University of Technology, 5600 MB Eindhoven, The Netherlands
Jheronimus Academy of Data Science, 5211 DA ’s-Hertogenbosch, The Netherlands
m.c.willemsen@tue.nl
ABSTRACT
Most recommendation evaluations in the music domain focus on algorithmic performance: how well a recommendation algorithm can predict a user's liking of an individual track. However, individual track ratings might not fully reflect the user's liking of the whole recommendation list. Previous work has shown that subjective measures such as the perceived diversity and familiarity of the recommendations, as well as the peak-end effect, can influence the user's overall (holistic) evaluation of the list. In this study, we investigate how individual track evaluation compares to holistic playlist evaluation in music recommender systems, especially how playlist attractiveness is related to individual track ratings and to other subjective measures (perceived diversity) or objective measures (objective familiarity, the peak-end effect, and the occurrence of good recommendations in the list). We explore this relation using a within-subjects online user experiment, in which the recommendations for each condition are generated by different algorithms. We found that individual track ratings cannot fully predict playlist evaluations, as other factors such as perceived diversity and the recommendation approach can influence playlist attractiveness to a larger extent. In addition, including only the highest and last track rating (peak-end) predicts playlist attractiveness as well as including all track evaluations. Our results imply that it is important to consider which evaluation metric to use when evaluating recommendation approaches.

CCS CONCEPTS
• Human-centered computing → Empirical studies in HCI; Heuristic evaluations; • Information systems → Recommender systems; Relevance assessment; Personalization.

KEYWORDS
User-centric evaluation, recommender systems, playlist and track evaluation

ACM Reference Format:
Sophia Hadash, Yu Liang, and Martijn C. Willemsen. 2019. How playlist evaluation compares to track evaluations in music recommender systems. In Proceedings of the Joint Workshop on Interfaces and Human Decision Making for Recommender Systems (IntRS ’19). CEUR-WS.org, 9 pages.

IntRS ’19, September 19, 2019, Copenhagen, Denmark. Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 INTRODUCTION
In user-centric evaluation of personalized music recommendation, users are usually asked to indicate their degree of liking of individual tracks [2, 4, 5, 14] or to provide a holistic assessment of the entire playlist (e.g. playlist satisfaction or playlist attractiveness) generated by the recommendation approach [6, 9, 12, 16, 17]. Most recommender evaluations focus on the first type of evaluation to test algorithmic performance: can we accurately predict the liking of an individual track? Many user-centric studies in the field [15], however, focus on the second metric: does the list of recommendations provide a satisfactory experience? Often these studies find that playlist satisfaction is not just about the objective or subjective accuracy of the playlist, but also depends on the difficulty of choosing from the playlist or on playlist diversity [27]. For example, in the music domain the perceived diversity of a playlist has been shown to have a negative effect on overall playlist attractiveness [7]. Bollen et al. [3] showed that people were just as satisfied with a list of 20 movie recommendations that included the top-5 items plus a set of lower-ranked items (the twentieth item being the 1500th best rank) as with a list of the best 20 recommendations (top-20 ranked).
   Research in psychology also shows that people's memory of an overall experience is influenced by the largest peak and the end of the experience rather than by the average of the moment-to-moment experience [13]. Similar effects might occur when we ask users to holistically evaluate a list on attractiveness: they might be triggered more by particular items in the list (i.e. ones that they recognize as great (or bad), ones that are familiar rather than ones that are unknown, cf. the mere exposure effect [25]), and therefore their overall impression might not simply be the mean of the individual ratings.
   These results from earlier recommender research and from psychological research suggest that overall (holistic) playlist evaluation is not just a reflection of the average liking or rating of the individual items. However, to the best of our knowledge, no previous work has explored the relation between users' evaluation of individual tracks and their overall playlist evaluation, partly because it is not common for both types of data to be collected in the same study. Therefore, in this work, we investigate how individual
item evaluations relate to holistic evaluations in sequential music recommender systems.
   We explore these relations using a within-subjects online experiment, in which users are asked to give individual track ratings as well as their overall perception of playlist attractiveness and diversity in three conditions: (1) a track and artist similarity algorithm (base), (2) the track and artist similarity algorithm combined with a genre similarity algorithm (genre), and (3) the track and artist similarity algorithm combined with an audio feature algorithm (gmm). The track and artist similarity algorithm can be regarded as a low-spread strategy, since recommendations are generated from a small subset of the total pool of tracks relatively close to the user's tastes [8]. Both the genre approach and the gmm approach are high-spread strategies, which generate user-track ratings for a large proportion of the total pool of tracks.
   In this study, we are interested in how the perceived attractiveness of the playlist is related to perceived playlist diversity and individual track ratings across the three conditions. In addition, we also include a set of objective features of the playlist in the analysis. We test whether users' perceived attractiveness of the playlist is also affected by (1) the peak-end effect: the track they like most and the end track, (2) their familiarity with the recommendations in the playlist, and (3) the occurrence of good recommendations in the playlist: people might be satisfied with a playlist as long as at least some recommendations are good.
   The novel contribution of this work is that we include both measurements in one study of personalized music recommendations, aiming to uncover the relation between individual track evaluation and holistic evaluation of music playlists.

2 RELATED WORK
2.1 User-centric evaluation in music recommendation
User-centric evaluation of recommendation approaches is necessary in order to understand users' perception of the given recommendations [15], such as their acceptance or satisfaction [23, 24].
   User-centric evaluation in music recommendation can be done at the individual track level or at the whole playlist level. Users' perception of whole playlists is often measured in the context of automatic playlist generation [20], smooth track transitions [9], or when the goal is to evaluate the whole recommender system [12, 17]. For example, users were asked to indicate their perception of the recommended playlists to investigate how different settings of control in the recommender system influence their cognitive load as well as their acceptance of the recommendations [12]. However, when it comes to the evaluation of recommendation algorithms, users are often asked to indicate their rating [2, 4] for each individual track rather than for the playlist as a whole, neglecting the fact that tracks are often listened to in succession or within a playlist.
   Individual item ratings cannot fully reflect users' degree of liking of a recommendation list. Perceived diversity is a factor that can only be measured at the list level. Willemsen et al. [27] have shown that the perceived diversity of a movie recommendation list has a positive effect on perceived list attractiveness and that a higher perceived diversity makes it easier for users to make a choice from the recommendations. Ekstrand et al. [6] also show that perceived diversity has a positive effect on user satisfaction. In the music domain, however, Ferwerda et al. [7] found that perceived diversity has a negative effect on the perceived attractiveness of the recommendation list, although this effect turns positive when the recommendation list helps users to discover new music and enrich their music tastes.

2.2 Peak-end effect
Research in psychology has looked into the differences between the ‘remembering self’ and the ‘experiencing self’ [13], as reflected in the peak-end rule: the memory of the overall experience of a painful medical procedure is not simply the sum or average of the moment-to-moment experience, but the average of the largest peak and the end of the experience.
   In the music domain, several studies have found that the remembered intensity of the music listening experience is highly correlated with the peak, peak-end, and average moment-to-moment experience [21, 22]. However, Wiechert [26] argues that these studies fail to consider users' personal musical preferences and that the peak-end value and the average value measured in these studies might be correlated with each other. Rather than giving participants the same stimuli, Wiechert gave participants a list of songs based on their current musical preference and proposed a new metric: the pure peak-end value (the difference between peak-end and average). He found that while the average experience could explain a significant part of the variance in playlist experience, the pure peak-end value explained a part of the variance that was not explained by the average.
3 METHOD
In this study three algorithms are used for generating playlists. These algorithms are designed to use user preferences in the form of (ordered) lists of tracks, artists, or genres a user is known to like. The advantage of this input form is that the algorithms can be used with user preferences obtained from commercial platforms. In this study Spotify¹ user profiles are used; these preferences are in the form of ordered lists of top tracks and artists. The first algorithm is based on track and artist similarity. The second algorithm uses a genre similarity metric based on genre co-occurrence among artists. The third algorithm recommends tracks based on a Gaussian mixture model on track features derived from audio analyses (see [10] for details). All algorithms are described in detail in [8].

¹ https://developer.spotify.com/documentation/web-api/

3.1 Track and artist similarity algorithm
The track and artist similarity algorithm is a combination of the same sub-algorithm applied to both a list of tracks and a list of artists a user is known to like. The input to this sub-algorithm is a list of items, potentially ordered on user likeability. The sub-algorithm uses Spotify's seed recommendation system to explore items that are similar to the input. Based on the occurrence of items in the results, an output list is generated with user likeability prediction scores. The algorithm is formulated in Algorithm 1, with an illustration by example in Figure 1.

Figure 1: Illustration of the track and artist similarity algorithm using an example.

Algorithm 1 Track and artist similarity
s_i: score of item i; x: temporary score of item i; recs: recommendation set; N: number of sibling nodes; pos_node: position of the node among its parent's children; s_min, s_max: scores assigned to the last and first sibling node at the current tree depth.
 1: for each item i in recs do
 2:   s_i = 0
 3:   for each node j as current_node where node j is item i do
 4:     x = current_node.score
 5:     while current_node.parent not null do
 6:       current_node = current_node.parent
 7:       x = x * current_node.score
 8:     end while
 9:     s_i = s_i + x
10:   end for
11: end for
12: return (recs, s) ordered by s descending
13:
14: def node.score:
15:   return ((N − pos_node) / N) · (s_max − s_min) + s_min
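Read procedurally, Algorithm 1 scores each recommended item by summing, over every tree node at which the item occurs, the product of the node scores along the path to the root, where a node's own score is a linear function of its rank among its siblings. The sketch below is our reading of that scoring step in Python; the tree construction itself (repeatedly querying Spotify's seed recommendations with the user's top tracks and artists) is assumed to have happened already, and the Node class and all names are ours, not the authors'.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass(eq=False)
class Node:
    item: str                        # track or artist id returned by the seed recommender
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)

    def score(self, s_min: float = 0.1, s_max: float = 1.0) -> float:
        # Linear score by position among siblings: first sibling gets s_max, last gets s_min.
        siblings = self.parent.children if self.parent else [self]
        n, pos = len(siblings), siblings.index(self)
        return (n - pos) / n * (s_max - s_min) + s_min

def likeability_scores(nodes: List[Node]) -> dict:
    """Sum, per item, the product of node scores along the path to the root (Algorithm 1)."""
    scores = defaultdict(float)
    for node in nodes:               # every occurrence of every recommended item
        x, current = node.score(), node
        while current.parent is not None:
            current = current.parent
            x *= current.score()
        scores[node.item] += x
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```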
3.2 Genre similarity algorithm
The genre similarity algorithm uses an ordered list of genres the user likes, S_u′ (a column vector expressing the user's degree of liking of all genres, built from the user's top artists), and a similarity metric D to generate genre likeability scores for the other genres. The resulting extrapolated list S_u is then used to favor recommendations from genres with high likeability scores.
   There are 1757 different genres available in our dataset; therefore both S_u′ and S_u are column vectors of dimension 1757 and the matrix D is of dimension 1757 × 1757.
   The similarity metric is based on a co-occurrence analysis of artists, similar to the methodology used in [19]. The co-occurrence analysis used a database of n ≈ 80,000 artists; for each artist it was known which genres he/she produced music in. The data was extracted from Spotify's developer API. The co-occurrence analysis generated a normalized symmetric similarity matrix D. The likeability scores of the user for the full list of genres are then computed as follows, where I is the identity matrix:

   S_u = (D + I) S_u′    (1)
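As a quick illustration of Equation (1), the snippet below extrapolates a sparse genre preference vector with a toy 4-genre similarity matrix; the matrix values are made up for the example, whereas in the study D would be the 1757 × 1757 normalized co-occurrence matrix.

```python
import numpy as np

# Toy normalized, symmetric genre-similarity matrix D (4 genres instead of 1757).
D = np.array([
    [0.0, 0.6, 0.1, 0.0],
    [0.6, 0.0, 0.2, 0.1],
    [0.1, 0.2, 0.0, 0.7],
    [0.0, 0.1, 0.7, 0.0],
])
s_u_prime = np.array([1.0, 0.0, 0.5, 0.0])  # degree of liking built from the user's top artists

# Equation (1): S_u = (D + I) S_u'
s_u = (D + np.eye(len(D))) @ s_u_prime
print(s_u)  # likeability scores now also cover genres the user never listened to directly
```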
3.3 Audio feature algorithm
The audio feature algorithm clusters tracks with similar audio features using a Gaussian mixture model (GMM). A database of n ≈ 500,000 tracks containing 11 audio analysis features was used to train the model. The audio features consist of measures for danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, and tempo. Multiple GMMs were fitted using the expectation-maximization (EM) algorithm for varying numbers of components. The model with 21 components had the lowest BIC and was therefore selected. Cluster likeability was then computed as follows (see [8]):

   p(user likes cluster i) = (1 / N_top) · Σ_{j=1}^{N_top} p(track j belongs to cluster i)    (2)

Finally, the output recommendations favored tracks corresponding to clusters with high user likeability probabilities.
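A minimal sketch of this pipeline with scikit-learn is shown below: fit GMMs for several component counts, keep the one with the lowest BIC, and score cluster likeability for a user by averaging the posterior cluster memberships of their top tracks, as in Equation (2). The feature matrices and variable names are placeholders standing in for the study's actual training set of roughly 500,000 Spotify tracks, and the final scoring line is just one simple way to favor high-likeability clusters, not necessarily the exact ranking used in [8].

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
catalog = rng.normal(size=(2000, 11))   # stand-in for the audio-feature database (11 features)
user_top = rng.normal(size=(50, 11))    # stand-in for the user's top tracks

# Fit GMMs for varying component counts and keep the one with the lowest BIC.
candidates = [GaussianMixture(n_components=k, random_state=0).fit(catalog) for k in (5, 10, 21)]
gmm = min(candidates, key=lambda m: m.bic(catalog))

# Equation (2): average posterior cluster membership over the user's top tracks.
cluster_likeability = gmm.predict_proba(user_top).mean(axis=0)

# Score catalog tracks by how much their cluster memberships overlap high-likeability clusters.
track_scores = gmm.predict_proba(catalog) @ cluster_likeability
recommended = np.argsort(track_scores)[::-1][:10]
```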
3.4 Familiarity of the recommendations to the users
Both the track and artist similarity algorithm and the genre similarity algorithm generate recommendations close to the users' known preferences: recommendations are based on artists and genres that are familiar to the user. The audio feature algorithm, on the other hand, recommends tracks based on audio feature similarity. As a result, its recommended tracks are more likely to have genres and artists that are less familiar to the users.
4 EXPERIMENTAL DESIGN
To evaluate the relation between track evaluations and playlist evaluations, a within-subjects online experiment was conducted. The study included three conditions in randomized order: the track and artist algorithm (base), the track and artist algorithm combined with the genre similarity algorithm (genre), and the track and artist algorithm combined with the audio feature algorithm (gmm). In each condition participants were presented with a playlist containing 10 tracks generated by the corresponding algorithm and evaluated the individual tracks on likeability and personalization and the playlist as a whole on attractiveness and diversity. The playlist included the top 3 recommendations and the 20th, 40th, 60th, 80th, 100th, 200th, and 300th recommendations, in random order. Lower-ranked recommendations were included so that algorithm performance could be evaluated more easily, as lower-ranked recommendations should result in lower user evaluations.
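A sketch of this playlist construction, assuming a ranked recommendation list produced by one of the algorithms (positions are 1-based as in the text above; the function name is ours):

```python
import random

def build_playlist(ranked_track_ids, seed=None):
    """Top 3 plus the 20th, 40th, 60th, 80th, 100th, 200th and 300th recommendation, shuffled."""
    positions = [1, 2, 3, 20, 40, 60, 80, 100, 200, 300]
    playlist = [ranked_track_ids[p - 1] for p in positions]
    random.Random(seed).shuffle(playlist)
    return playlist

# Usage: build_playlist(ranked_ids) where ranked_ids holds at least 300 track ids, best first.
```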
                                                                           playlist evaluations questions (Table 1). The track evaluation phase
4.1 Participant Recruitment
Participants were primarily recruited using the JF Schouten participant database of Eindhoven University of Technology. Some participants were recruited by invitation. Participants were required to have a Spotify account (free or Premium) and to have used this account prior to taking part in the study.

4.2 Materials
The track evaluations included likeability and personalization measures. A single question was used for each measure, for every track; this was decided based on the repetitive nature of individual track evaluations. The question measuring track likeability was: "Rate how much you like the song". For measuring perceived track personalization we used the item: "Rate how well the song fits your personal music preferences". Both questions were answered on a 5-point visual scale with halves (thus 10 actual options), shown as star and heart icons as in Figure 2.
   The playlist evaluation included playlist attractiveness and playlist diversity and is presented in Table 1.
   Additional scales used in the study were a demographics scale and the Goldsmiths Musical Sophistication Index (MSI) [18]. The demographics scale measured gender, age, and Spotify usage. Spotify usage was measured using a single item: "I listen to Spotify for __ hours a week" with 7 range options.
Table 1: The playlist evaluation scale

Concept                    Item
Perceived attractiveness   The playlist was attractive
(Alpha: .94)               The playlist showed too many bad items
                           The playlist matched my preferences
Perceived diversity        The playlist was varied
(Alpha: .85)               The tracks differed a lot from each other on different aspects
                           All the tracks were similar to each other

Note. The scale is adapted from [27].
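For analysis, the items within each concept are typically averaged into a single scale score, with negatively phrased items ("The playlist showed too many bad items", "All the tracks were similar to each other") reverse-coded first; the reported Cronbach's alphas (.94 and .85) indicate that such averaging is reasonable. The paper does not spell out this scoring step, so the snippet below is only a plausible sketch under the assumption of 5-point items and reverse-coding of the negatively phrased items.

```python
def scale_score(responses, reverse=(), points=5):
    """Average a set of Likert items into one scale score, reverse-coding where needed."""
    coded = [points + 1 - v if i in reverse else v
             for i, v in enumerate(responses)]
    return sum(coded) / len(coded)

# Perceived attractiveness: item index 1 ("too many bad items") assumed reverse-coded.
attractiveness = scale_score([4, 2, 5], reverse={1})
# Perceived diversity: item index 2 ("all tracks were similar") assumed reverse-coded.
diversity = scale_score([4, 4, 2], reverse={2})
```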
4.3 Study Procedure
After consenting, participants were prompted with a login screen where they could connect their Spotify account to the study. Participants who did not have a Spotify account, or whose Spotify account contained no user preference data, could not continue with the study. After the Spotify login, participants completed a background survey in which they reported their Spotify usage and music sophistication.
   Following the background survey, the user entered the track evaluation phase, in which a playlist generated by one of the algorithms was presented. The interface (see Figure 2) contained an interactive panel showing the tracks of the playlist, a survey panel in which they had to rate the tracks, and a music control bar. Participants could freely browse through the playlist while providing the ratings. After all ratings were provided, participants entered the playlist evaluation phase, in which they answered the playlist evaluation questions (Table 1). The track evaluation phase and playlist evaluation phase were then repeated for the remaining conditions.
   Finally, participants were thanked for their time and were entered into a reward raffle: among every 5 participants, one participant received 15 euro compensation. In total the study lasted approximately 15 minutes.

Figure 2: Preview of the track rating screen as displayed to the participants during the study.

5 RESULTS
Participants in this study included 59 people, of which 54 were recruited through the JF Schouten database. The sample consisted of 31 males and 28 females. The age of the participants ranged from 19 to 64 (M = 25.6, SD = 8.8). On average participants listened to Spotify for 7 to 10 hours per week. MSI scores ranged between 0 and 5 (M = 2.18, SD = 1.0). The study took place between the 9th of January and the 1st of February 2019.
   We found that there was no effect of the personalization rating on perceived attractiveness, while the likability rating can partially predict perceived attractiveness. Furthermore, playlist attractiveness was more strongly related to the recommendation algorithm. Playlists in the gmm condition were evaluated less positively than playlists in the other conditions, even though the track evaluations were similar on average. In other words, while participants evaluated tracks across conditions similarly, the playlist evaluations differed substantially (see Figure 3).

Figure 3: Participants' subjective evaluations of individual tracks (left) and playlists (right). The error bars indicate the standard error.
5.1 Overview of statistical methods
The results are analyzed using three methodologies. The first methodology concerns the performance of the recommendation algorithms. This was analyzed using descriptive statistics on the relation between the recommendation scores predicted by the algorithms and the user ratings.
   In the second methodology, the relation between playlist evaluations and track ratings was analyzed at the playlist level (i.e. 3 observations per user). Here, an aggregate measure of the track evaluations was used, with three aggregation measures: the mean (Model 1), the peak-end value (Model 2), and the occurrence of at least a 3-star rating (Model 3). Using these aggregates, a linear mixed-effects model was estimated so that variation in participants' answering style could be included as a random effect. Playlist diversity and the recommendation approaches were included as fixed effects in Models 1a, 2a and 3a, and interaction effects were added in Models 1b, 2b and 3b.
   Finally, the last methodology explores how variation at the track level may explain playlist attractiveness. This analysis used a linear mixed-effects model at the track level (i.e. 3 × 10 observations per user) (see Table 3: Model 4), with participants modelled as a random effect, similar to the playlist-level analysis. Four types of track-level indicators were included in addition to the rating, condition, and diversity. The first indicates whether the track was high-ranked (top 3 recommendation) or low-ranked (top 20 to 300). The second indicates for each track whether it received the highest rating in the playlist; thus, if a user gave two 4-star ratings and 8 lower ratings, the variable would mark those two tracks with a 1, and 0 otherwise. The third indicator is familiarity, which shows whether a track was predicted to be familiar to the user based on their top tracks and artists. The last indicator is the playlist order: whether the playlist containing the track was the one the user evaluated first, second, or third.
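A sketch of how the playlist-level models could be specified with statsmodels is given below; the data frame, column names and coding are illustrative assumptions (the paper does not publish its analysis code), but the structure matches the description above: an aggregated rating, condition and diversity as fixed effects, participant as a random intercept, and interaction terms in the "b" variants.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Assumed long format: one row per (participant, condition) playlist evaluation,
# with per-track ratings stored in columns rating_0 .. rating_9 (rating_9 = last track).
df = pd.read_csv("playlist_evaluations.csv")  # hypothetical file
rating_cols = [f"rating_{i}" for i in range(10)]
df["rating_mean"] = df[rating_cols].mean(axis=1)
df["rating_peakend"] = (df[rating_cols].max(axis=1) + df["rating_9"]) / 2
df["rating_positive"] = (df[rating_cols] >= 3).any(axis=1).astype(int)

# Model 1a: main effects only, random intercept per participant.
m1a = smf.mixedlm("attractiveness ~ rating_mean + C(condition) + diversity",
                  data=df, groups=df["participant"]).fit()

# Model 1b: adds rating x condition and diversity x condition interactions.
m1b = smf.mixedlm("attractiveness ~ rating_mean * C(condition) + diversity * C(condition)",
                  data=df, groups=df["participant"]).fit()
print(m1a.summary(), m1b.summary())
```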
5.2 Model results
5.2.1 Algorithm performance. The relation between recommendation scores and user evaluations of tracks is depicted in Figure 4. The figure indicates that the algorithms differ in their performance on track evaluations, which is supported by an analysis of variance (ANOVA), F(2, 171) = 36.8, p < .001. The graph shows that for all algorithms higher recommendation scores result in higher user ratings, indicating that tracks predicted to be liked better indeed receive higher ratings. However, consistent with Figure 3, the scores for the base condition are consistently higher than for the other two algorithms. For the genre condition the slope seems steeper than for the other two conditions, showing that in this condition user ratings are more sensitive to the predicted recommendation scores.

Figure 4: The relation between subjective user ratings and recommendation scores predicted by the algorithms. The user ratings are slightly jittered for the scatterplot only. The shaded area represents the 95% confidence interval.

5.2.2 Playlist-level relation between track evaluations and playlist evaluations. In this analysis, the effect of track evaluations on playlist evaluations is explored at the playlist level, using the three aggregation measures (Models 1-3).
   The effect of track evaluations on playlist attractiveness is illustrated in Figure 5. All three aggregation measures are very similar in predicting playlist attractiveness (see Table 2). We see a positive effect of the aggregate measure, indicating that a user who scores higher on that measure also finds the playlist more attractive, together with negative effects of the genre and gmm conditions, consistent with the pattern in Figure 3 that gmm and genre score lower than the base condition. The aggregate indicating the occurrence of at least a 3-star rating (Model 3) is a slightly worse predictor of playlist attractiveness than the mean and peak-end measures.
                              Table 2: Playlist attractiveness by aggregated track evaluations (playlist-level)


                                                                Mean                 Peak-end                          Positive
                                                 Model 1a          Model 1b    Model 2a     Model 2b       Model 3a         Model 3b
               rating (aggregate)                    0.319∗∗∗         0.071     0.274∗∗        0.098          0.104∗            0.022
                                                  (0.095)            (0.165)    (0.091)       (0.151)        (0.043)           (0.071)
               genre                              −0.090∗          −0.665∗∗∗    −0.081∗      −0.643∗∗∗       −0.095∗         −0.503∗∗∗
                                                  (0.039)            (0.174)    (0.039)       (0.188)        (0.038)           (0.139)
               gmm                               −0.364∗∗∗         −0.741∗∗∗   −0.351∗∗∗     −0.840∗∗∗      −0.356∗∗∗        −0.730∗∗∗
                                                  (0.039)            (0.175)    (0.038)       (0.194)        (0.038)           (0.129)
               diversity                          −0.059           −0.416∗∗     −0.067       −0.424∗∗        −0.078          −0.419∗∗
                                                  (0.074)            (0.133)    (0.074)       (0.132)        (0.074)           (0.134)
               rating (aggregate):genre                              0.581∗                    0.422∗                           0.162
                                                                     (0.228)                  (0.215)                          (0.101)
               rating (aggregate):gmm                                 0.127                    0.230                            0.118
                                                                     (0.224)                  (0.217)                          (0.101)
               genre:diversity                                       0.428∗                    0.444∗                         0.481∗∗
                                                                     (0.183)                  (0.183)                          (0.185)
               gmm:diversity                                        0.526∗∗                   0.546∗∗                         0.475∗∗
                                                                     (0.174)                  (0.174)                          (0.178)
               Constant                              0.512∗∗∗       0.865∗∗∗   0.492∗∗∗      0.836∗∗∗        0.620∗∗∗         0.889∗∗∗
                                                      (0.076)        (0.135)    (0.086)       (0.141)         (0.062)          (0.102)
               N                                        176            176        176           176             176              176
               Log Likelihood                         17.052         24.794     17.007        23.711          15.602           21.237
               AIC                                   −20.105        −27.588    −20.013        −25.421        −17.204          −20.475
               BIC                                     2.089          7.287      2.180         9.454           4.990           14.401
               R²GLMM(m)                                 .351            .409      .342           .401            .330             .383
               Random Effect
               # of Participants                        59              59        59             59            59                   59
               Participant SD                         0.063            0.053     0.08          0.054          0.083               0.064
                 Note. SD = standard deviation. The models are grouped by the method used for aggregating track evaluations.
                 ’Mean’ = mean value, ’peak-end’ = average of highest rating and the last rating, ’positive’ = indicator for
                 occurrence of at least a 3-star evaluation. ∗∗∗ p < .001; ∗∗ p < .01; ∗ p < .05.


   When the interaction effects are included, the main effect of the ratings is no longer significant (Models 1b, 2b and 3b), but we get several interactions of ratings with condition and of condition with diversity. The interaction effects of condition with perceived diversity and track evaluations are visualized in Figure 6 by separating the resulting effects by condition; we discuss each condition and its interactions separately.
   The track evaluations had no effect on playlist evaluation in the base condition (they do for the other two conditions, as we will see below). Moreover, in the base condition perceived diversity has a negative effect, indicating that playlists with high perceived diversity were less attractive than playlists with low perceived diversity. One potential explanation could be that since these playlists were constructed using a low-spread approach, the recommendations were closely related to the users' known preferences (i.e. their top tracks that feed our algorithms). Therefore, the diversity in these users' preferences may have influenced the diversity of the recommended playlist. For instance, a person may listen to different genres during varying activities like working and sporting. The recommendations could then include music based on all these genres. While all recommendations are then closely related to the user's preferences and could receive potentially high evaluations, the playlist may not be very attractive due to the diversity in genres.
   In the genre condition, perceived diversity had no effect on playlist attractiveness. In this condition track evaluations strongly predicted playlist attractiveness regardless of diversity. The results show that although the genre playlists on average get a lower attractiveness score than base playlists, this difference shrinks when the aggregate ratings of the list are higher: in other words, only if users like the genre tracks do they like the playlist as much as the base playlist, which contains more low-spread, familiar tracks.
   The gmm condition showed results similar to the genre condition. Perceived diversity predicted attractiveness only marginally. However, while the track evaluations strongly predict attractiveness in the genre condition, they are only a weak predictor in the gmm condition.
In other words, high aggregate ratings cannot really make up for the fact that the gmm list in general is evaluated worse than the base list. As in the genre condition, this recommendation algorithm uses a high-spread approach and includes novel track recommendations. However, the gmm algorithm recommends tracks based on audio feature similarity, in contrast to genre similarity. Regardless of diversity or individual track evaluations, playlists generated with this approach were less attractive to participants.
   Overall we find that the overall attractiveness of a playlist is not always directly related to the liking of the individual tracks, as reflected by the aggregate ratings of the tracks, whether this is the mean rating, the peak-end value, or the fact that at least one track is highly rated. Some conditions are more sensitive to these aggregate ratings (genre) than others. We also see an important (negative) role of diversity for the base condition in predicting overall attractiveness, but no effect in the other two conditions. In other words, different aspects affect playlist evaluation, as recognized in the literature, but this highly depends on the nature of the underlying algorithm generating the recommendations.

Figure 5: Playlist attractiveness by track rating (mean). The dot size indicates the number of duplicate items of the playlist in the playlists of the other conditions.

Figure 6: Linear model of playlist attractiveness by track ratings and condition, for each condition.

5.2.3 Track-level relation between track evaluations and playlist evaluations. In this analysis, the effect of track evaluations on playlist evaluations is explored at the track level, trying to predict the overall attractiveness of each list with the individual track ratings rather than with the aggregate ratings. The results are shown in Table 3. Four types of track-level variables are included in the analysis, as described in Section 5.1.

Table 3: Playlist attractiveness by track evaluations (track-level)

                        Model 4
rating                  −0.009 (0.020)
genre                   −0.095∗∗∗ (0.009)
gmm                     −0.352∗∗∗ (0.009)
diversity               −0.027∗∗∗ (0.006)
high-ranked              0.003 (0.009)
highest rating           0.002 (0.012)
familiar                −0.011 (0.012)
playlist order           0.012∗ (0.005)
Constant                 0.704∗∗∗ (0.029)
N                        1850
Log Likelihood           630.272
AIC                     −1238.544
BIC                     −1177.791
R²GLMM(m)                .307
Random Effect
# of Participants        58
Participant SD           0.156

Note. SD = standard deviation. 'High-ranked' indicates the track was one of the top-3 recommendations, 'highest rating' indicates the track received the highest rating within that playlist for the participant, 'familiar' indicates whether the track was known to be familiar to the participant, and 'playlist order' indicates whether the playlist was the first (=1), second (=2), or third (=3) list that the participant evaluated. Interaction terms as in Models 1-3 were omitted due to similarity to these models. ∗∗∗ p < .001; ∗∗ p < .01; ∗ p < .05.
   Whether a track is high-ranked or received the highest rating shows no significant effect on the perceived attractiveness of the playlist. The track-level objective familiarity measures whether the user is familiar with the artists of a track: the user is considered familiar with a track if at least one artist of the track also appears among the artists related to the user's top listened tracks. Although we expected a positive effect of familiarity on playlist attractiveness (as also shown in [7]), no significant effect was observed in Model 4. A possible reason could be that the objective familiarity measure was not sufficient to cover all tracks the user is familiar with, since it is only computed from the user's top tracks (at most 50 per user). In future work, we plan to directly ask for (self-reported) familiarity rather than calculating it from the data. We also calculated a familiarity score for each track (how much the user is familiar with the track). We found a positive correlation between objective familiarity and track ratings (r_s(1770) = 0.326, p < .001): users give higher ratings to tracks they are more familiar with, which is in line with previous work on the mere exposure effect [1].
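The sketch below shows one way to derive the objective familiarity indicator described here (artist overlap between a recommended track and the artists related to the user's top tracks) and to compute a Spearman correlation with track ratings; the data structures are purely illustrative, as the paper does not specify the exact implementation, and the toy example does not reproduce the reported r_s.

```python
from scipy.stats import spearmanr

def familiar(track_artists, user_related_artists):
    """1 if at least one of the track's artists is related to the user's top tracks, else 0."""
    return int(bool(set(track_artists) & set(user_related_artists)))

# Illustrative inputs: per-track artist lists, the user's related-artist set, and the ratings.
tracks = [{"artists": ["a1", "a2"], "rating": 4.5},
          {"artists": ["a9"], "rating": 2.0},
          {"artists": ["a2", "a7"], "rating": 4.0}]
user_related = {"a1", "a2", "a3"}

flags = [familiar(t["artists"], user_related) for t in tracks]
rho, p = spearmanr(flags, [t["rating"] for t in tracks])  # the paper reports r_s(1770) = .326
```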
                                                                            While playlist evaluations can be partly predicted by evaluations of
on mere exposure effect [1].
                                                                            its tracks, other factors of the playlist are more predictive. People
     Playlist order is also a weak predictor of playlist attractiveness.
                                                                            seem to evaluate playlists on other aspects than merely its tracks.
Participants perceive the last playlist as the most attractive and the
                                                                            Even when individual tracks were rated positively, the playlist
first as the least attractive. However, when interaction terms as in
                                                                            attractiveness could be low.
models 1-3 are included the effect is no longer significant. We also
                                                                                We found that both diversity and recommendation approach
checked the condition orders generated by the random generator
                                                                            affected playlist attractiveness. Diversity had a negative effect on
and found that each condition order occurred approximately equally
                                                                            playlist attractiveness in recommenders using a low-spread method-
often. In other words, the effect of condition order can not explain
                                                                            ology. The track ratings were the most predictive for the playlist
difference across conditions.
                                                                            attractiveness in the recommendation approach based on genre
                                                                            similarity. Furthermore, inclusion of the highest and last track eval-
6    DISCUSSION OF RESULTS                                                  uation score (peak-end) was sufficient to predict playlist attractive-
                                                                            ness, performing just as well as the mean of the ratings.
We found that participants evaluate playlists on more aspects than
                                                                                When evaluating recommendation approaches in music recom-
merely the likeability of its tracks. Even though the tracks in rec-
                                                                            menders, it is important to consider which evaluation metric to
ommended playlists may be accurate and receive positive user
                                                                            use. Music is often consumed in succession leading to many factors
evaluations, playlists can still be evaluated negatively. In particu-
                                                                            other than track likeability that may influence whether people have
lar, the recommendation approach itself plays a role in the overall
                                                                            satisfactory experiences. Although individual track evaluations are
perceived playlist attractiveness.
                                                                            often used in recommender evaluation, track evaluations do not
   One explanation may be that users have different distinct musi-
                                                                            seem to predict playlist attractiveness very consistently.
cal styles. Playlists that contain music from more than one of the
                                                                                While we showed that playlist attractiveness is not primarily
users’ styles may be less attractive to the user even though the
                                                                            related to track evaluations, we were unable to effectively measure
track recommendations are accurate. Playlists in the base condition
                                                                            why certain algorithms generated more attractive playlists com-
are most attractive, but suffer most from diversity. Users with mul-
                                                                            pared to others. This question will be addressed in future work.
tiple musical styles may have received playlists with music from
                                                                            We intent to include a subjective measure for track familiarity. Fur-
multiple styles which could have been reflected in the perceived
                                                                            thermore, we will identify and attempt to separate distinct musical
diversity of the playlist. Playlists from the дenre condition were
                                                                            styles within user preferences. For example, we could give users
also based on genre similarity, in addition to the track and artist
                                                                            control about which top artists or top tracks they would like to use
similarity. Therefore, if multiple musical styles are present in the
                                                                            to generate recommendations as in [11] to separate the tracks and
user preferences, it is more likely in the дenre condition that the
                                                                            artists they like under different context.
musical style with the highest overall contribution overrules the
music from the other musical styles. Furthermore, the дmm con-
REFERENCES
 [1] Luke Barrington, Reid Oda, and Gert RG Lanckriet. 2009. Smarter than Genius? Human Evaluation of Music Recommender Systems. In ISMIR, Vol. 9. Citeseer, 357–362.
 [2] Dmitry Bogdanov, Martín Haro, Ferdinand Fuhrmann, Anna Xambó, Emilia Gómez, and Perfecto Herrera. 2013. Semantic audio content-based music recommendation and visualization based on user preference examples. Information Processing & Management 49, 1 (2013), 13–33.
 [3] Dirk Bollen, Bart P Knijnenburg, Martijn C Willemsen, and Mark Graus. 2010. Understanding choice overload in recommender systems. In Proceedings of the fourth ACM conference on Recommender systems. ACM, 63–70.
 [4] Òscar Celma and Perfecto Herrera. 2008. A new approach to evaluating novel recommendations. In Proceedings of the 2008 ACM conference on Recommender systems. ACM, 179–186.
 [5] Zhiyong Cheng. 2011. Just-for-Me: An Adaptive Personalization System for Location-Aware Social Music Recommendation. (2011).
 [6] Michael D. Ekstrand, F. Maxwell Harper, Martijn C. Willemsen, and Joseph A. Konstan. 2014. User perception of differences in recommender algorithms. Proceedings of the 8th ACM Conference on Recommender Systems - RecSys '14 (2014), 161–168. https://doi.org/10.1145/2645710.2645737
 [7] Bruce Ferwerda, Mark P Graus, Andreu Vall, Marko Tkalcic, and Markus Schedl. 2017. How item discovery enabled by diversity leads to increased recommendation list attractiveness. In Proceedings of the Symposium on Applied Computing. ACM, 1693–1696.
 [8] Sophia Hadash. 2019. Evaluating a framework for sequential group music recommendations: A Modular Framework for Dynamic Fairness and Coherence control. Master. Eindhoven University of Technology. https://pure.tue.nl/ws/portalfiles/portal/122439578/Master_thesis_shadash_v1.0.1_1_.pdf
 [9] Shobu Ikeda, Kenta Oku, and Kyoji Kawagoe. 2018. Music Playlist Recommendation Using Acoustic-Feature Transition Inside the Songs. (2018), 216–219. https://doi.org/10.1145/3151848.3151880
[10] Tristan Jehan and David Desroches. 2004. Analyzer Documentation [version 3.2]. Technical Report. The Echo Nest Corporation, Somerville, MA. http://docs.echonest.com.s3-website-us-east-1.amazonaws.com/_static/AnalyzeDocumentation.pdf
[11] Yucheng Jin, Bruno Cardoso, and Katrien Verbert. 2017. How do different levels of user control affect cognitive load and acceptance of recommendations? In CEUR Workshop Proceedings, Vol. 1884. CEUR Workshop Proceedings, 35–42.
[12] Yucheng Jin, Nava Tintarev, and Katrien Verbert. 2018. Effects of personal characteristics on music recommender systems with different levels of controllability. (2018), 13–21. https://doi.org/10.1145/3240323.3240358
[13] Daniel Kahneman. 2011. Thinking, fast and slow. Macmillan.
[14] Iman Kamehkhosh and Dietmar Jannach. 2017. User Perception of Next-Track Music Recommendations. (2017), 113–121. https://doi.org/10.1145/3079628.3079668
[15] Bart P Knijnenburg, Martijn C Willemsen, Zeno Gantner, Hakan Soncu, and Chris Newell. 2012. Explaining the user experience of recommender systems. User Modeling and User-Adapted Interaction 22, 4-5 (2012), 441–504.
[16] Arto Lehtiniemi and Jukka Holm. 2011. Easy Access to Recommendation Playlists: Selecting Music by Exploring Preview Clips in Album Cover Space. Proceedings of the 10th International Conference on Mobile and Ubiquitous Multimedia (2011), 94–99. https://doi.org/10.1145/2107596.2107607
[17] Martijn Millecamp, Nyi Nyi Htun, Yucheng Jin, and Katrien Verbert. 2018. Controlling Spotify Recommendations. (2018), 101–109. https://doi.org/10.1145/3209219.3209223
[18] Daniel Müllensiefen, Bruno Gingras, Lauren Stewart, and Jason Ji. 2013. Goldsmiths Musical Sophistication Index (Gold-MSI) v1.0: Technical Report and Documentation Revision 0.3. Technical Report. Goldsmiths University of London, London. https://www.gold.ac.uk/music-mind-brain/gold-msi/
[19] F. Pachet, G. Westermann, and D. Laigre. 2001. Musical data mining for electronic music distribution. Proceedings - 1st International Conference on WEB Delivering of Music, WEDELMUSIC 2001 (2001), 101–106. https://doi.org/10.1109/WDM.2001.990164
[20] Steffen Pauws and Berry Eggen. 2003. Realization and user evaluation of an automatic playlist generator. Journal of New Music Research 32, 2 (2003), 179–192.
[21] Alexander Rozin, Paul Rozin, and Emily Goldberg. 2004. The feeling of music past: How listeners remember musical affect. Music Perception: An Interdisciplinary Journal 22, 1 (2004), 15–39.
[22] Thomas Schäfer, Doreen Zimmermann, and Peter Sedlmeier. 2014. How we remember the emotional intensity of past musical experiences. Frontiers in Psychology 5 (2014), 911.
[23] Markus Schedl, Arthur Flexer, and Julián Urbano. 2013. The neglected user in music information retrieval research. Journal of Intelligent Information Systems 41, 3 (2013), 523–539.
[24] Markus Schedl, Hamed Zamani, Ching-Wei Chen, Yashar Deldjoo, and Mehdi Elahi. 2018. Current challenges and visions in music recommender systems research. International Journal of Multimedia Information Retrieval 7, 2 (2018), 95–116.
[25] Morgan K Ward, Joseph K Goodman, and Julie R Irwin. 2014. The same old song: The power of familiarity in music choice. Marketing Letters 25, 1 (2014), 1–11.
[26] Eelco C. E. J. Wiechert. 2018. The peak-end effect in musical playlist experiences. Master. Eindhoven University of Technology.
[27] Martijn C. Willemsen, Mark P. Graus, and Bart P. Knijnenburg. 2016. Understanding the role of latent feature diversification on choice difficulty and satisfaction. User Modelling and User-Adapted Interaction 26, 4 (2016), 347–389. https://doi.org/10.1007/s11257-016-9178-6