=Paper= {{Paper |id=Vol-2225/paper2 |storemode=property |title=A Diversity Adjusting Strategy with Personality for Music Recommendation |pdfUrl=https://ceur-ws.org/Vol-2225/paper2.pdf |volume=Vol-2225 |authors=Feng Lu,Nava Tintarev |dblpUrl=https://dblp.org/rec/conf/recsys/LuT18 }} ==A Diversity Adjusting Strategy with Personality for Music Recommendation== https://ceur-ws.org/Vol-2225/paper2.pdf
       A Diversity Adjusting Strategy with Personality for Music
                           Recommendation
                               Feng Lu                                                         Nava Tintarev
                   Delft University of Technology                                      Delft University of Technology
                       Delft, the Netherlands                                              Delft, the Netherlands
                     F.Lu-1@student.tudelft.nl                                              n.tintarev@tudelft.nl
ABSTRACT                                                                 as to mitigate the ‘cold-start problem’. This paper combines these
Diversity-based recommender systems aim to select a wide range           two branches of research together.
of relevant content for users, but diversity needs for users with           To address this gap, we investigate how personality information
different personalities are rarely studied. Similarly, research on       can be used to adjust the diversity degrees for people with dif-
personality-based recommender systems has primarily focused on           ferent personalities in music recommender systems. Our research
the ‘cold-start problem’; few previous works have investigated how       questions are therefore:
personality influences users’ diversity needs. This paper combines            • RQ1: Is there an underlying relationship between people’s
these two branches of research: re-ranking for diversification,                 personality and their needs for recommendation diversity in the
and improving accuracy using personality traits. Anchored                       music domain?
in the music domain, we investigate how personality information               • RQ2: What is the effect (on diversity and accuracy) of adjusting
can be used to adjust the diversity degrees for people with different           the diversity degrees in Music Recommender Systems based on
personalities. We proposed a personality-based diversification algo-            users’ personality information?
rithm to help enhance the diversity adjusting strategy according to         First, to address RQ1 we conduct a pilot user study to investi-
people’s personality information in music recommendations. Our           gate whether there exists a relationship between users’ personality
offline and online evaluation results demonstrate that our proposed      information and their diversity needs on music preference
method is an effective solution to generate personalized                 (Section 5.1). A relation model is built based on the pilot study results
recommendation lists that not only have relatively higher diversity and  (Section 5.2). To address RQ2, we proposed a personality-based
accuracy, but also lead to increased user satisfaction.                  diversification algorithm based on this relation model (Section 4).
                                                                         Our proposed diversification method adjusts the diversity degrees
CCS CONCEPTS                                                             adaptively in music recommendations according to users’ distinct
• Information systems → Recommender systems;                             personality information. Both offline (Section 5.3) and online stud-
                                                                         ies (Section 5.5) evaluate the efficiency and effectiveness of our
KEYWORDS                                                                 proposed algorithm. We conclude with limitations and suggestions
                                                                         for future work in Sections 7 and 8.
Recommender Systems, Diversity, Personality, Music Recommen-
dation, Re-ranking                                                       2     RELATED WORK
                                                                         Existing research on personality-based and diversity-based
                                                                         Recommender Systems is mostly separate [7]. We first discuss research
1   INTRODUCTION                                                         on diversity-based recommender systems, and then related research
As recommender systems have moved beyond accuracy for evalua-            on personality-based recommender systems.
tion, metrics such as diversity and novelty have been proposed to
evaluate the quality of recommender systems [20]. Such research          2.1    Diversity-based Recommender Systems
[11, 18] also helps to address the ‘filter bubble’ problem [21].         Many current diversity-oriented recommender systems [9, 31, 33]
Research on the diversity metric has contributed to the emergence of     adopt a fixed strategy to adjust the diversity degree for all users,
diversity-based recommender systems, which endeavor to achieve           in which they usually pre-define a score function balancing the
an optimal balance between accuracy and diversity [25]. In addition,     diversity and accuracy with a parameter λ and re-rank the
researchers suggest that there exists a connection between people’s      generated recommendation list according to the calculated scores. For
stable personality traits and their tastes and preferences [24]. The     instance, Ziegler et al. [33] proposed the topic diversification
idea of personality-based recommender systems is thus proposed.          approach towards balancing recommendation lists, which is a heuristic
   However, current research on personality and diversity based          algorithm based on taxonomy similarity to increase the
recommender systems is mostly separate [7]. In many diversity-based      recommendation diversity. To balance the accuracy of suggestions
recommender systems [9, 31, 33], researchers usually set a               and the user’s extent of interest in specific topics, they defined a
fixed balance degree between accuracy and diversity for all users.       weighting parameter to control the impact of two ranking lists, one
Adjusting to diversity needs for users with different personalities      ranking the items that are similar to the user’s attribute-based
is rarely studied. Similarly, research on personality-based              preference and the other ranking the items in reverse. Vargas et al. [31]
recommender systems [14, 28] has utilized personality information to     also proposed a similar re-ranking diversification method based
improve the calculation of user similarity in recommendations so         on sub-profiles of users, in which they also used a parameter λ to
IntRS Workshop, October 2018, Vancouver, Canada                                                                    Feng Lu and Nava Tintarev


control the balance between the initial ranking score and diversity       recommendations. However, our research differs in its domain
score. Building on this work, Di Noia et al. adjusted the diversity       and in the algorithms that were applied.
function adaptively according to users’ diversity inclination [9],
which is calculated as Entropy from user preferences. The balancing       3     DIVERSITY AND PERSONALITY IN
parameter λ in their objective function is still fixed for all users,           RECOMMENDATIONS
which means that although their diversity function might be ad-
                                                                          Before we proceed into our research steps, we first explain two key
justed properly, the recommendation balance between similarity
                                                                          concepts as they are defined in our paper: diversity and personality.
and diversity is still the same for all users. All of these works
                                                                          In this section, we mainly focus on the diversity metric, the
successfully increased the diversity in recommendations, but they rarely
                                                                          personality model and the corresponding extraction methods we applied
consider that different users with different personalities may have
                                                                          in our research.
different diversity needs.

2.2     Personality-based Recommender Systems                             3.1    Diversity
Recent works have explored the relationship between personality           Diversity is usually considered as the inverse of similarity, which
traits and user preferences in recommender systems [15, 28, 29].          refers to recommending a diverse set of items that are different from
Studies also show that personalities influence human decision mak-        each other to users [20]. This concept has been introduced into the
ing process and interests for music and movies [6, 13, 24], which         field of recommender systems as one of the possible solutions to
implies that personality information should be considered if we           address the over-fitting problem and to increase users’ satisfaction.
want to deliver personalized recommendations. In addition,                    In this paper, we focus on Intra-List Diversity (ILD). Research on
since users’ attitudes towards new or diverse experiences                 the definition and evaluation of Intra-List Diversity
vary considerably [26], personality can also be considered a              starts with Bradley and Smyth [3], who define diversity as the
key aspect when incorporating novelty and diversity into                  average pairwise distance (dissimilarity) between all items in the
recommendations, which means that the degree of diversity in presenting   recommendation set, which can be calculated as follows:
                                                                          ILD(R) = \frac{\sum_{i=1}^{n} \sum_{j=i}^{n} \left(1 - \mathrm{Similarity}(c_i, c_j)\right)}{n(n-1)/2}    (1)
recommended items can also be personalized.
   In contrast, most of the research work [13, 15] in personality-
based recommender systems is designed to improve the user
similarity calculation in recommendations to address the cold-start       where c_1, ..., c_n are the items in the recommended list R.
problem. Diversity degrees are usually the same for all users. They       Other metrics have also been proposed, such as Vargas et
rarely consider that different users might also possess different         al.’s ILD metric [30] or the Gini-coefficient measurement [10]. In this
attitudes towards the diversity of items, which means that personality    work we mainly refer to Equation 1 for the diversity metric.
information can also be useful when adjusting diversity degrees
in Recommender Systems. As some recent studies have already               3.2    Personality
shown that personality can affect people’s needs for diversity de-        Personality represents people’s differences in their enduring emo-
grees for items either in movie recommendations [6, 32] or book           tional, interpersonal, experiential, attitudinal and motivational styles
recommendations [26], people with different personalities may also        [17]. For the last few decades, a number of personality models
need recommendations with different diversity degrees for music           and acquisition methods have been proposed. In this research, we
recommendations.                                                          mainly focus on the Big-Five Factor Model and the explicit acquisi-
                                                                          tion methods (specifically, Ten-Item Personality Inventory).
2.3     Addressed Research Gap
This paper addresses the gaps in the two research branches by                Personality Model. We adopted one of the most commonly used
combining them. The main contributions of our research are threefold:     personality models, the Big-Five Factor Model (FFM) [19],
                                                                          which defines personality as five factors: Openness to Experience
      • We investigated the relation between users’ personality fac-
                                                                          (O), Conscientiousness (C), Extraversion (E), Agreeableness (A), and
        tors and their diversity needs on music preference and found
                                                                          Neuroticism (N). Usability of this model in recommender systems
        that there exist certain positive correlations between these
                                                                          can be found in Recommender Systems Handbook [27].
        two factors.
      • We proposed a personality-based re-ranking diversification           Extraction Method. Current acquisition methods for personality
        algorithm, which can adaptively set different diversity levels    can be classified into two groups: explicit methods (using
        for users based on their personalities in music                   questionnaires) and implicit methods (extracting personality from social
        recommendations.                                                  networks). We used the explicit method considering that it is more
      • We evaluated this strategy in both online and offline studies,    accurate than the implicit methods [27]. Specifically, we adopted a
        which suggest that this approach is effective for improving       short personality test called the Ten Item Personality Inventory (TIPI)
        diversity, accuracy, and user satisfaction.                       [12] since it takes users less time to finish. In TIPI, each
   To the best of our knowledge, in the music domain, we are the          personality factor of FFM is assessed by two questions. For instance,
first to conduct such a systematic user study on the correlation          extraversion is assessed by ‘Extraverted, enthusiastic’ and ‘Reserved,
between personality and users’ diversity needs. In the movie domain,      quiet’. Each question (ten in total) can be rated from 1 to 7, which
Wu and Chen et al. [6, 32] conducted similar research on movie            can then be mapped into five personality factor scores. The scores

                                                                           Algorithm 1 The Diversification Algorithm to generate the re-
                                                                           ranked list R from the original list O
                                                                           Input: Original recommendation list O (length: 5N), target list
                                                                               size N, personality-related parameters λ, θ_1, θ_2, ..., θ_n
                                                                           Output: Top-N re-ranked list R
                                                                            1: R(1) ⇐ O(1)
                                                                            2: while |R| < N do
                                                                            3:    Div_overall(c, R) = Σ_{i=1,2,..,n} θ_i ∗ Div_i(c, R)
                                                                            4:    c∗ = argmin_{c ∈ O\R} Obj(c, R) = Sim(c, P) ∗ (1 − λ) + λ ∗
                                                                                  Div_overall(c, R)
                                                                            5:    R = R ∪ {c ∗ }
                                                                            6:    O = O \ {c ∗ }
                                                                            7: end while
                                                                            8: return R
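
The greedy loop of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: items are hypothetical dictionaries whose attributes are sets, attribute similarity is taken to be Jaccard similarity, and Div_i(c, R) is computed as the ILD (Equation 1) of R ∪ {c}. Because the objective is minimized and Sim is a rank (1 = best), the sketch also assumes that the diversity term is expressed as a rank in which the candidate adding the most diversity gets rank 1, in the spirit of the rank merging in Ziegler et al. [33]; the text does not spell out this detail.

```python
from itertools import combinations


def jaccard(a, b):
    """Similarity of two attribute value sets (assumed measure, not from the paper)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def ild(items, attr):
    """Equation 1 on one attribute: mean pairwise dissimilarity over pairs i < j.
    Self-pairs are skipped; they would add zero because Similarity(c, c) = 1."""
    pairs = list(combinations(items, 2))
    if not pairs:
        return 0.0
    return sum(1.0 - jaccard(x[attr], y[attr]) for x, y in pairs) / len(pairs)


def rerank(original, n, lam, thetas):
    """Greedy re-ranking in the spirit of Algorithm 1.

    original: list O of item dicts, best predicted item first
    n:        target list size N
    lam:      balance parameter lambda in [0, 1]
    thetas:   {attribute name: normalized weight theta_i}
    """
    if not original:
        return []
    remaining = list(range(len(original)))   # indices into O; index + 1 is Rank(c, O)
    chosen = [remaining.pop(0)]              # line 1: R(1) <= O(1)
    while len(chosen) < n and remaining:
        selected = [original[i] for i in chosen]
        # line 3: weighted overall diversity if candidate c joined R
        div = {
            c: sum(theta * ild(selected + [original[c]], attr)
                   for attr, theta in thetas.items())
            for c in remaining
        }
        # assumption: turn diversity into a rank so that lower is better
        by_div = sorted(remaining, key=lambda c: -div[c])
        div_rank = {c: pos for pos, c in enumerate(by_div, start=1)}
        # line 4: pick the candidate that minimizes the objective
        best = min(remaining,
                   key=lambda c: (1 - lam) * (c + 1) + lam * div_rank[c])
        chosen.append(best)                  # line 5
        remaining.remove(best)               # line 6
    return [original[i] for i in chosen]


# Hypothetical toy catalogue, best predicted item first.
tracks = [
    {"id": "a", "artists": {"X"}, "genres": {"rock"}},
    {"id": "b", "artists": {"X"}, "genres": {"rock"}},
    {"id": "c", "artists": {"Y"}, "genres": {"jazz"}},
    {"id": "d", "artists": {"Z"}, "genres": {"rock"}},
]
weights = {"artists": 0.5, "genres": 0.5}
print([t["id"] for t in rerank(tracks, 2, 0.2, weights)])
print([t["id"] for t in rerank(tracks, 2, 0.8, weights)])
```

With a small λ the second slot stays with the next-best predicted track, while a large λ favours the track that differs most from what is already selected. In the paper's setting the input list would hold 5N items rather than four.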


                     Figure 1: Research Steps
                                                                           be much larger than the final re-ranked list (with N items). In our
                                                                           algorithm, we use 5N items for the input list.
                                                                              The balancing parameter λ in the objective function (line 4, Al-
of each factor can be further mapped into four different personality       gorithm 1) is controlled by personality factors in our algorithm.
levels: Low, Medium Low, Medium High, and High [12].                       To adjust the diversity degrees more flexibly, we also introduce
                                                                           parameters θ_1, θ_2, ..., θ_n to control the computation of the overall
4   PERSONALITY-BASED RE-RANKING                                           diversity. Both kinds of parameters (λ, θ_1, θ_2, ..., θ_n) are
                                                                           affected by the personality factors.
    DIVERSIFICATION
In this section, we discuss the core of our work: the personality-         4.1    Objective Function
based diversification algorithm, which consists of a) an objective
                                                                           The core of the algorithm lies in the re-ranking objective function in
function, and b) personality related parameters.
                                                                           line 4 (Algorithm 1), which is adapted from the Maximal Marginal
    In a later section we will describe the pilot user study in which we
                                                                           Relevance (MMR) [4]:
identified the relationship between users’ personality information
and their diversity needs on music preference (Section 5.1). The
results of that pilot study inform the parameters for both a) and b)               \mathrm{Obj}(c, R) = \mathrm{Sim}(c, P) \cdot (1 - \lambda) + \lambda \cdot \mathrm{Div}_{overall}(c, R)         (2)
above. Figure 1 outlines our overall research methodology, including
                                                                              The left part of the function Sim(c, P) considers the similarity
offline and online studies to evaluate our algorithm.
                                                                           aspect of the item c to users’ initial interests P. In our work, we
    Normally, the recommendation process of a recommender sys-
                                                                           computed the similarity values as the rank of item c in the final list
tem can be divided into two steps: first the system generates the
                                                                           according to their predicted ratings sorted in the descending order.
predicted values for all unrated items for each user and secondly
                                                                           We did not use the predicted ratings directly considering that such
these items are sorted in descending order according to their
                                                                           predicted values may not be available for all recommender systems
predicted values. To improve the diversity degrees of
                                                                           (e.g. Spotify). Thus, our Sim(c, P) function becomes:
the recommendations, we use re-ranking as an improvement to
the second step. We borrow the idea of the Topic Diversification                                    Sim(c, P) = Rank(c, O)                          (3)
method presented in Ziegler et al.’s work [33]. Specifically, greedy       where Rank(c, O) represents the rank of item c in the original rec-
heuristics are used in our work, which have been demonstrated to           ommendation list O generated by some recommendation algorithm.
be efficient and effective [9, 33]. The diversification algorithm is          The other part of the function, Div_overall(c, R), defines the
shown in Algorithm 1.                                                      overall diversity degree of the item c compared with the items so far
    This greedy algorithm iteratively selects an item from the             selected in the re-ranked list R. Here, we define the overall
original list O (generated directly from a recommender system) and         diversity as the weighted combination of several diversity degrees for
puts it at the end of the current re-ranked list R until the size          different attributes (e.g., track attributes like artists and genres in music
of R reaches N (N=10 in our case) and the re-ranking process               recommendation). As shown in line 3 (Algorithm 1), the diversity
is complete. The core of the algorithm lies in the objective function      function is defined as follows:
(line 4, Algorithm 1) which controls the balance between similarity
and diversity, so that at each re-ranking step, the algorithm can          \mathrm{Div}_{overall}(c, R) = \sum_{i=1}^{n} \theta_i \cdot \mathrm{Div}_i(c, R)       (4)
pick the next item that minimizes the objective function as the next
item to be placed at the end of the current diversified re-ranked list.    where n represents the total number of attributes we used for
The target list is a re-ranked list with N top-ranked items (called        computing the overall diversity Div_overall(c, R), θ_i represents the
Top-N items). In order to perform the re-ranking algorithm to make         weight for each attribute diversity degree. Div_i(c, R) represents the
the re-ranked list diverse enough, the size of the input list should       different diversity degrees for different attributes, which is defined

Table 1: Mapping from Personality Factor Level to Personality Related Parameters

 Personality Factor Level   Low   Medium Low   Medium High   High
 λ/θ_1/θ_2/.../θ_n          0.2   0.4          0.6           0.8

Table 2: Demographic profiles of the pilot study (numbers in brackets are the number of users).

 Age               ≤20 (5); 21-30 (83); 31-40 (32); 41-50 (18); 51-60 (5); ≥ 60 (5)
 Gender            Male (96); Female (47); Not tell (5)
 Nationality       Asia (53); Europe (38); South America (42); North America (12); Africa (3)
 Education Level   Graduate School (83); College (45); High School (20); Others (2)



as ILD (equation 1). In our experiment, we used three attributes                       5.1       Pilot study
(n=3) that are closely correlated with the personality factors found
                                                                                       To address our first research question, we conducted a pilot study, in
in our pilot user study, which we will introduce later in Section 5.2.
                                                                                       which we collected users’ personality information and their music
   The function of the control parameter λ will be explained in
                                                                                       preferences (preferred songs). We designed a website 1 for the user
Section 4.2 and at the end of Section 5.2.
                                                                                       survey. The survey contains four main parts:
                                                                                           • User’s basic information: Collecting users’ demographic
4.2    Personality Related Parameters                                                         information such as their age range and gender.
So far, we have defined our similarity function and diversity                              • Personality test: The personality test in our pilot study is
function, but we have not yet incorporated the personality information.                       conducted via the TIPI, in which users need to answer ten
In our algorithm, the influence of the personality factors is exerted                         self-assessment questions. Each question should be rated
on the parameters (λ, θ_1, θ_2, ..., θ_n) in our objective function.                          from 1 to 7, from ‘Disagree strongly’ to ‘Agree strongly’ (e.g.,
   Parameter λ affects the balance between similarity and diversity                           I see myself as extraverted, enthusiastic).
directly, and thus controls the degree of overall diversity needs.                         • Music preference collection: Users’ music preference is
Parameters θ_1, θ_2, ..., θ_n control the specific attribute diversity degrees                collected by means of Spotify Web API, with which users
accordingly. As mentioned in Section 3, each personality factor can                           are asked to provide at least 20 preferred songs that they
be divided into four levels: Low, Medium Low, Medium High,                                    normally listen to and can best describe their music taste.
and High. For each possible correlation between personality fac-                              Users are also asked to rate their selected songs from 1 to 5
tors and overall/attribute diversity needs, we define their mapping                           (least preferred to most preferred).
function as follows in Table 1.                                                            • User comments: A free-text comment section is included.
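
The personality test in the survey yields ten ratings per user, which have to be reduced to five factor scores before any level can be assigned. The sketch below shows one way to do this. It assumes the standard TIPI scoring key (each factor is the mean of one item and one reverse-scored item, with reversal computed as 8 minus the rating); that key is not stated in this paper, and the function name is hypothetical. The further step from factor scores to the four levels (Low to High) is left out because the cut points are not given here.

```python
# Assumed standard TIPI key: (regular item, reverse-scored item), 1-based item numbers.
TIPI_KEY = {
    "Extraversion": (1, 6),
    "Agreeableness": (7, 2),
    "Conscientiousness": (3, 8),
    "Emotional Stability": (9, 4),
    "Openness": (5, 10),
}


def tipi_scores(ratings):
    """Map ten ratings (each 1..7, in questionnaire order) to five factor scores."""
    if len(ratings) != 10 or any(not 1 <= r <= 7 for r in ratings):
        raise ValueError("TIPI needs ten ratings between 1 and 7")
    scores = {}
    for factor, (regular, reverse) in TIPI_KEY.items():
        reversed_rating = 8 - ratings[reverse - 1]   # 7 becomes 1, 1 becomes 7
        scores[factor] = (ratings[regular - 1] + reversed_rating) / 2
    return scores


print(tipi_scores([1, 2, 3, 4, 5, 6, 7, 1, 2, 3]))
```

Each resulting score lies between 1 and 7, so the same scale applies to all five factors.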
   For θ_1, θ_2, ..., θ_n, we take one more computation step:
normalization. Thus, the final θ_1/θ_2/.../θ_n are computed as follows:                5.2       Pilot Study Results
      \theta_i = \frac{\theta_i}{\sum_{j=1}^{n} \theta_j}, \quad i = 1, 2, \ldots, n           (5)
                                                                                       We spread the survey via two channels: Crowdsourcing platforms
                                                                                       and students at several universities (e.g., TU Delft, Netherlands;
                                                                                       EPFL, Switzerland; and Lanzhou University, China). The majority
                                                                                       (around 80%) of the participants are recruited from Crowdflower
   Note that, in order to conduct the mapping, we need to know                         (now called Figure Eight) 2 . To ensure the quality of the data
the correlation between each personality factor and users’                             collected, we also inserted some test questions into the survey to help
overall/attribute diversity needs beforehand. Parameter λ is decided by                us filter suspicious responses. On the Crowdflower platform,
the personality factor that has a positive correlation with the overall                workers also need to submit their contributor ids and verification codes
diversity needs (in our case, Emotional Stability), while                              which are displayed at the end of the survey. These verification
parameters θ_1, θ_2, ..., θ_n are decided by the personality factors that              methods helped us remove a number of irresponsible participants,
are correlated with the attribute diversity needs. The specific                        especially from the Crowdsourcing platform. Results for the user
corresponding personality factors for each parameter for the mappings                  survey are shown below.
will be shown in Section 5.2.
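
The parameter mapping of this section (Table 1 followed by the normalization in Equation 5) can be sketched as below. The level-to-value table is taken from Table 1; the function name and the attribute names in the usage example are illustrative assumptions, since the pairing of personality factors with attributes is only given later in the paper.

```python
# Table 1: personality factor level -> raw parameter value
LEVEL_TO_PARAM = {"Low": 0.2, "Medium Low": 0.4, "Medium High": 0.6, "High": 0.8}


def personality_parameters(overall_level, attribute_levels):
    """Derive lambda and the normalized theta weights for one user.

    overall_level:    level of the factor correlated with overall diversity needs
    attribute_levels: {attribute name: level of the factor correlated with it}
    """
    lam = LEVEL_TO_PARAM[overall_level]
    raw = {attr: LEVEL_TO_PARAM[level] for attr, level in attribute_levels.items()}
    total = sum(raw.values())
    thetas = {attr: value / total for attr, value in raw.items()}   # Equation 5
    return lam, thetas


# Hypothetical user: High on the factor tied to overall diversity, with
# illustrative levels for the factors tied to two attributes.
lam, thetas = personality_parameters("High", {"artists": "High", "genres": "Medium Low"})
print(lam, thetas)
```

Only the θ weights are normalized, so they sum to 1, while λ keeps its Table 1 value and continues to set the trade-off between similarity and overall diversity.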
                                                                                          Participants. 148 participants were recruited to participate in
                                                                                       the survey; the demographic properties of these participants are
5     EXPERIMENT                                                                       shown in Table 2.
Following our research steps in Figure 1, we first conducted a pilot
study to explore the possible correlation between users’ personality                      Relation between Personality Factors and Single Attribute
factors and the diversity needs on their music preferences. Our                        Diversity of Music Preference. When studying the correlation
diversity adjusting strategy (in Section 4) is thus based on the find-                 between personality factors and each attribute’s diversity degrees,
ings in the pilot study. To evaluate the efficiency and effectiveness                  we first calculated the personality scores for each user from the TIPI
of our proposed personality-based diversification algorithm, we                        question scores. Then, we computed the diversity scores for each
conducted both offline and online evaluation. Due to the page limit,                   attribute within the list of tracks a user has selected using the ILD
we discuss our pilot study and offline evaluation only briefly.                        (Equation 1) metric. For each track, we have chosen six attributes to
Results for both pilot study and offline evaluation will be shown in                   compute specific diversity degrees: Release Times, Artists, Number of
this section. We will show the results for the online evaluation in                    1 Available at: https://music-rs-personality.herokuapp.com
the next section.                                                                      2 Crowdflower: https://www.figure-eight.com

Table 3: Spearman correlation coefficient between personality factors/demographic values and diversity degrees w.r.t. single attribute (*p-value<0.05 and **p-value<0.01). E: Extraversion, A: Agreeableness, C: Conscientiousness, ES: Emotional Stability, O: Openness.

                       E        A        C       ES       O       Gender   Age
 Div(Release times)    -0.03    -0.12    0.01    0.11     -0.15   0.00     0.28**
 Div(Artists)          0.10     0.09     0.11    0.22**   -0.04   -0.03    -0.16
 Div(Artists number)   0.00     0.25**   0.13    0.15     0.07    0.06     -0.14
 Div(Genres)           0.07     0.00     -0.01   0.25**   0.06    0.06     0.03
 Div(Tempo)            0.11     0.09     0.11    0.24**   0.08    -0.17*   -0.02
 Div(Key)              0.21**   0.05     0.06    0.17*    0.08    -0.13    -0.10

Table 4: Spearman correlation coefficient between personality factors/demographic values and overall diversity (**p-value<0.01). E: Extraversion, A: Agreeableness, C: Conscientiousness, ES: Emotional Stability, O: Openness.

                 E       A       C      ES       O      Gender   Age
 Overall_Div1    0.11    0.09    0.08   0.31**   0.03   0.01     -0.05
 Overall_Div2    0.11    0.08    0.08   0.28**   0.01   0.00     -0.10
 Overall_Div3    0.12    0.06    0.07   0.29**   0.02   0.00     -0.09

                                                                                    5.3      Offline Evaluation
                                                                                    Since our diversification algorithm is built upon a re-ranking algo-
                                                                                    rithm (a diversification method by re-ordering the recommendation
Artists, Genres, and two audio features (Tempo and Key). Spearman’s                 list), its final diversity degree is affected by some re-ranking related
rank correlation coefficient was used to calculate the correlation                  parameters such as the size of the final top-N re-ranked list (N). The
between the five personality factors and the diversity scores for each              personality related parameters (λ, θ 1 , θ 2 , θ 3 , see Section 4.2) will
attribute. In addition, considering that some demographic values                    also greatly influence the final diversity degrees of the recommen-
might also have some impact on the diversity needs for users when                   dation lists. Thus, we have conducted a series of offline evaluations
delivering recommendations, we also included two demographic                        to test the influence of different parameters. The parameters we
values (age and gender) in the correlation comparison. Results are                  tested are:
shown in Table 3.                                                                         • The size of the final top-N re-ranked list (N).
   Relation between Personality Factors and Overall Diversity.                            • The size of the input list (LS).
Besides studying the correlation between the personality factors                          • The size of the unrated items used for recommendation (K).
and diversity scores for single attribute, we also computed the cor-                      • Personality related parameters λ.
relation between the overall diversity and user’s personality values.                   In order to generate initial recommendations with high quality,
Considering that different users usually place different weights                    we used a state-of-the-art recommendation algorithm called Fac-
on attributes (e.g. some user may consider that the diversification                 torization Machine (specifically, fastFM [1]) [23]. To train the FM
of Artists is the most important), we assigned three different sets                 sufficiently, we combined our pilot study dataset (148 users’ data in
of weights to the six attributes (Release Times, Artists, Number of                 Section 5.2) with a complementary dataset with much larger user
Artists, Genres, Tempo, Key) in reference to [16]: Overall_Div1:                    data: The Echo Nest Taste Profile Subset (TPS) 3 [2]. We made a
‘Equal weights method’ (1/6, 1/6, 1/6, 1/6, 1/6, 1/6); Overall_Div2:                few data selection beforehand. We first ruled out those tracks that
‘Rank-order centroid (ROC) weights’ (0.41, 0.24, 0.16, 0.10, 0.06,                  have only been listened to once. Then we ruled out those users
0.03); Overall_Div3: ‘Rank-sum (RS) weights’ (0.29, 0.24, 0.19, 0.14,               who listened to fewer than 100 tracks in total. The TPS dataset only
0.09, 0.05).                                                                        contains track play counts. We further mapped the play counts
   From Table 3 and 4, we concluded four important correlations.                    into the integer ratings (1-5) using the rating mapping algorithm
For single attribute diversity, we find:                                            mentioned in [5].
                                                                                        We then first split our pilot study dataset into two subsets: train-
     • C1. Personality factor Extraversion has a positive correlation
                                                                                    ing Set M 1 and testing set T. T contains the top-5 rated tracks
        with the diversity degree of Key.
                                                                                    (ratings all ≥ 4) for each user, which we will consider as the rele-
                                                                                    vant items to each user. The remaining user data of the pilot study
     • C2. Personality factor Agreeableness has a positive correla-
                                                                                    dataset (M 1 ) is combined with the TPS subset (M 2 ) to form our
       tion with the diversity degree of Artists Number.
                                                                                    whole training set M. After training the FM, we used this FM to
                                                                                    generate recommendations for users in the testing set T.
    • C3. Personality factor Emotional Stability has a positive cor-
       relation with the diversity degrees of Artist, Genre and Tempo.                 Hit Rate. The first metric we used in our offline evaluation was
   We also find that: C4. Personality factor Emotional Stability has                an accuracy measure. Hit rate was chosen due to the large item
a positive correlation with the overall diversity degree.                           count (number of distinct tracks) and the small number of listening
                                                                                    history per user [8]. Instead of using all unseen items (all items
   These correlations can then be used to map the parameters in                     not used for training for each user) for prediction and counting the
our diversification algorithm (c.f., Section 4.2). Specifically, λ is ad-           number of ‘hits’ (relevant items) in the top-N list, in our testing
justed according user’s Emotional Stability level. We used three                    method, each relevant item (known top-5 rated relevant items for
attribute diversity in the later experiment (see Section 5.5): Genre,               each user) in the Testing Set is evaluated separately by combining
Artists Number, and Key. Thus, θ 1 , θ 2 , and θ 3 are adjusted according           it with K (we used K=100) other items that this user has not rated.
to user’s Emotional Stability, Agreeableness and Extraversion                       3 The Echo Nest Taste profile subset: http://labrosa.ee.columbia.edu/millionsong/
respectively.                                                                       tasteprofile, extracted in July, 2018

We assume that these unrated items will not be of interest to user u, representing the
irrelevant items. The task of the FM is then to rank these K+1 items for each user. For each
user, we generate the two recommendation lists: the initial recommendation list (top-N items
from the initial list generated by the FM) and our re-ranked list. We then check whether this
item is in the two lists. If it is, we consider it a hit; if not, we consider it a miss. This
process is repeated for each item in the testing set. The final hit rate is computed as:
H(N) = #hit / |T|.

  ILD. We also compare the diversity degrees for both recommendation lists using intra-list
diversity.

5.4     Offline Evaluation Results
Our offline evaluation results show that, for both N and LS (K is fixed to 100), the hit rate
for both lists increases when N and LS increase (the hit rate for our re-ranked list is always
higher than that of the initial list). Diversity degrees also increase when we increase the
two parameters (ILD for our re-ranked list is always higher). For parameter K, results show
that both hit rate and ILD drop when we increase K. For the personality related parameter λ,
we find that both the hit rate and ILD values increase when we keep increasing λ.
    After separately evaluating the influence of these parameters, we then made a final
comparison of the two lists. Results are shown in Table 5. We see that our re-ranked list
outperforms the initial list both in hit rate and ILD.

Table 5: Comparison of the two lists on the accuracy (Hit rate) and diversity (ILD) for N=10,
LS=50, K=100.

                               Initial List       Re-ranked List
        Hit rate@10                0.043                  0.141
           ILD@10                  0.390                  0.483

5.5     Online Evaluation
Offline evaluation metrics cannot always reflect the actual user satisfaction with
recommendations in real life. To further evaluate whether our personality-based
diversification algorithm can really enhance user satisfaction and users' perception of list
diversity, we conducted the following online evaluation.
   Similar to our pilot study (in Section 5.1), we also constructed a website (see footnote 4)
for the evaluation.

   5.5.1 Materials. Two materials are needed from the users beforehand: the Personality
Profile and the User Interests.

   Personality Profile. We again adopted the Big-Five Factor Model as the basic personality
model in our system. The Ten Items Personality Inventory (TIPI) is also used to extract these
five personality factors from users.

   User Interests & Recommendation. To generate the initial recommendation list, we request
users to offer their music interests in advance. In our online evaluation, we used the Spotify
Recommendation System based on their open Web APIs (see footnote 5) in order to provide
real-time recommendations. User interests are represented as 'seed information' in Spotify
Recommendation. Three kinds of seed information are used: artists, tracks, and genres. Spotify
restricts the total number of input seeds to at most 5. To ensure that the originally
generated recommendation list (which has 100 tracks) is already diverse enough, we use at
least 1 artist seed, 1 track seed, and 1 genre seed for every recommendation.

   5.5.2 Independent Variables. After we obtain the two materials from users, we then generate
the recommendations for them. In the evaluation, similar to the offline evaluation, we
generate two recommendation lists (initial list and re-ranked list) for each user; each list
contains 10 tracks. We adopted a within-subjects experimental design where the two
recommendation lists are displayed to the users at the same time (see Figure 2). Thus, the
independent variables here are the two recommendation lists. The order of presentation was
balanced between participants.

Figure 2: Example of the two recommendation lists shown to users. One is the initial list and
the other is the re-ranked list (in random order). Users can click on the button to have a
preview for each track. Users also need to choose whether they like the track or not. The
first two tracks are shown in this figure. In total, there are ten tracks in each list.

   5.5.3   Dependent Variables.
   Precision@10. In order to directly measure the precision of the recommendations, we ask the
users to rate each track as 'Like' or 'Dislike'. Tracks rated as 'Like' are considered as
relevant items. The Precision@10 for each list is computed as the proportion of relevant items
in the whole list.

   Diversity. For both lists, we also used ILD (Equation 1) to compute the diversity degrees.

   User Feedback. In addition to calculating the precision and ILD for each recommendation
list, we also ask users for some feedback on the two lists via a post-task questionnaire. Each
user needs to express their opinions on both lists in terms of the following three main
aspects:
     • Recommendation Quality (Q1 & Q2): “The items in List A/B recommended to me matched my
       interests.”
     • Recommendation Diversity (Q4 & Q5): “The items in List A/B recommended to me are
       diverse.”

4 Available at https://music-rs-personality-online.herokuapp.com
5 Spotify Recommendation:
https://developer.spotify.com/documentation/web-api/reference/browse/get-recommendations/


      • User Satisfaction (Q7 & Q8): “Overall, I am satisfied with the Recommendation List
        A/B”
   All of these questions are taken from the ResQue User-Centric Evaluation Framework [22] and
are answered on a 5-point Likert scale, from 1 to 5, meaning from "Disagree strongly" to
"Agree strongly". We then compute and compare the average ratings for each question on both
lists. Considering that users may give the same ratings for both lists, we added two more
sub-questions regarding the Recommendation Quality and Recommendation Diversity:
      • Recommendation Quality (Q3): “Which Recommendation List is more interesting to you
        (match more of your interests)?”
      • Recommendation Diversity (Q6): “Which Recommendation List is more diverse to you?”
These two questions are rated with categorical answers: “List A”, “List B”, or “Hard to tell”.

    5.5.4 Procedure Design. Similar to our pilot study, four main parts are included in the
website: the user basic information, personality test, recommendation and feedback, and user
comment. The user basic information, personality test, and user comment parts are similar to
those of the pilot study. For the recommendation and feedback part, we provide two channels to
obtain users' original interests: a) use their Spotify history; or b) type them in manually.
If users choose to use their Spotify listening history, we use two of their top-played
artists, two of their top-played tracks, and the top-played genre for generating the
recommendations. Users can alternatively choose to type in their interests manually. In that
case, we request users to type in at least one artist seed, one track seed, and one genre
seed.
    After we obtain users' music preferences, we feed these seeds into the Spotify
recommendation system to generate the initial recommendation list (100 tracks). The first list
L1 is constructed by directly taking the top-10 items from the initial list. The second list
L2 is generated based on our personality-based diversification algorithm. We select the top-50
tracks as the input list for re-ranking. To minimize any carryover effects, we show these two
lists in random order to users (displayed as List A and List B). For each track, users can
click on the play button to listen to a 30-second preview. The track name and the
corresponding artist name are also shown in the list. For each track in both lists, users need
to give a rating of 'Like' or 'Dislike'. After rating all 20 tracks, users are asked to fill
in the feedback questionnaire (see Section 5.5.3).

6     ONLINE EVALUATION RESULTS
To evaluate users' actual satisfaction with our personality-based diversification method, we
conducted this online evaluation.

6.1     Participants
We conducted our online evaluation with 25 participants recruited at a university.
Participants' ages ranged from 21 to 30 years. Table 6 summarizes their demographics.

Table 6: Demographic profiles of 25 participants for the online evaluation.

        Gender       Male (13); Female (8); Prefer Not to Answer (4)
        Age          21-30 (25)
        Education    College (4); Graduate School (21)

6.2     Feedback Questions
Figure 3 shows the comparison of the two lists on three aspects. We used a paired t-test for
questions on a 5-point Likert scale (Q1, Q2, Q4, Q5, Q7, and Q8), and we applied a Chi-Squared
Test for the questions with categorical answers (Q3 and Q6).

Figure 3: Full comparison for Recommendation Quality (Accuracy), Diversity and User
Satisfaction. Student t-Test is also used. p < 0.05.

   Recommendation Quality. Specifically, for recommendation quality (Q1 and Q2), the average
ratings for the two lists are 3.4 (initial list, std=0.98) and 4.12 (re-ranked list, std=0.65)
(t=-3.00, p=0.004). Q3 further compares the recommendation quality of the two lists with
categorical answers. Results show that 8.0% of users think the initial list is better in
matching their interests, 52.0% think the re-ranked list is better, and the other 40.0% think
it is hard to tell (for Chi-Squared Test, statistic=7.76, p < 0.05).

   Recommendation Diversity. Table 7 shows the Precision@10 and ILD@10 results for both lists.

Table 7: Precision@10 and ILD@10 for the two lists. Pair-wise t-tests significant at p < 0.05.

                        Initial List L1      Re-ranked List L2
        Precision@10    0.58 (std: 0.15)     0.668 (std: 0.14)
        ILD@10          0.48 (std: 0.06)     0.57 (std: 0.07)

   For perceived recommendation diversity (Q4 & Q5), the average ratings for the two lists are
3.28 (initial list, std=0.96) and 3.92 (re-ranked list, std=0.89) (t=-2.39, p=0.02). Q6
further compares the recommendation diversity of the two lists with categorical answers.
Results show that 16.0% of users think the initial list is more diverse, 48.0% think the
re-ranked list is more diverse, and the other 36.0% think it is hard to tell (for Chi-Squared
Test, statistic=3.92, p=0.14).

   User Satisfaction. For user satisfaction (Q7 & Q8), the average ratings for the two lists
are 3.36 (initial list, std=0.93) and 3.92 (re-ranked list, std=0.97) (t=-2.03, p < 0.05).
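
The Chi-Squared statistics reported for Q3 and Q6 can be checked with a short calculation. The
sketch below is only an illustration, not the authors' analysis code. It assumes a
goodness-of-fit test against equal expected counts over the three answers, and it assumes the
counts obtained by converting the reported percentages into numbers of participants out of 25
(for Q3 the "hard to tell" share is taken as 10 of 25, the value that reproduces the reported
statistic of 7.76). The function name is hypothetical.

```python
import math


def chi_squared_uniform(counts):
    """Goodness-of-fit statistic against equal expected counts, with a p-value for df = 2."""
    if len(counts) != 3:
        raise ValueError("the closed-form p-value used here only holds for 3 categories")
    expected = sum(counts) / len(counts)
    statistic = sum((c - expected) ** 2 / expected for c in counts)
    # With exactly 2 degrees of freedom, the chi-squared survival function is exp(-x / 2).
    p_value = math.exp(-statistic / 2)
    return statistic, p_value


# Assumed counts out of 25 participants: initial list, re-ranked list, hard to tell.
q3_counts = [2, 13, 10]  # 8%, 52%, 40%
q6_counts = [4, 12, 9]   # 16%, 48%, 36%

q3_stat, q3_p = chi_squared_uniform(q3_counts)
q6_stat, q6_p = chi_squared_uniform(q6_counts)
print(f"Q3: statistic={q3_stat:.2f}, p={q3_p:.3f}")
print(f"Q6: statistic={q6_stat:.2f}, p={q6_p:.2f}")
```

Under these assumptions the statistics come out at 7.76 for Q3 and 3.92 for Q6, so only Q3
falls below the 0.05 threshold.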


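As a worked check of the reported t values, the sketch below computes a t statistic from the
published means and standard deviations alone. It rests on several assumptions that are not
stated in the text: a two-sample Student t statistic with equal group sizes (n=25), and
reported std values treated as population standard deviations. A paired statistic cannot be
recomputed without the per-participant ratings, so this only shows numerical consistency and
does not establish which test was actually run. The function name is hypothetical.

```python
import math


def two_sample_t_from_summary(mean_a, std_a, mean_b, std_b, n):
    """Student two-sample t statistic for two groups of equal size n, from summary values.

    std_a and std_b are treated as population standard deviations (divide by n)
    and converted to unbiased sample variances (divide by n - 1). This is an assumption.
    """
    var_a = std_a ** 2 * n / (n - 1)
    var_b = std_b ** 2 * n / (n - 1)
    pooled_var = (var_a + var_b) / 2  # equal n, so the pooled variance is the plain mean
    standard_error = math.sqrt(pooled_var * 2 / n)
    return (mean_a - mean_b) / standard_error


N_PARTICIPANTS = 25
# (initial mean, initial std, re-ranked mean, re-ranked std) as reported in Section 6.2.
summary = {
    "quality": (3.40, 0.98, 4.12, 0.65),
    "diversity": (3.28, 0.96, 3.92, 0.89),
    "satisfaction": (3.36, 0.93, 3.92, 0.97),
}
t_values = {
    aspect: two_sample_t_from_summary(*values, N_PARTICIPANTS)
    for aspect, values in summary.items()
}
for aspect, t in t_values.items():
    print(f"{aspect}: t={t:.2f}")
```

Under these assumptions each value lies within about 0.01 of the reported t=-3.00, t=-2.39 and
t=-2.03; the small gaps are what rounding of the published means and deviations would produce.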
7    DISCUSSION AND LIMITATION
From the online evaluation results, we see that our re-ranked recommendation list outperforms
the initial recommendation list in all three aspects (recommendation quality, diversity, and
user satisfaction). For the two categorical questions Q3 and Q6, the results for Q3 are in
line with the results shown in Figure 3. For Q6, however, the p-value for the Chi-Squared Test
is larger than 0.05, which means that there is no significant difference for Q6 when we asked
users which list is more diverse to them. The reason behind this phenomenon may lie in our
limited sample size. The precision of the two lists does not differ greatly (around one
relevant track). However, considering that our algorithm has raised the diversity level of the
recommendations at the same time, we can still say that the re-ranked list is better from the
users' perspective and that our personality-based diversification algorithm has enhanced the
diversity adjusting strategy in music recommendations.
   One limitation of our research lies in the limited sample size both in the pilot study and
in the online evaluation. If more participants were recruited in our pilot study, the
correlation between personality factors and diversity needs might be stronger. Similarly,
including more users in our online evaluation might also yield better results. Future
researchers are encouraged to repeat our research with more participants. Another limitation
is that we did not include more features (e.g. more audio features like loudness) in our pilot
study.

8    CONCLUSION
In this paper, we proposed a solution to address the research gap between research in
diversity-based recommender systems and personality-based recommender systems. We proposed an
algorithm to adjust the diversity degrees in music recommendations adaptively for users with
different personalities. The adjustment was based on a pilot user study which explored the
relationship between users' personality factors and their diversity needs on music
preferences. To assess the effectiveness of our algorithm, we conducted both offline and
online evaluations. Results suggest that our diversification method not only increases the
diversity degrees for recommendations, but also gains more user satisfaction.
   In future work, more (audio) features with a larger participant pool will be studied.
Instead of using the explicit personality test, we also plan to try implicit personality
extraction methods (e.g. via social media). Moreover, besides the re-ranking algorithm, we
also plan to try different diversification strategies (e.g. optimization based
diversification) with personality to check whether they would yield better results.

REFERENCES
 [1] Immanuel Bayer. 2016. fastFM: A Library for Factorization Machines. Journal of Machine
     Learning Research 17, 184 (2016), 1–5.
 [2] Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. 2011. The
     Million Song Dataset. In Proceedings of the 12th International Conference on Music
     Information Retrieval (ISMIR 2011).
 [3] Keith Bradley and Barry Smyth. 2001. Improving recommendation diversity. In Proceedings
     of the Twelfth Irish Conference on Artificial Intelligence and Cognitive Science,
     Maynooth, Ireland. 85–94.
 [4] Jaime Carbonell and Jade Goldstein. 1998. The use of MMR, diversity-based reranking for
     reordering documents and producing summaries. In Proceedings of the 21st annual
     international ACM SIGIR conference on Research and development in information retrieval.
     ACM, 335–336.
 [5] Òscar Celma Herrada. 2009. Music recommendation and discovery in the long tail. (2009).
 [6] Li Chen, Wen Wu, and Liang He. 2013. How personality influences users' needs for
     recommendation diversity?. In CHI'13 Extended Abstracts on Human Factors in Computing
     Systems. ACM, 829–834.
 [7] Li Chen, Wen Wu, and Liang He. 2016. Personality and Recommendation Diversity. In
     Emotions and Personality in Personalized Services. Springer, 201–225.
 [8] Paolo Cremonesi, Yehuda Koren, and Roberto Turrin. 2010. Performance of recommender
     algorithms on top-n recommendation tasks. In Proceedings of the fourth ACM conference on
     Recommender systems. ACM, 39–46.
 [9] Tommaso Di Noia, Vito Claudio Ostuni, Jessica Rosati, and Paolo Tomeo. 2014. An analysis
     of users' propensity toward diversity in recommendations. In Proceedings of the 8th ACM
     Conference on Recommender systems. ACM, 285–288.
[10] Daniel M Fleder and Kartik Hosanagar. 2007. Recommender systems and their impact on sales
     diversity. In Proceedings of the 8th ACM conference on Electronic commerce. ACM, 192–199.
[11] Ishan Ghanmode and Nava Tintarev. 2018. MovieTweeters: An Interactive Interface to
     Improve Recommendation Novelty. In IntRS@RecSys.
[12] Samuel D Gosling, Peter J Rentfrow, and William B Swann. 2003. A very brief measure of
     the Big-Five personality domains. Journal of Research in personality 37, 6 (2003),
     504–528.
[13] Rong Hu and Pearl Pu. 2010. A study on user perception of personality-based recommender
     systems. User Modeling, Adaptation, and Personalization (2010), 291–302.
[14] Rong Hu and Pearl Pu. 2010. Using personality information in collaborative filtering for
     new users. Recommender Systems and the Social Web 17 (2010).
[15] Rong Hu and Pearl Pu. 2011. Enhancing collaborative filtering systems with personality
     information. In Proceedings of the fifth ACM conference on Recommender systems. ACM,
     197–204.
[16] Jianmin Jia, Gregory W Fischer, and James S Dyer. 1998. Attribute weighting methods and
     decision quality in the presence of response error: a simulation study. Journal of
     Behavioral Decision Making 11, 2 (1998), 85–105.
[17] Oliver P John and Sanjay Srivastava. 1999. The Big Five trait taxonomy: History,
     measurement, and theoretical perspectives. Handbook of personality: Theory and research
     2, 1999 (1999), 102–138.
[18] Jayachithra Kumar and Nava Tintarev. 2018. Using visualizations to encourage blind-spots
     exploration. In IntRS@RecSys.
[19] Robert R McCrae and Oliver P John. 1992. An introduction to the five-factor model and its
     applications. Journal of personality 60, 2 (1992), 175–215.
[20] Sean M McNee, John Riedl, and Joseph A Konstan. 2006. Being accurate is not enough: how
     accuracy metrics have hurt recommender systems. In CHI'06 extended abstracts on Human
     factors in computing systems. ACM, 1097–1101.
[21] Eli Pariser. 2011. The filter bubble: What the Internet is hiding from you. Penguin UK.
[22] Pearl Pu, Li Chen, and Rong Hu. 2011. A user-centric evaluation framework for recommender
     systems. In Proceedings of the fifth ACM conference on Recommender systems. ACM, 157–164.
[23] Steffen Rendle. 2010. Factorization machines. In Data Mining (ICDM), 2010 IEEE 10th
     International Conference on. IEEE, 995–1000.
[24] Peter J Rentfrow and Samuel D Gosling. 2003. The do re mi's of everyday life: the
     structure and personality correlates of music preferences. Journal of personality and
     social psychology 84, 6 (2003), 1236.
[25] Barry Smyth and Paul McClave. 2001. Similarity vs. Diversity. In Proceedings of the 4th
     International Conference on Case-Based Reasoning: Case-Based Reasoning Research and
     Development (ICCBR '01). Springer-Verlag, London, UK, 347–361.
[26] Nava Tintarev and Judith Masthoff. 2013. Adapting recommendation diversity to openness to
     experience: A study of human behaviour. In International Conference on User Modeling,
     Adaptation, and Personalization. Springer, 190–202.
[27] Marko Tkalcic and Li Chen. 2015. Personality and Recommender Systems. Recommender Systems
     Handbook (Jan. 2015).
[28] Marko Tkalcic, Matevz Kunaver, Andrej Košir, and Jurij Tasic. 2011. Addressing the new
     user problem with a personality based user similarity measure. In First International
     Workshop on Decision Making and Recommendation Acceptance Issues in Recommender Systems
     (DEMRA 2011). 106.
[29] Marko Tkalcic, Matevz Kunaver, Jurij Tasic, and Andrej Košir. 2009. Personality based
     user similarity measure for a collaborative recommender system. In Proceedings of the 5th
     Workshop on Emotion in Human-Computer Interaction-Real world challenges. 30–37.
[30] Saúl Vargas. 2011. New approaches to diversity and novelty in recommender systems. In
     Fourth BCS-IRSG symposium on future directions in information access (FDIA 2011),
     Koblenz, Vol. 31.
[31] Saúl Vargas and Pablo Castells. 2013. Exploiting the diversity of user preferences for
     recommendation. In Proceedings of the 10th conference on open research areas in
     information retrieval. 129–136.
[32] Wen Wu, Li Chen, and Liang He. 2013. Using personality to adjust diversity in recommender
     systems. In Proceedings of the 24th ACM Conference on Hypertext and Social Media. ACM,
     225–229.
[33] Cai-Nicolas Ziegler, Sean M McNee, Joseph A Konstan, and Georg Lausen. 2005. Improving
     recommendation lists through topic diversification. In Proceedings of the 14th
     international conference on World Wide Web. ACM, 22–32.