=Paper=
{{Paper
|id=Vol-2225/paper2
|storemode=property
|title=A Diversity Adjusting Strategy with Personality for Music Recommendation
|pdfUrl=https://ceur-ws.org/Vol-2225/paper2.pdf
|volume=Vol-2225
|authors=Feng Lu,Nava Tintarev
|dblpUrl=https://dblp.org/rec/conf/recsys/LuT18
}}
==A Diversity Adjusting Strategy with Personality for Music Recommendation==
Feng Lu
Delft University of Technology
Delft, the Netherlands
F.Lu-1@student.tudelft.nl

Nava Tintarev
Delft University of Technology
Delft, the Netherlands
n.tintarev@tudelft.nl
ABSTRACT
Diversity-based recommender systems aim to select a wide range of relevant content for users, but the diversity needs of users with different personalities are rarely studied. Similarly, research on personality-based recommender systems has primarily focused on the ‘cold-start problem’; few previous works have investigated how personality influences users’ diversity needs. This paper combines these two branches of research: re-ranking for diversification, and improving accuracy using personality traits. Anchored in the music domain, we investigate how personality information can be used to adjust the degree of diversity for people with different personalities. We propose a personality-based diversification algorithm that adapts the diversity adjusting strategy to each person's personality in music recommendations. Our offline and online evaluation results demonstrate that the proposed method is an effective way to generate personalized recommendation lists that not only have relatively higher diversity and accuracy, but also lead to increased user satisfaction.

CCS CONCEPTS
• Information systems → Recommender systems;

KEYWORDS
Recommender Systems, Diversity, Personality, Music Recommendation, Re-ranking

1 INTRODUCTION
As recommender systems have moved beyond accuracy for evaluation, metrics such as diversity and novelty have been proposed to evaluate the quality of recommender systems [20]. Such research [11, 18] also helps to address the ‘filter bubble’ problem [21]. Research on the diversity metric has contributed to the emergence of diversity-based recommender systems, which endeavor to achieve an optimal balance between accuracy and diversity [25]. In addition, researchers suggest that there is a connection between people's stable personality traits and their tastes and preferences [24]; the idea of personality-based recommender systems follows from this observation.

However, current research on personality-based and diversity-based recommender systems is mostly separated [7]. In many diversity-based recommender systems [9, 31, 33], researchers set a fixed balance between accuracy and diversity for all users; adjusting to the diversity needs of users with different personalities is rarely studied. Similarly, research on personality-based recommender systems [14, 28] has mainly utilized personality information to improve the calculation of user similarity so as to mitigate the ‘cold-start problem’. This paper combines these two branches of research.

To address this gap, we investigate how personality information can be used to adjust the degree of diversity for people with different personalities in music recommender systems. Our research questions are therefore:
• RQ1: Is there an underlying relationship between people's personality and their needs for recommendation diversity in the music domain?
• RQ2: What is the effect (on diversity and accuracy) of adjusting the degree of diversity in music recommender systems based on users' personality information?

First, to address RQ1, we conducted a pilot user study to investigate whether there is a relationship between users' personality information and the diversity of their music preferences (Section 5.1). A relation model is built from the pilot study results (Section 5.2). To address RQ2, we propose a personality-based diversification algorithm based on this relation model (Section 4). The proposed method adjusts the degree of diversity adaptively in music recommendations according to each user's personality. Both offline (Section 5.3) and online (Section 5.5) studies evaluate the efficiency and effectiveness of the proposed algorithm. We conclude with limitations and suggestions for future work in Sections 7 and 8.

2 RELATED WORK
Existing research on personality-based and diversity-based recommender systems is mostly separated [7]. We first discuss research on diversity-based recommender systems, and then related research on personality-based recommender systems.

2.1 Diversity-based Recommender Systems
Many current diversity-oriented recommender systems [9, 31, 33] adopt a fixed strategy to adjust the degree of diversity for all users: they pre-define a score function that balances diversity and accuracy with a parameter λ, and re-rank the generated recommendation list according to the calculated scores. For instance, Ziegler et al. [33] proposed the topic diversification approach for balancing recommendation lists, a heuristic algorithm based on taxonomy similarity that increases recommendation diversity. To balance the accuracy of suggestions against the user's extent of interest in specific topics, they defined a weighting parameter to control the impact of two ranking lists, one ranking the items that are similar to the user's attribute-based preferences and the other ranking the items in reverse. Vargas et al. [31] proposed a similar re-ranking diversification method based on sub-profiles of users, which also uses a parameter λ to control the balance between the initial ranking score and the diversity score.
IntRS Workshop, October 2018, Vancouver, Canada Feng Lu and Nava Tintarev
Building on this work, Di Noia et al. [9] adjusted the diversity function adaptively according to each user's diversity inclination, calculated as the entropy of the user's preferences. However, the balancing parameter λ in their objective function is still fixed for all users: although the diversity function itself may be adjusted appropriately, the balance between similarity and diversity remains the same for everyone. All of these works successfully increased the diversity of recommendations, but they rarely consider that users with different personalities may have different diversity needs.

2.2 Personality-based Recommender Systems
Recent works have explored the relationship between personality traits and user preferences in recommender systems [15, 28, 29]. Studies also show that personality influences the human decision-making process and interests in music and movies [6, 13, 24], which implies that personality information should be considered if we want to deliver personalized recommendations. From another angle, since users' attitudes towards new or diverse experiences vary considerably [26], personality can also be considered a key aspect when incorporating novelty and diversity into recommendations; in other words, the degree of diversity in presenting recommended items can also be personalized.

In contrast, most research [13, 15] on personality-based recommender systems is designed to improve the user similarity calculation in order to address the cold-start problem; the degree of diversity is usually the same for all users. Such work rarely considers that different users might also hold different attitudes towards item diversity, which means that personality information can also be useful when adjusting diversity degrees in recommender systems. Since recent studies have already shown that personality can affect people's needs for diversity in movie recommendations [6, 32] and book recommendations [26], people with different personalities may also need different degrees of diversity in music recommendations.

2.3 Addressed Research Gap
This paper addresses the gap between the two research branches by combining them. The main contributions of our research are threefold:
• We investigated the relation between users' personality factors and the diversity of their music preferences, and found certain positive correlations between the two.
• We proposed a personality-based re-ranking diversification algorithm that adaptively sets different diversity levels for users based on their personalities in music recommendations.
• We evaluated this strategy in both offline and online studies, which suggest that the approach is effective for improving diversity, accuracy, and user satisfaction.

To the best of our knowledge, we are the first to conduct such a systematic user study on the correlation between personality and users' diversity needs in the music domain. In the movie domain, Wu et al. and Chen et al. [6, 32] conducted similar research; however, our research differs in both the domain and the algorithms applied.

3 DIVERSITY AND PERSONALITY IN RECOMMENDATIONS
Before proceeding to our research steps, we first explain two key concepts as they are defined in this paper: diversity and personality. In this section, we focus on the diversity metric, the personality model, and the corresponding extraction methods applied in our research.

3.1 Diversity
Diversity is usually considered the inverse of similarity, and refers to recommending to users a set of items that are different from each other [20]. The concept has been introduced into the field of recommender systems as one possible way to address the over-fitting problem and to increase user satisfaction. In this paper, we focus on Intra-List Diversity (ILD). Research on the definition and evaluation of Intra-List Diversity starts with Bradley and Smyth [3], who define diversity as the average pairwise distance (dissimilarity) between all items in the recommendation set, calculated as follows:

    ILD(R) = [ Σ_{i=1..n} Σ_{j=i..n} (1 − Similarity(c_i, c_j)) ] / [ n(n − 1)/2 ]    (1)

where c_1, ..., c_n are the items in the recommended list R. Other metrics have also been proposed, such as Vargas et al.'s ILD metric [30] and the Gini-coefficient measurement [10]. In this work we mainly refer to Equation 1 for the diversity metric.

3.2 Personality
Personality represents people's differences in their enduring emotional, interpersonal, experiential, attitudinal, and motivational styles [17]. Over the last few decades, a number of personality models and acquisition methods have been proposed. In this research, we focus on the Big-Five Factor Model and explicit acquisition methods (specifically, the Ten-Item Personality Inventory).

Personality Model. We adopted one of the most commonly used personality models, the Big-Five Factor Model (FFM) [19], which describes personality along five factors: Openness to Experience (O), Conscientiousness (C), Extroversion (E), Agreeableness (A), and Neuroticism (N). The usability of this model in recommender systems is discussed in the Recommender Systems Handbook [27].

Extraction Method. Current acquisition methods for personality fall into two groups: explicit methods (using questionnaires) and implicit methods (extracting personality from social networks). We used an explicit method, considering that it is more accurate than implicit methods [27]. Specifically, we adopted a short personality test, the Ten Item Personality Inventory (TIPI) [12], since it takes users little time to finish. In TIPI, each personality factor of the FFM is assessed by two questions; for instance, extraversion is assessed by ‘Extraverted, enthusiastic’ and ‘Reserved, quiet’. Each of the ten questions is rated from 1 to 7, and the ratings are then mapped into five personality factor scores.
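As a concrete sketch, the mapping from ten TIPI ratings to five factor scores can be written out as follows. The item-to-factor key and the reverse scoring (8 − rating for reverse-keyed items) follow the standard published TIPI scoring instructions rather than anything stated in this paper, and the function names are our own:

```python
# Sketch of TIPI scoring: ten 1-7 ratings -> five Big-Five factor scores.
# Item order and reverse-keyed items follow the standard TIPI key;
# helper names are illustrative, not from the paper.

def reverse(rating: int) -> int:
    """Reverse-score a 1-7 Likert rating."""
    return 8 - rating

def tipi_scores(ratings):
    """ratings: list of ten 1-7 answers, in TIPI question order."""
    r = ratings
    return {
        "Extraversion":        (r[0] + reverse(r[5])) / 2,
        "Agreeableness":       (reverse(r[1]) + r[6]) / 2,
        "Conscientiousness":   (r[2] + reverse(r[7])) / 2,
        "Emotional Stability": (reverse(r[3]) + r[8]) / 2,
        "Openness":            (r[4] + reverse(r[9])) / 2,
    }
```

For example, a respondent answering 7 to ‘Extraverted, enthusiastic’ and 2 to ‘Reserved, quiet’ obtains an Extraversion score of (7 + 6)/2 = 6.5.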
The scores for each factor can be further mapped into four personality levels: Low, Medium Low, Medium High, and High [12].

Algorithm 1: The diversification algorithm generating the re-ranked list R from the original list O.
Input: original recommendation list O (length 5N), target list size N, personality-related parameters λ, θ_1, θ_2, ..., θ_n
Output: top-N re-ranked list R
1: R(1) ⇐ O(1)
2: while |R| < N do
3:     Div_overall(c, R) = Σ_{i=1..n} θ_i · Div_i(c, R)
4:     c* = argmin_{c ∈ O \ R} Obj(c, R) = Sim(c, P) · (1 − λ) + λ · Div_overall(c, R)
5:     R = R ∪ {c*}
6:     O = O \ {c*}
7: end while
8: return R

Figure 1: Research Steps

4 PERSONALITY-BASED RE-RANKING DIVERSIFICATION
In this section, we discuss the core of our work: the personality-based diversification algorithm, which consists of a) an objective function and b) personality-related parameters. In a later section we describe the pilot user study in which we identified the relationship between users' personality information and the diversity of their music preferences (Section 5.1); the results of that pilot study inform the parameters for both a) and b). Figure 1 outlines our overall research methodology, including the offline and online studies used to evaluate our algorithm.

Normally, the recommendation process of a recommender system can be divided into two steps: first, the system generates predicted values for all unrated items for each user; second, these items are sorted in descending order of their predicted values. To improve the diversity of the recommendations, we apply re-ranking to this second step. We borrow the idea of the topic diversification method presented in Ziegler et al.'s work [33]. Specifically, we use greedy heuristics, which have been demonstrated to be efficient and effective [9, 33]. The diversification procedure is shown in Algorithm 1.

This greedy algorithm iteratively selects an item from the original list O (generated directly by a recommender system) and puts it at the end of the current re-ranked list R, until the size of R reaches N (N = 10 in our case) and the re-ranking process is complete. The core of the algorithm is the objective function (line 4, Algorithm 1), which controls the balance between similarity and diversity: at each re-ranking step, the algorithm picks the item that minimizes the objective function and places it at the end of the current diversified re-ranked list. The target list is a re-ranked list of the N top-ranked items (the top-N items). For the re-ranking to make the final list sufficiently diverse, the input list should be much larger than the final re-ranked list of N items; in our algorithm, we use 5N items for the input list.

The balancing parameter λ in the objective function (line 4, Algorithm 1) is controlled by personality factors in our algorithm. To adjust the diversity degrees more flexibly, we also introduce parameters θ_1, θ_2, ..., θ_n to control the computation of the overall diversity. Both kinds of parameters (λ and θ_1, θ_2, ..., θ_n) are determined by the personality factors.

4.1 Objective Function
The core of the algorithm lies in the re-ranking objective function in line 4 (Algorithm 1), which is adapted from Maximal Marginal Relevance (MMR) [4]:

    Obj(c, R) = Sim(c, P) · (1 − λ) + λ · Div_overall(c, R)    (2)

The first part of the function, Sim(c, P), considers the similarity of the item c to the user's initial interests P. In our work, we computed the similarity value as the rank of item c in the list sorted by predicted rating in descending order. We did not use the predicted ratings directly, considering that such predicted values may not be available for all recommender systems (e.g., Spotify). Thus, our Sim(c, P) function becomes:

    Sim(c, P) = Rank(c, O)    (3)

where Rank(c, O) represents the rank of item c in the original recommendation list O generated by some recommendation algorithm.

The other part of the function, Div_overall(c, R), defines the overall diversity of the item c with respect to the items selected so far in the re-ranked list R. Here, we define the overall diversity as the weighted combination of the diversity degrees of several attributes (e.g., track attributes such as artists and genres in music recommendation). As shown in line 3 (Algorithm 1), the diversity function is defined as:

    Div_overall(c, R) = Σ_{i=1,2,...,n} θ_i · Div_i(c, R)    (4)

where n is the total number of attributes used for computing the overall diversity Div_overall(c, R), θ_i is the weight of each attribute's diversity degree, and Div_i(c, R) is the diversity degree for attribute i, which is defined as ILD (Equation 1).
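Algorithm 1 translates almost line for line into Python. In this sketch the per-attribute diversity callables in `div_fns` (standing in for the ILD-based Div_i of Equation 4) and the item representation are our assumptions; the objective follows line 4 exactly as printed, with Sim(c, P) = Rank(c, O) from Equation 3:

```python
# Sketch of Algorithm 1: greedy personality-based re-ranking.
# `original` is the list O (length 5 * n_target), already sorted by
# predicted relevance; items are assumed distinct. `div_fns` maps
# attribute names to callables Div_i(c, R); `thetas` holds the
# (normalized) attribute weights; `lam` is the balancing parameter.

def rerank(original, n_target, lam, thetas, div_fns):
    remaining = list(original)
    reranked = [remaining.pop(0)]              # line 1: R(1) <= O(1)
    while len(reranked) < n_target:            # line 2
        def objective(c):                      # lines 3-4
            div_overall = sum(                 # Eq. 4: weighted diversity
                thetas[name] * fn(c, reranked)
                for name, fn in div_fns.items()
            )
            rank = original.index(c) + 1       # Eq. 3: Sim(c, P) = Rank(c, O)
            return rank * (1 - lam) + lam * div_overall
        best = min(remaining, key=objective)   # argmin, as in line 4
        reranked.append(best)                  # line 5: R = R ∪ {c*}
        remaining.remove(best)                 # line 6: O = O \ {c*}
    return reranked                            # line 8
```

With λ = 0 the objective reduces to the original ranking, so the re-ranked list is just the top-N of O; increasing λ lets the attribute diversity terms reorder the tail of the list.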
Table 1: Mapping from personality factor level to personality-related parameters.

Personality Factor Level    Low    Medium Low    Medium High    High
λ / θ_1 / θ_2 / ... / θ_n   0.2    0.4           0.6            0.8

Table 2: Demographic profiles of the pilot study (numbers in brackets are counts of users).

Age              ≤20 (5); 21-30 (83); 31-40 (32); 41-50 (18); 51-60 (5); ≥60 (5)
Gender           Male (96); Female (47); Prefer not to say (5)
Nationality      Asia (53); Europe (38); South America (42); North America (12); Africa (3)
Education Level  Graduate School (83); College (45); High School (20); Others (2)

In our experiment, we used three attributes (n = 3) that are closely correlated with the personality factors found in our pilot user study, which we introduce in Section 5.2. The function of the control parameter λ is explained in Section 4.2 and at the end of Section 5.2.

4.2 Personality-related Parameters
So far we have defined our similarity function and diversity function, but we have not yet incorporated the personality information. In our algorithm, the personality factors exert their influence through the parameters (λ, θ_1, θ_2, ..., θ_n) of the objective function. Parameter λ directly affects the balance between similarity and diversity, and thus controls the degree of overall diversity. Parameters θ_1, θ_2, ..., θ_n control the diversity degrees of the individual attributes. As mentioned in Section 3, each personality factor can be divided into four levels: Low, Medium Low, Medium High, and High. For each correlation between a personality factor and an overall or attribute diversity need, we define the mapping function given in Table 1.

For θ_1, θ_2, ..., θ_n, we take one more computation step: normalization. The final θ_1/θ_2/.../θ_n are computed as:

    θ_i ← θ_i / Σ_{j=1,2,...,n} θ_j,    i = 1, 2, ..., n    (5)

Note that, in order to perform the mapping, we need to know the correlation between each personality factor and users' overall/attribute diversity needs beforehand. Parameter λ is decided by the personality factor that has a positive correlation with the overall diversity needs (in our case, Emotional Stability), while parameters θ_1, θ_2, ..., θ_n are decided by the personality factors that are correlated with the attribute diversity needs. The specific personality factor corresponding to each parameter is given in Section 5.2.

5 EXPERIMENT
Following our research steps in Figure 1, we first conducted a pilot study to explore possible correlations between users' personality factors and the diversity of their music preferences. Our diversity adjusting strategy (Section 4) is based on the findings of this pilot study. To evaluate the efficiency and effectiveness of the proposed personality-based diversification algorithm, we conducted both offline and online evaluations. For reasons of space, we discuss the pilot study and offline evaluation briefly; results for both are given in this section, and the results of the online evaluation in the next section.

5.1 Pilot Study
To address our first research question, we conducted a pilot study in which we collected users' personality information and their music preferences (preferred songs). We designed a website¹ for the user survey. The survey contains four main parts:
• User's basic information: collecting users' demographic information such as age range and gender.
• Personality test: the personality test in our pilot study is conducted via the TIPI, in which users answer ten self-assessment questions. Each question is rated from 1 to 7, from ‘Disagree strongly’ to ‘Agree strongly’ (e.g., "I see myself as extraverted, enthusiastic").
• Music preference collection: users' music preferences are collected by means of the Spotify Web API, through which users are asked to provide at least 20 preferred songs that they normally listen to and that best describe their music taste. Users also rate their selected songs from 1 to 5 (least preferred to most preferred).
• User comments: a free-text comment section.

5.2 Pilot Study Results
We spread the survey via two channels: crowdsourcing platforms and students at several universities (e.g., TU Delft, Netherlands; EPFL, Switzerland; and Lanzhou University, China). The majority (around 80%) of the participants were recruited from Crowdflower (now called Figure Eight)². To ensure the quality of the collected data, we inserted test questions into the survey to help filter suspicious responses. On the Crowdflower platform, workers also had to submit their contributor IDs and the verification codes displayed at the end of the survey. These verification methods helped us remove a number of careless participants, especially from the crowdsourcing platform. Results of the user survey are shown below.

Participants. 148 participants were recruited; their demographic properties are shown in Table 2.

Relation between Personality Factors and Single-Attribute Diversity of Music Preference. When studying the correlation between personality factors and each attribute's diversity degree, we first calculated the personality scores for each user from the TIPI question scores. Then we computed the diversity scores for each attribute within the list of tracks a user selected, using the ILD metric (Equation 1). For each track, we chose six attributes for computing specific diversity degrees: Release Times, Artists, Number of Artists, Genres, and two audio features (Tempo and Key).

¹ Available at: https://music-rs-personality.herokuapp.com
² Crowdflower: https://www.figure-eight.com
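The per-attribute analysis above can be sketched as follows, using Spearman's rank correlation as the association measure. The exact-match similarity inside the ILD of Equation 1 (two tracks count as similar only if they share the attribute value), the data layout, and the use of SciPy's `spearmanr` are illustrative assumptions, not the paper's exact implementation:

```python
# Sketch: per-attribute ILD (Equation 1) and its Spearman correlation
# with a personality factor across users. Similarity model and data
# layout are assumptions for illustration.
from scipy.stats import spearmanr

def attribute_ild(tracks, attr):
    """Average pairwise dissimilarity of one attribute over a track list."""
    n = len(tracks)
    dissim = sum(
        1.0 if tracks[i][attr] != tracks[j][attr] else 0.0
        for i in range(n) for j in range(i + 1, n)
    )
    return dissim / (n * (n - 1) / 2)

def correlate(users, attr, factor):
    """Spearman rho between a personality factor and per-user attribute ILD."""
    scores = [u["personality"][factor] for u in users]
    divs = [attribute_ild(u["tracks"], attr) for u in users]
    rho, p_value = spearmanr(scores, divs)
    return rho, p_value
```

A user whose 20 selected tracks all share one artist would get an Artists diversity of 0, while a list of 20 distinct artists would score 1.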
Table 3: Spearman correlation coefficients between personality factors/demographic values and diversity degrees w.r.t. single attributes (*p-value < 0.05, **p-value < 0.01). E: Extraversion, A: Agreeableness, C: Conscientiousness, ES: Emotional Stability, O: Openness.

                       E       A       C       ES      O       Gender   Age
Div(Release times)    -0.03   -0.12    0.01    0.11   -0.15    0.00     0.28**
Div(Artists)           0.10    0.09    0.11    0.22** -0.04   -0.03    -0.16
Div(Artists number)    0.00    0.25**  0.13    0.15    0.07    0.06    -0.14
Div(Genres)            0.07    0.00   -0.01    0.25**  0.06    0.06     0.03
Div(Tempo)             0.11    0.09    0.11    0.24**  0.08   -0.17*   -0.02
Div(Key)               0.21**  0.05    0.06    0.17*   0.08   -0.13    -0.10

Table 4: Spearman correlation coefficients between personality factors/demographic values and overall diversity (**p-value < 0.01). Factor abbreviations as in Table 3.

                E      A      C      ES      O      Gender   Age
Overall_Div1    0.11   0.09   0.08   0.31**  0.03   0.01    -0.05
Overall_Div2    0.11   0.08   0.08   0.28**  0.01   0.00    -0.10
Overall_Div3    0.12   0.06   0.07   0.29**  0.02   0.00    -0.09

Spearman's rank correlation coefficient was used to calculate the correlation between the five personality factors and the diversity scores for each attribute. In addition, considering that some demographic values might also affect users' diversity needs when delivering recommendations, we included two demographic values (age and gender) in the comparison. Results are shown in Table 3.

Relation between Personality Factors and Overall Diversity. Besides the correlations between personality factors and single-attribute diversity scores, we also computed the correlation between the overall diversity and users' personality values. Considering that different users usually place different weights on attributes (e.g., some users may consider the diversification of Artists the most important), we assigned three different sets of weights to the six attributes (Release Times, Artists, Number of Artists, Genres, Tempo, Key), following [16]: Overall_Div1, equal weights (1/6, 1/6, 1/6, 1/6, 1/6, 1/6); Overall_Div2, rank-order centroid (ROC) weights (0.41, 0.24, 0.16, 0.10, 0.06, 0.03); Overall_Div3, rank-sum (RS) weights (0.29, 0.24, 0.19, 0.14, 0.09, 0.05).

From Tables 3 and 4, we drew four important correlations. For single-attribute diversity, we find:
• C1. Extraversion has a positive correlation with the diversity of Key.
• C2. Agreeableness has a positive correlation with the diversity of Artists Number.
• C3. Emotional Stability has a positive correlation with the diversity of Artists, Genres, and Tempo.
We also find: C4. Emotional Stability has a positive correlation with the overall diversity degree.

These correlations can then be used to set the parameters of our diversification algorithm (c.f. Section 4.2). Specifically, λ is adjusted according to the user's Emotional Stability level. We used three attribute diversities in the later experiment (see Section 5.5): Genres, Artists Number, and Key; thus θ_1, θ_2, and θ_3 are adjusted according to the user's Emotional Stability, Agreeableness, and Extraversion levels, respectively.

5.3 Offline Evaluation
Since our diversification algorithm is built on a re-ranking algorithm (a diversification method that re-orders the recommendation list), its final diversity degree is affected by re-ranking-related parameters such as the size of the final top-N re-ranked list (N). The personality-related parameters (λ, θ_1, θ_2, θ_3; see Section 4.2) also greatly influence the final diversity of the recommendation lists. We therefore conducted a series of offline evaluations to test the influence of different parameters:
• the size of the final top-N re-ranked list (N);
• the size of the input list (LS);
• the number of unrated items used for recommendation (K);
• the personality-related parameter λ.

To generate initial recommendations of high quality, we used a state-of-the-art recommendation algorithm, Factorization Machines (specifically, fastFM [1]) [23]. To train the FM sufficiently, we combined our pilot study dataset (the data of the 148 users from Section 5.2) with a complementary dataset with much larger user data: The Echo Nest Taste Profile Subset (TPS)³ [2]. We performed some data selection beforehand: we first ruled out tracks that had been listened to only once, and then ruled out users who had listened to fewer than 100 tracks in total. Since the TPS dataset only contains track play counts, we further mapped the play counts to integer ratings (1-5) using the rating-mapping algorithm mentioned in [5].

We then split our pilot study dataset into two subsets: training set M1 and testing set T. T contains the top-5 rated tracks (all ratings ≥ 4) for each user, which we consider the relevant items for that user. The remaining user data of the pilot study dataset (M1) is combined with the TPS subset (M2) to form the whole training set M. After training the FM, we used it to generate recommendations for the users in the testing set T.

Hit Rate. The first metric in our offline evaluation is an accuracy measure. Hit rate was chosen because of the large item count (number of distinct tracks) and the small listening history per user [8]. Instead of using all unseen items (all items not used for training for each user) for prediction and counting the number of ‘hits’ (relevant items) in the top-N list, our testing method evaluates each relevant item (the known top-5 rated relevant items for each user) in the testing set separately, by combining it with K (we used K = 100) other items that the user has not rated.

³ The Echo Nest Taste Profile Subset: http://labrosa.ee.columbia.edu/millionsong/tasteprofile, extracted in July 2018
Table 5: Comparison of the two lists on accuracy (hit rate) and diversity (ILD) for N = 10, LS = 50, K = 100.

              Initial List    Re-ranked List
Hit rate@10   0.043           0.141
ILD@10        0.390           0.483

We assume that these unrated items are not of interest to user u, and treat them as the irrelevant items. The task of the FM is then to rank these K+1 items for each user. For each user, we generate two recommendation lists: the initial recommendation list (the top-N items of the list generated by the FM) and our re-ranked list. We then check whether the relevant item appears in each of the two lists: if so, we count a hit; if not, a miss. This process is repeated for each item in the testing set. The final hit rate is computed as H(N) = #hits / |T|.

ILD. We also compare the diversity degrees of both recommendation lists using intra-list diversity.

5.4 Offline Evaluation Results
Our offline evaluation results show that, for both N and LS (with K fixed at 100), the hit rate of both lists increases as N and LS increase, and the hit rate of our re-ranked list is always higher than that of the initial list. The diversity degrees also increase as these two parameters increase (the ILD of our re-ranked list is always higher). For parameter K, results show that both hit rate and ILD drop as K increases. For the personality-related parameter λ, we find that both the hit rate and the ILD increase as λ increases.

After evaluating the influence of these parameters separately, we made a final comparison of the two lists. Results are shown in Table 5: our re-ranked list outperforms the initial list in both hit rate and ILD.

5.5 Online Evaluation
Offline evaluation metrics cannot always reflect actual user satisfaction with recommendations in real life. To further evaluate whether our personality-based diversification algorithm can really enhance user satisfaction and users' perception of list diversity, we therefore conducted the following online evaluation. As in our pilot study (Section 5.1), we constructed a website⁴ for the evaluation.

5.5.1 Materials. Two materials are needed from the users beforehand: the personality profile and the user interests.

Personality Profile. We again adopted the Big-Five Factor Model as the basic personality model in our system, and the Ten Item Personality Inventory (TIPI) is used to extract the five personality factors from users.

User Interests & Recommendation. To generate the initial recommendation list, we request users to offer their music interests in advance. In our online evaluation, we used the Spotify Recommendation System, based on its open Web APIs⁵, in order to provide real-time recommendations. User interests are represented as ‘seed information’ in Spotify recommendations, of which three kinds are used: artists, tracks, and genres. Spotify restricts the total number of input seeds to at most 5. To ensure that the originally generated recommendation list (which has 100 tracks) is already diverse enough, we use at least 1 artist seed, 1 track seed, and 1 genre seed for every recommendation.

Figure 2: Example of the two recommendation lists shown to users. One is the initial list and the other is the re-ranked list (in random order). Users can click a button to preview each track, and must choose whether they like the track or not. The first two tracks are shown in this figure; in total, there are ten tracks in each list.

5.5.2 Independent Variables. After obtaining the two materials from users, we generate the recommendations. As in the offline evaluation, we generate two recommendation lists (the initial list and the re-ranked list) for each user; each list contains 10 tracks. We adopted a within-subjects experimental design in which the two recommendation lists are displayed to the user at the same time (see Figure 2); thus, the independent variable is the recommendation list. The order of presentation was balanced between participants.

5.5.3 Dependent Variables.
Precision@10. To directly measure the precision of the recommendations, we ask users to rate each track as ‘Like’ or ‘Dislike’. Tracks rated as ‘Like’ are considered relevant items. Precision@10 for each list is computed as the proportion of relevant items in the whole list.

Diversity. For both lists, we also use ILD (Equation 1) to compute the diversity degrees.

User Feedback. In addition to calculating precision and ILD for each recommendation list, we also ask users for feedback on the two lists via a post-task questionnaire. Each user expresses their opinion on both lists in terms of the following three main aspects:
• Recommendation Quality (Q1 & Q2): “The items in List A/B recommended to me matched my interests.”
• Recommendation Diversity (Q4 & Q5): “The items in List A/B recommended to me are diverse.”

⁵ Spotify Recommendation: https://developer.spotify.com/documentation/web-api/
4 Available at https://music-rs-personality-online.herokuapp.com
reference/browse/get-recommendations/
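As a minimal sketch of the diversity metric, ILD can be computed as the mean pairwise distance over all item pairs in a list. Equation 1 itself is not reproduced in this excerpt, so this sketch assumes cosine distance over track feature vectors; the function name `ild` is illustrative.

```python
from itertools import combinations

def ild(feature_vectors):
    """Intra-list diversity: mean pairwise distance between all item
    pairs in a recommendation list (cosine distance assumed here)."""
    def cosine_distance(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = sum(x * x for x in a) ** 0.5
        norm_b = sum(x * x for x in b) ** 0.5
        return 1.0 - dot / (norm_a * norm_b)

    pairs = list(combinations(feature_vectors, 2))
    return sum(cosine_distance(a, b) for a, b in pairs) / len(pairs)
```

A list of identical tracks yields an ILD of 0, while mutually orthogonal feature vectors yield the maximum value of 1.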
IntRS Workshop, October 2018, Vancouver, Canada Feng Lu and Nava Tintarev
• User Satisfaction (Q7 & Q8): “Overall, I am satisfied with the Recommendation List A/B.”
All of these questions are adapted from the ResQue User-Centric Evaluation Framework [22] and are answered on a 5-point Likert scale, from 1 ("Disagree strongly") to 5 ("Agree strongly"). We then compute and compare the average ratings for each question on both lists. Considering that users may give the same ratings for both lists, we added two more sub-questions regarding Recommendation Quality and Recommendation Diversity:
• Recommendation Quality (Q3): “Which Recommendation List is more interesting to you (matches more of your interests)?”
• Recommendation Diversity (Q6): “Which Recommendation List is more diverse to you?”
These two questions are rated with categorical answers: “List A”, “List B”, or “Hard to tell”.

Table 6: Demographic profiles of the 25 participants in the online evaluation.

Gender     Male (13); Female (8); Prefer Not to Answer (4)
Age        21-30 (25)
Education  College (4); Graduate School (21)

Table 7: Precision@10 and ILD@10 for the two lists. Pairwise t-tests significant at p < 0.05.

              Initial List L1    Re-ranked List L2
Precision@10  0.58 (std: 0.15)   0.668 (std: 0.14)
ILD@10        0.48 (std: 0.06)   0.57 (std: 0.07)
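To illustrate how the per-participant entries behind Table 7 are obtained, the sketch below computes Precision@10 from the binary feedback; encoding 'Like' as 1 and 'Dislike' as 0 is an assumption of this sketch, not specified in the paper.

```python
def precision_at_10(likes):
    """Precision@10: proportion of relevant ('Like' = 1) items
    among the 10 recommended tracks in one list."""
    if len(likes) != 10:
        raise ValueError("expected ratings for exactly 10 tracks")
    return sum(likes) / 10
```

The table values are then the mean and standard deviation of this quantity over the 25 participants.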
5.5.4 Procedure Design. Similar to our pilot study, the website includes four main parts: user basic information, personality test, recommendation and feedback, and user comment. The user basic information, personality test, and final user comment parts are the same as in the pilot study. For the recommendation and feedback part, we provide two channels to obtain users' original interests: a) utilize their Spotify history; or b) type them in manually. If users choose to use their Spotify listening history, we use two of their top-played artists, two of their top-played tracks, and their top-played genre for generating the recommendations. Users can alternatively choose to type in their interests manually; in this case, we request users to type in at least one artist seed, one track seed, and one genre seed. After we obtain users' music preferences, we feed these seeds into the Spotify recommendation system to generate the initial recommendation list (100 tracks). The first list L1 is constructed by directly taking the top-10 items from the initial list. The second list L2 is generated by our personality-based diversification algorithm; we select the top-50 tracks as the input list for re-ranking.

Figure 3: Full comparison for Recommendation Quality (Accuracy), Diversity, and User Satisfaction. Student's t-test is also used; p < 0.05.
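The seed-building and list-construction steps above can be sketched as follows. The query parameters (`seed_artists`, `seed_tracks`, `seed_genres`, `limit`) are those documented for Spotify's recommendations endpoint, but the function names and the `rerank` callable are hypothetical stand-ins; the personality-based re-ranking itself is not reproduced here.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://api.spotify.com/v1/recommendations"

def fetch_recommendations(token, artist_ids, track_ids, genres, limit=100):
    """Call Spotify's recommendations endpoint with up to 5 seeds
    (assumes a valid OAuth bearer token)."""
    query = urllib.parse.urlencode({
        "seed_artists": ",".join(artist_ids),   # e.g. 2 top-played artists
        "seed_tracks": ",".join(track_ids),     # e.g. 2 top-played tracks
        "seed_genres": ",".join(genres),        # e.g. 1 top-played genre
        "limit": limit,
    })
    req = urllib.request.Request(f"{API_URL}?{query}",
                                 headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["tracks"]

def build_lists(tracks, rerank):
    """L1 = top-10 of the initial 100-track list; L2 = top-10 after
    re-ranking the top-50 with the diversification algorithm."""
    l1 = tracks[:10]
    l2 = rerank(tracks[:50])[:10]
    return l1, l2
```

In the study, both lists are then shown side by side (as List A and List B) in randomized order.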
To minimize any carryover effects, we show these two lists to users in random order (displayed as List A and List B). For each track, users can click on the play button to listen to a 30-second preview; the track name and the corresponding artist name are also shown in the list. Users need to rate each track in both lists as 'Like' or 'Dislike'. After rating all 20 tracks, users are asked to fill in the feedback questionnaire (see Section 5.5.3).

6 ONLINE EVALUATION RESULTS
To evaluate users' actual satisfaction with our personality-based diversification method, we conducted this online evaluation.

6.1 Participants
We conducted our online evaluation with 25 participants recruited at a university. Participants' ages ranged from 21 to 30 years old. Table 6 summarizes their demographics.

6.2 Feedback Questions
Figure 3 shows the comparison of the two lists on three aspects. We used a paired t-test for the questions on a 5-point Likert scale (Q1, Q2, Q4, Q5, Q7, and Q8), and we applied a Chi-Squared test for the questions with categorical answers (Q3 and Q6).

Recommendation Quality. For recommendation quality (Q1 and Q2), the average ratings for the two lists are 3.4 (initial list, std=0.98) and 4.12 (re-ranked list, std=0.65) (t=-3.00, p=0.004). Q3 further compares the recommendation quality of the two lists with categorical answers. Results show that 8.0% of users think the initial list better matches their interests, 52.0% think the re-ranked list is better, and the other 40.0% think it is hard to tell (Chi-Squared test, statistic=7.76, p < 0.05).

Recommendation Diversity. Table 7 shows the Precision@10 and ILD@10 results for both lists. For perceived recommendation diversity (Q4 & Q5), the average ratings for the two lists are 3.28 (initial list, std=0.96) and 3.92 (re-ranked list, std=0.89) (t=-2.39, p=0.02). Q6 further compares the recommendation diversity of the two lists with categorical answers. Results show that 16.0% of users think the initial list is more diverse, 48.0% think the re-ranked list is more diverse, and the other 36.0% think it is hard to tell (Chi-Squared test, statistic=3.92, p=0.14).

User Satisfaction. For user satisfaction (Q7 & Q8), the average ratings for the two lists are 3.36 (initial list, std=0.93) and 3.92 (re-ranked list, std=0.97) (t=-2.03, p < 0.05).
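The two significance tests above can be reproduced with short plain-Python helpers (a library such as scipy.stats offers `ttest_rel` and `chisquare` equivalents). The chi-squared helper assumes a uniform expected distribution over the three answer options; with 25 participants, counts of 4, 12, and 9 for Q6 reproduce the reported statistic of 3.92.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t_statistic(xs, ys):
    """Paired t-test statistic over per-user ratings of the two lists:
    t = mean(d) / (sd(d) / sqrt(n)), where d are within-user differences."""
    diffs = [x - y for x, y in zip(xs, ys)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

def chi_square_statistic(observed):
    """Chi-squared goodness-of-fit statistic against a uniform expectation,
    as used for the categorical answers ('List A' / 'List B' / 'Hard to tell')."""
    expected = sum(observed) / len(observed)
    return sum((o - expected) ** 2 / expected for o in observed)
```

For example, `chi_square_statistic([4, 12, 9])` returns 3.92, matching the Q6 result above.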
7 DISCUSSION AND LIMITATION
From the online evaluation results, we see that our re-ranked recommendation list outperforms the initial recommendation list in all three aspects (recommendation quality, diversity, and user satisfaction). For the two categorical questions Q3 and Q6, the results for Q3 are in line with the results shown in Figure 3. For Q6, however, the p-value of the Chi-Squared test is larger than 0.05, which means that there is no significant difference when we asked users which list is more diverse to them. The reason behind this may lie in our limited sample size. The precision of the two lists shows only a small difference (around one relevant track). Considering that our algorithm raised the diversity level of the recommendations at the same time, we can still say that the re-ranked list is better from the users' perspective and that our personality-based diversification algorithm enhances the diversity adjusting strategy in music recommendations.
One limitation of our research lies in the limited sample size in both the pilot study and the online evaluation. If more participants had been recruited in our pilot study, the correlation between personality factors and diversity needs might have been stronger. Similarly, including more users in our online evaluation might also yield stronger results. We suggest that future researchers replicate our study with more participants. Another limitation is that we did not include more features (e.g., more audio features such as loudness) in our pilot study.

8 CONCLUSION
In this paper, we proposed a solution to address the research gap between diversity-based recommender systems and personality-based recommender systems. We proposed an algorithm to adaptively adjust the diversity degrees in music recommendations for users with different personalities. The adjustment was based on a pilot user study which explored the relationship between users' personality factors and their diversity needs in music preferences. To assess the effectiveness of our algorithm, we conducted both offline and online evaluations. Results suggest that our diversification method not only increases the diversity degrees of recommendations, but also gains more user satisfaction.
In future work, more (audio) features and a larger participant pool will be studied. Instead of using an explicit personality test, we also plan to try implicit personality extraction methods (e.g., via social media). Moreover, besides the re-ranking algorithm, we plan to try different diversification strategies (e.g., optimization-based diversification) with personality to check whether they would yield better results.

REFERENCES
[1] Immanuel Bayer. 2016. fastFM: A Library for Factorization Machines. Journal of Machine Learning Research 17, 184 (2016), 1–5.
[2] Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. 2011. The Million Song Dataset. In Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR 2011).
[3] Keith Bradley and Barry Smyth. 2001. Improving recommendation diversity. In Proceedings of the Twelfth Irish Conference on Artificial Intelligence and Cognitive Science, Maynooth, Ireland. 85–94.
[4] Jaime Carbonell and Jade Goldstein. 1998. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 335–336.
[5] Òscar Celma Herrada. 2009. Music recommendation and discovery in the long tail. (2009).
[6] Li Chen, Wen Wu, and Liang He. 2013. How personality influences users' needs for recommendation diversity?. In CHI'13 Extended Abstracts on Human Factors in Computing Systems. ACM, 829–834.
[7] Li Chen, Wen Wu, and Liang He. 2016. Personality and Recommendation Diversity. In Emotions and Personality in Personalized Services. Springer, 201–225.
[8] Paolo Cremonesi, Yehuda Koren, and Roberto Turrin. 2010. Performance of recommender algorithms on top-n recommendation tasks. In Proceedings of the fourth ACM conference on Recommender systems. ACM, 39–46.
[9] Tommaso Di Noia, Vito Claudio Ostuni, Jessica Rosati, and Paolo Tomeo. 2014. An analysis of users' propensity toward diversity in recommendations. In Proceedings of the 8th ACM Conference on Recommender systems. ACM, 285–288.
[10] Daniel M Fleder and Kartik Hosanagar. 2007. Recommender systems and their impact on sales diversity. In Proceedings of the 8th ACM conference on Electronic commerce. ACM, 192–199.
[11] Ishan Ghanmode and Nava Tintarev. 2018. MovieTweeters: An Interactive Interface to Improve Recommendation Novelty. In IntRS@RecSys.
[12] Samuel D Gosling, Peter J Rentfrow, and William B Swann. 2003. A very brief measure of the Big-Five personality domains. Journal of Research in Personality 37, 6 (2003), 504–528.
[13] Rong Hu and Pearl Pu. 2010. A study on user perception of personality-based recommender systems. User Modeling, Adaptation, and Personalization (2010), 291–302.
[14] Rong Hu and Pearl Pu. 2010. Using personality information in collaborative filtering for new users. Recommender Systems and the Social Web 17 (2010).
[15] Rong Hu and Pearl Pu. 2011. Enhancing collaborative filtering systems with personality information. In Proceedings of the fifth ACM conference on Recommender systems. ACM, 197–204.
[16] Jianmin Jia, Gregory W Fischer, and James S Dyer. 1998. Attribute weighting methods and decision quality in the presence of response error: a simulation study. Journal of Behavioral Decision Making 11, 2 (1998), 85–105.
[17] Oliver P John and Sanjay Srivastava. 1999. The Big Five trait taxonomy: History, measurement, and theoretical perspectives. Handbook of personality: Theory and research 2, 1999 (1999), 102–138.
[18] Jayachithra Kumar and Nava Tintarev. 2018. Using visualizations to encourage blind-spots exploration. In IntRS@RecSys.
[19] Robert R McCrae and Oliver P John. 1992. An introduction to the five-factor model and its applications. Journal of Personality 60, 2 (1992), 175–215.
[20] Sean M McNee, John Riedl, and Joseph A Konstan. 2006. Being accurate is not enough: how accuracy metrics have hurt recommender systems. In CHI'06 extended abstracts on Human factors in computing systems. ACM, 1097–1101.
[21] Eli Pariser. 2011. The filter bubble: What the Internet is hiding from you. Penguin UK.
[22] Pearl Pu, Li Chen, and Rong Hu. 2011. A user-centric evaluation framework for recommender systems. In Proceedings of the fifth ACM conference on Recommender systems. ACM, 157–164.
[23] Steffen Rendle. 2010. Factorization machines. In Data Mining (ICDM), 2010 IEEE 10th International Conference on. IEEE, 995–1000.
[24] Peter J Rentfrow and Samuel D Gosling. 2003. The do re mi's of everyday life: the structure and personality correlates of music preferences. Journal of Personality and Social Psychology 84, 6 (2003), 1236.
[25] Barry Smyth and Paul McClave. 2001. Similarity vs. Diversity. In Proceedings of the 4th International Conference on Case-Based Reasoning: Case-Based Reasoning Research and Development (ICCBR '01). Springer-Verlag, London, UK, 347–361.
[26] Nava Tintarev and Judith Masthoff. 2013. Adapting recommendation diversity to openness to experience: A study of human behaviour. In International Conference on User Modeling, Adaptation, and Personalization. Springer, 190–202.
[27] Marko Tkalcic and Li Chen. 2015. Personality and Recommender Systems. Recommender Systems Handbook (Jan. 2015).
[28] Marko Tkalcic, Matevz Kunaver, Andrej Košir, and Jurij Tasic. 2011. Addressing the new user problem with a personality based user similarity measure. In First International Workshop on Decision Making and Recommendation Acceptance Issues in Recommender Systems (DEMRA 2011). 106.
[29] Marko Tkalcic, Matevz Kunaver, Jurij Tasic, and Andrej Košir. 2009. Personality based user similarity measure for a collaborative recommender system. In Proceedings of the 5th Workshop on Emotion in Human-Computer Interaction - Real world challenges. 30–37.
[30] Saúl Vargas. 2011. New approaches to diversity and novelty in recommender systems. In Fourth BCS-IRSG symposium on future directions in information access (FDIA 2011), Koblenz, Vol. 31.
[31] Saúl Vargas and Pablo Castells. 2013. Exploiting the diversity of user preferences for recommendation. In Proceedings of the 10th conference on open research areas in information retrieval. 129–136.
[32] Wen Wu, Li Chen, and Liang He. 2013. Using personality to adjust diversity in recommender systems. In Proceedings of the 24th ACM Conference on Hypertext and Social Media. ACM, 225–229.
[33] Cai-Nicolas Ziegler, Sean M McNee, Joseph A Konstan, and Georg Lausen. 2005. Improving recommendation lists through topic diversification. In Proceedings of the 14th international conference on World Wide Web. ACM, 22–32.